Guide to Data Catalog Tools and Architecture


What is a Data Catalog?

A Data Catalog provides a single self-service environment that helps users find, understand, and trust data sources. It also helps users discover new data sources as they become available. Discovering and understanding data sources are the first steps in registering them. Users search with Data Catalog Tools based on their needs and filter for the appropriate results. In enterprises, a Data Lake serves Business Intelligence teams, Data Scientists, and ETL Developers, all of whom need the right data, and they use catalog discovery to find the data that fits their needs. Key capabilities of a Data Catalog are as follows:
  • Search and discovery
  • Data Intelligence
  • Manage business glossary
  • Empowering Confidence in data

How Does a Data Catalog Work?

Building a Catalog starts with collecting the metadata of the sources. Once the metadata is obtained, the metadata entities need to be categorized and assigned tags. ML (Machine Learning) and NLP (Natural Language Processing) are used to automate these steps: with the help of a Machine Learning model, each metadata entity is auto-assigned a tag based on the name of the data entity. Finally, the data steward reviews the results and adds more value to the Data Catalog.
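The tagging step can be illustrated with a minimal sketch. It assumes a hand-written keyword table (`TAG_RULES`, hypothetical) as a stand-in for the Machine Learning model, so it only shows the shape of the workflow: suggest tags from the entity name, then let the data steward adjust them.

```python
import re

# Hypothetical keyword-to-tag rules. This is a stand-in for the ML model
# described above; a real catalog would learn these associations.
TAG_RULES = {
    "email": "PII",
    "phone": "PII",
    "ssn": "PII",
    "price": "Finance",
    "amount": "Finance",
    "invoice": "Finance",
    "date": "Temporal",
    "time": "Temporal",
}


def tokenize(name: str) -> list[str]:
    """Split a name such as 'invoiceAmount' or 'customer_email' into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)  # break camelCase
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def suggest_tags(name: str) -> set[str]:
    """Return the tags whose keyword matches a token of the entity name."""
    return {TAG_RULES[t] for t in tokenize(name) if t in TAG_RULES}


def review(suggested, added=(), removed=()) -> set[str]:
    """Data steward step: start from the suggestions, then add or remove tags by hand."""
    return (set(suggested) | set(added)) - set(removed)


if __name__ == "__main__":
    for column in ["customer_email", "invoiceAmount", "order_id"]:
        print(column, sorted(suggest_tags(column)))
```

Entities with no matching keyword get no tag and are left for the steward to classify.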

Data Catalog Benefits

  • Spend more time using the data, not finding it - As per the Forrester Forbes report, data scientists spend more than 75% of their time understanding and finding data, and more than 75% of them don't like that part of their job. This is due to the questions they have to answer before they can start working on their queries. The main cause of this problem in an organization is a poor mechanism for handling and tracking all the data. A good Catalog helps the Data Scientist or Business Analyst understand the data and answer the questions they have.
  • To implement Access Control - As an organization grows, role-based policies are needed, because not everybody should be able to modify the data. Access Control should be implemented while building the Data Lake: particular roles are assigned to users, and data access is controlled according to those roles. In the Hadoop ecosystem, this can be implemented using Apache Ranger. For sensitive data in the Data Lake, use encryption for data protection.
  • To reduce cost by eliminating data redundancies - A good Catalog Tool helps find data redundancies and eliminate them, which saves storage and data management costs.
  • To follow laws - Different protection laws apply depending on the data, such as GDPR, BASEL, GDSN, HIPAA, and many more. These laws must be followed when dealing with any data, but they cover different use cases and do not apply to every data set; to know which ones apply, we need to know about the data set. A good Catalog helps make sure that data compliance is followed by giving a view of Data Lineage and by using Access Control.
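As an illustration of how redundant data sets might be detected, here is a minimal sketch. It assumes each data set is small enough to load as a list of row dictionaries and treats two data sets as redundant only when their contents are identical; the function names and the sample data are hypothetical, and production catalogs typically rely on profiling statistics instead of full content hashes.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows: list[dict]) -> str:
    """Hash a data set so that row order and column order do not matter."""
    canonical = sorted(repr(sorted(row.items())) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def find_redundant(datasets: dict[str, list[dict]]) -> list[list[str]]:
    """Group data set names that share the same content fingerprint."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    datasets = {
        "sales_2020": [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}],
        "sales_copy": [{"amt": 20, "id": 2}, {"id": 1, "amt": 10}],
        "customers": [{"id": 1, "name": "a"}],
    }
    print(find_redundant(datasets))
```

Each group returned is a set of candidates for consolidation; deciding which copy to keep remains a human decision.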

Why Does a Data Catalog Matter?

  • Helps in understanding the data - A good Catalog makes it easier for the user to find the relevant data and to understand it. It also gives information about the data, such as where it is used and where it is generated.
  • Allows users to work with multiple data sources - A catalog covers one or more data sources. It helps users find quality data sources and gain better knowledge about multiple data sources.
  • To follow regulatory compliance - There are several data-related laws, such as GDPR, BASEL, and HIPAA. These laws are driven by different perspectives and use cases, but in the end they all come down to better governance of data, with a focus on Data Lineage and Access Control.
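Information about where data is generated and where it is used is usually called Data Lineage. A minimal sketch of how a catalog might store and query it follows, assuming lineage is recorded as directed edges between data set names; the class, its methods, and the data set names are illustrative, not taken from any particular tool.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph of data sets: an edge source -> target means target is built from source."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start: str, edges) -> set[str]:
        """Collect every node reachable from start by following the given edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def used_by(self, name: str) -> set[str]:
        """Every data set that directly or indirectly consumes `name`."""
        return self._walk(name, self._downstream)

    def generated_from(self, name: str) -> set[str]:
        """Every data set that `name` is directly or indirectly built from."""
        return self._walk(name, self._upstream)


if __name__ == "__main__":
    graph = LineageGraph()
    graph.add_edge("raw.orders", "staging.orders")
    graph.add_edge("staging.orders", "mart.revenue")
    graph.add_edge("raw.customers", "mart.revenue")
    print(sorted(graph.used_by("raw.orders")))
    print(sorted(graph.generated_from("mart.revenue")))
```

The same traversal supports compliance questions, for example listing every downstream data set that inherits sensitive fields from a source.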

HIPAA stands for the Health Insurance Portability and Accountability Act. It is designed specifically to protect patient information.

From the Article: What is HIPAA Compliance? | HIPAA Requirements

How to Adopt a Data Catalog?

Building a Catalog is a multi-step process that includes -
  • Metadata Extraction - This is the very first step of building the catalog. The metadata of a defined source is collected and stored in the metadata store, which helps in understanding the defined data asset.
  • Data Sampling - Sample data is used to understand the schemas, tables, and databases.
  • Auto-Titling (ML and NLP) - Every organization has a naming convention that uses abbreviations to define the schema. An NLP model assigns each abbreviation a common name understood by the users of the Catalog.
  • Query Log Ingestion - It collects additional information about the data sets and gives a complete picture of each one, such as Data Lineage and Data Usability.
  • Crowd Sourcing & Expert Sourcing - At this point the Catalog is ready and only needs more value added to it. The NLP model has corrected the names of the data assets collected from the data sources, but Computer-Human Collaboration is also necessary to verify the results.
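The first three steps above can be sketched on a small scale. The example assumes the source is a SQLite database read through Python's standard `sqlite3` module, and it uses a hand-written glossary (`ABBREVIATIONS`, hypothetical) in place of the NLP model, so it illustrates the flow only and is not a production extractor.

```python
import sqlite3

# Hypothetical abbreviation glossary; stands in for the NLP model described above.
ABBREVIATIONS = {
    "cust": "Customer",
    "nm": "Name",
    "dob": "Date of Birth",
    "id": "ID",
    "amt": "Amount",
}


def extract_metadata(conn: sqlite3.Connection, sample_size: int = 3) -> dict:
    """Collect column names, types, and a few sample rows for every table."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        sample = conn.execute(
            f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
        ).fetchall()
        catalog[table] = {"columns": columns, "sample": sample}
    return catalog


def auto_title(name: str) -> str:
    """Expand an abbreviated name such as 'cust_nm' into a readable title."""
    parts = [p for p in name.split("_") if p]
    return " ".join(ABBREVIATIONS.get(p.lower(), p.capitalize()) for p in parts)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cust (cust_id INTEGER, cust_nm TEXT)")
    conn.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
    for table, meta in extract_metadata(conn).items():
        print(table, [auto_title(c["name"]) for c in meta["columns"]], meta["sample"])
```

Titles produced this way are suggestions; the crowd-sourcing and expert-sourcing step is where users confirm or correct them.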

Data Catalog Best Practices

  1. Assigning Ownership for the data set - Ownership of each data set must be defined, so that there is a person users can contact in case of an issue. A good Catalog must also state the owner of each data set.
  2. Machine-Human Collaboration - After a Catalog is built, users must verify the data sets to make them more accurate.
  3. Searchability - The Catalog should support searchability. Searchability enables Data Asset Discovery, so data consumers can easily find assets that meet their needs.
  4. Data Protection - Define access policies to prevent unauthorized data access.
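Ownership, searchability, and access policies can be pictured together in a small in-memory model. This is a sketch under stated assumptions: the `CatalogEntry` and `Catalog` classes, their fields, and the role names are hypothetical, and a real deployment would delegate access enforcement to a dedicated policy engine.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str  # person to contact in case of an issue
    description: str = ""
    tags: set = field(default_factory=set)
    allowed_roles: set = field(default_factory=set)  # roles permitted to read


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        """Add an entry; every data set must have an owner."""
        if not entry.owner:
            raise ValueError(f"data set {entry.name!r} has no owner")
        self._entries[entry.name] = entry

    def search(self, term: str) -> list:
        """Case-insensitive match against name, description, and tags."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)
        )

    def owner_of(self, name: str) -> str:
        return self._entries[name].owner

    def can_read(self, name: str, role: str) -> bool:
        """Access policy check: only listed roles may read the data set."""
        return role in self._entries[name].allowed_roles


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(CatalogEntry(
        name="mart.revenue", owner="finance-team", description="Monthly revenue",
        tags={"Finance"}, allowed_roles={"analyst"},
    ))
    print(catalog.search("revenue"), catalog.owner_of("mart.revenue"))
    print(catalog.can_read("mart.revenue", "intern"))
```

Rejecting entries without an owner at registration time is one way to enforce the first practice by construction.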

Data Catalog Tools

  • Alation Data Catalog
  • Cloudera Navigator
  • Informatica
  • Collibra

A Collaborative Approach

Using Data Catalogs, enterprises can centralize metadata in one place, get a full representation of each piece of data across databases, and incorporate knowledge about the data's location and statistics. To adopt this approach, we suggest taking the following steps -
