What is a Data Catalog?
A Data Catalog provides a single self-service environment that helps users find, understand, and trust data sources. It also helps users discover new data sources when they exist. Discovering and understanding data sources are the first steps toward registering them. Users search for data based on their needs and filter the results to find the most appropriate ones. In enterprises, a Data Lake serves Business Intelligence teams, Data Scientists, and ETL Developers, all of whom need the right data. They use catalog discovery to find the data that fits their needs.
Key capabilities of a successful Data Catalog
- Search and discovery
- Data Intelligence
- Manage business glossary
- Empowering Confidence in data
How Does a Data Catalog Work?
Building a Catalog starts with collecting the metadata of the sources. Once the metadata is obtained, the metadata entities need to be categorized and assigned tags. ML (Machine Learning) and NLP (Natural Language Processing) are used to automate these processes: a Machine Learning model automatically assigns a tag to each metadata entity based on the name of the data entity. Finally, the data steward reviews the results and adds more value to the Data Catalog.
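The tagging step can be illustrated with a minimal sketch. It assumes a hand-written keyword table stands in for the Machine Learning model, so it shows the shape of the workflow rather than a real model. The tag names, the keywords, and the helpers `tokenize`, `suggest_tags`, and `review` are all hypothetical.

```python
import re

# Hypothetical keyword-to-tag rules. A real catalog would learn these
# associations with an ML/NLP model instead of hard-coding them.
TAG_RULES = {
    "pii": {"email", "phone", "ssn", "name"},
    "finance": {"invoice", "price", "amount", "payment"},
    "time": {"date", "timestamp", "created", "updated"},
}


def tokenize(entity_name):
    """Split names like 'customerEmail' or 'cust_email' into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", entity_name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def suggest_tags(entity_name):
    """Return the sorted tags whose keywords appear in the entity name."""
    tokens = set(tokenize(entity_name))
    return sorted(tag for tag, keywords in TAG_RULES.items() if tokens & keywords)


def review(suggested, steward_additions=()):
    """Merge the automatic suggestions with tags added by the data steward."""
    return sorted(set(suggested) | set(steward_additions))
```

For example, `suggest_tags("customer_email")` proposes the `pii` tag, and the steward can then extend it with `review(["pii"], ["gdpr"])`.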
Data Catalog Benefits
Spend more time using the data, not finding it – As per the Forrester Forbes report, data scientists spend more than 75% of their time understanding and finding data, and more than 75% of them do not like that part of their job. This is due to the questions they need answered before they can work on their queries. The main cause of this problem in an organization is a poor mechanism for handling and tracking all the data. A good Catalog helps the Data Scientist or Business Analyst understand the data and answer those questions.
To implement Access Control – As an organization grows, role-based policies are needed, because not everybody should be able to modify the data. Access Control should be implemented while building the Data Lake: particular roles are assigned to users, and data access is controlled according to those roles. In the Hadoop ecosystem, this can be implemented using Apache Ranger. For sensitive data in the Data Lake, use encryption for data protection.
To Reduce Cost by Eliminating Data Redundancies – A good Catalog helps us find data redundancies and eliminate them. This saves storage and data management costs.
To follow Laws – There are different data protection laws to follow depending on the data, such as GDPR, BASEL, GDSN, HIPAA, and many more. These laws must be followed when dealing with any data they cover. However, they address different use cases and do not apply to every data set; to know which ones apply, we need to understand the data set. A good Catalog helps us ensure that data compliance is followed by giving a view of Data Lineage and by supporting Access Control.
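Of the benefits above, redundancy detection is the easiest to sketch in code. The example below assumes each data set is small enough to hash in memory and that "redundant" means an exact copy with the same columns and rows in any order. Near-duplicates would need a different technique. The function names and the shape of the `datasets` mapping are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(columns, rows):
    """Hash the column names plus the rows in an order-independent way."""
    digest = hashlib.sha256()
    digest.update(repr(tuple(columns)).encode("utf-8"))
    # Sorting the row representations makes the hash ignore row order.
    for row in sorted(repr(tuple(r)) for r in rows):
        digest.update(row.encode("utf-8"))
    return digest.hexdigest()


def find_redundant(datasets):
    """Group data set names that share a fingerprint; keep groups of two or more."""
    groups = defaultdict(list)
    for name, (columns, rows) in datasets.items():
        groups[fingerprint(columns, rows)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

Here `datasets` maps a data set name to a `(columns, rows)` pair. Each group returned is a candidate for consolidation, which a data steward would still confirm before anything is deleted.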
Why Does a Data Catalog Matter?
Helps in Understanding the data – A good Catalog helps users understand the data. It makes it easier to find the relevant data and get to know it, and it provides information about the data, such as where it is used and where it originates.
Allows users to work with multiple data sources – A Catalog covers one or more data sources. It helps users find quality data sources and gain better knowledge of multiple data sources.
To Follow Regulatory compliance – There are several data-related laws, such as HIPAA, BASEL, and GDPR. These laws are driven by different perspectives and use cases, but in the end they all come down to better governance of data, with a focus on Data Lineage and Access Control.
HIPAA Stands for Health Insurance Portability and Accountability Act. It is specially designed for the protection of patient information.
From the Article, What is HIPAA Compliance? | HIPAA Requirements
How to Adopt a Data Catalog?
Building a Catalog is a multi-step process which includes –
Metadata Extraction – Metadata extraction is the very first step of building the catalog. In this step, the metadata of a defined source is collected and stored in the metadata store. It helps in understanding the defined data asset.
Data Sampling – Data sampling is used to understand the schemas, tables, and databases.
Auto-Titling (ML and NLP) – Every organization has a naming convention that uses abbreviations to define the schema. An NLP model assigns each abbreviation a common name that the users of the Catalog understand.
Query Log Ingestion – Query logs are ingested to collect additional information about the data sets and give a complete picture of each one, such as Data Lineage and Data Usability.
Crowd Sourcing & Expert Sourcing – By this stage the Catalog is ready, and what remains is to add more value to it. The NLP model has corrected the names of the data assets collected from the data sources, but Computer-Human Collaboration is also necessary to verify the results.
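The automated steps above can be sketched end to end against a small SQLite database using Python's standard `sqlite3` module. This is an illustration under stated assumptions, not how any particular catalog product works: the source is SQLite, a fixed abbreviation glossary stands in for the NLP model, and query logs are plain SQL strings. The `GLOSSARY` entries and all function names are hypothetical.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical abbreviation glossary; the article assigns this job to an NLP model.
GLOSSARY = {"cust": "Customer", "ord": "Order", "amt": "Amount", "dt": "Date"}


def extract_metadata(conn):
    """Metadata extraction: collect table and column names from the source."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


def sample_rows(conn, table, limit=5):
    """Data sampling: read a few rows to see what the table holds."""
    return conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()


def auto_title(name):
    """Auto-titling: expand known abbreviations into a readable title."""
    parts = name.lower().split("_")
    return " ".join(GLOSSARY.get(part, part.capitalize()) for part in parts)


def table_usage(query_log, tables):
    """Query log ingestion: count how often each known table is read."""
    usage = Counter()
    for query in query_log:
        refs = re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", query, flags=re.IGNORECASE)
        for ref in refs:
            if ref in tables:
                usage[ref] += 1
    return usage
```

Given a connection from `sqlite3.connect(...)`, `extract_metadata` yields a table-to-columns mapping that the other helpers build on. The crowd-sourcing step has no code here because it is a human review of these outputs. The regular expression in `table_usage` only catches simple `FROM` and `JOIN` references, so a production system would use a real SQL parser.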
Data Catalog Best Practices
Assigning Ownership for the data set – Ownership of each data set must be defined. There must be a person whom users can contact in case of an issue. A good Catalog must also identify the owner of each data set.
Machine-Human Collaboration – After building a Catalog, the data sets should be verified by the users to make it more accurate.
Searchability – The Catalog should support searchability. Searchability enables Data Asset Discovery, so that data consumers can easily find assets that meet their needs.
Data Protection – Define Access policies to prevent unauthorized data access.
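Several of these practices can be combined in one small sketch: each catalog entry records an owner, and search returns only the entries the caller's role is allowed to see. It assumes an in-memory list of entries and a plain set of allowed roles per entry. It is not a model of Apache Ranger or any other policy engine, and `CatalogEntry`, `can_access`, and `search` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str  # ownership: the person to contact about this data set
    tags: set = field(default_factory=set)
    allowed_roles: set = field(default_factory=set)


def can_access(entry, role):
    """Data protection: only roles listed on the entry may see it."""
    return role in entry.allowed_roles


def search(entries, query, role):
    """Searchability: match every query term against name, description, and tags."""
    terms = query.lower().split()
    results = []
    for entry in entries:
        haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
        if can_access(entry, role) and all(term in haystack for term in terms):
            results.append(entry)
    return sorted(results, key=lambda e: e.name)
```

Filtering inside `search` rather than afterwards means a user never learns that a restricted data set exists. Whether that is the right trade-off depends on the organization, since some catalogs deliberately show restricted entries so that users can request access from the owner.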
Data Catalog Tools
- Alation Data Catalog
- Cloudera Navigator
- Informatica Data Catalog
- Collibra Data Catalog
A Collaborative Approach
Using Data Catalogs, enterprises can centralize metadata in one place, get a full picture of each piece of data across databases, and incorporate knowledge about the data’s location and statistics. To adopt this approach, we suggest taking the following steps –