What is Data Lineage?
Data lineage is the process of understanding, documenting, and visualizing data as it moves from its origin to its point of consumption, including every transformation applied to a dataset along the way. It gives users a clearer picture of what happened to the data throughout its life cycle. It also lets companies trace errors, implement process changes, and carry out system migrations while saving time and resources.
One common approach combines data discovery and a Data Catalog, which captures data asset metadata, with a data mapping framework. This lets the user trace data in both directions (forward and backward) between its origin and its destination. For any specific dataset, it answers questions such as:
- Who created the data?
- What information does the data contain?
- Where is the data located?
- When was the data created?
- Why does the data exist?
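Forward and backward tracing can be pictured as walking a directed graph of data assets. The following is a minimal sketch, not any particular tool's API: the `LineageGraph` class and the asset names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: an edge source -> target means target is derived from source."""

    def __init__(self):
        self.downstream = defaultdict(set)
        self.upstream = defaultdict(set)

    def add_edge(self, source, target):
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    def _walk(self, start, edges):
        # Breadth-first traversal that collects every reachable asset.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    def trace_forward(self, asset):
        """Everything that is derived, directly or indirectly, from the asset."""
        return self._walk(asset, self.downstream)

    def trace_backward(self, asset):
        """Everything the asset was derived from, directly or indirectly."""
        return self._walk(asset, self.upstream)


# Hypothetical assets, for illustration only.
graph = LineageGraph()
graph.add_edge("crm.orders", "staging.orders_clean")
graph.add_edge("staging.orders_clean", "warehouse.sales_fact")
graph.add_edge("warehouse.sales_fact", "report.monthly_sales")

print(graph.trace_forward("crm.orders"))
print(graph.trace_backward("report.monthly_sales"))
```

Keeping both an upstream and a downstream index makes either direction a simple traversal, at the cost of storing each edge twice.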
Why is Data Lineage Important for Enterprise?
The importance of Data Lineage is listed below:
For an ETL Developer
ETL stands for Extract, Transform, and Load. An ETL job extracts data from a defined data source, applies transformations to it, and loads it into another location. Data lineage helps an ETL developer trace a bug or error within the ETL job. It also shows the effect of changes to data fields, such as columns being deleted, renamed, or added; this is called impact analysis. For complex reports, it helps identify which data source should be used in the report.
For a Data Steward
A data steward needs to know everything about the data being used in an organization. Data lineage helps the steward identify the least and most used data assets in an ETL job. It also provides transparency about who is responsible for a particular data asset.
For a Business User
Data lineage helps a business user find the reports that are based on a particular data field or column. For example, if a data source includes fields named sales and gender, the user can find the reports built on those fields. It also helps the business user check whether the data is accurate.
For a Troubleshooting Operator
When a report is wrong, lineage helps identify which processes and jobs were involved in creating it. When jobs fail, it helps find the affected target tables and fields that are used in reports.
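Impact analysis can be sketched as a schema comparison followed by a lookup of the reports that depend on the changed columns. This is a simplified illustration under stated assumptions: the function names, column names, and report names are hypothetical, and a rename is not detected as such; it shows up as one removed column plus one added column.

```python
def diff_schema(old_columns, new_columns):
    """Compare two column lists and report what was added or removed.

    A renamed column appears as one removal plus one addition, because
    names alone are not enough to tell a rename from a delete-and-add.
    """
    old, new = set(old_columns), set(new_columns)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def impacted_reports(removed_columns, report_columns):
    """Return the reports that read at least one of the removed columns."""
    removed = set(removed_columns)
    return sorted(
        report for report, columns in report_columns.items()
        if removed & set(columns)
    )


# Hypothetical schemas and report dependencies, for illustration only.
changes = diff_schema(
    ["id", "sales", "gender"],
    ["id", "sales", "customer_gender"],
)
reports = impacted_reports(
    changes["removed"],
    {
        "sales_by_gender": ["sales", "gender"],
        "total_sales": ["sales"],
    },
)
print(changes)
print(reports)
```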
How do Data Lineage tools work?
Data lineage tools keep track of data throughout its lifecycle, including source information and any data transformations used during ETL or ELT procedures.
Metadata enables data lineage tool users to completely comprehend how data travels across the data pipeline. Metadata is "data about data," and it contains information on data assets such as type, format, structure, author, date generated, date edited, and file size. Data lineage tools present a complete picture of the metadata to assist users in determining how beneficial the data will be to them.
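As a small illustration of "data about data", the sketch below collects a few of the attributes mentioned above (format, structure, date edited, file size) for a CSV file. It uses only the Python standard library; the `AssetMetadata` record and the `describe_csv` helper are hypothetical names, and fields such as author are left out because a plain file does not carry them reliably.

```python
import csv
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class AssetMetadata:
    name: str
    file_format: str
    columns: list
    size_bytes: int
    modified_at: datetime


def describe_csv(path):
    """Collect basic metadata for a CSV file without loading all of its rows."""
    path = Path(path)
    with path.open(newline="") as handle:
        # The header row stands in for the "structure" of the asset.
        header = next(csv.reader(handle), [])
    stat = path.stat()
    return AssetMetadata(
        name=path.name,
        file_format=path.suffix.lstrip(".") or "unknown",
        columns=header,
        size_bytes=stat.st_size,
        modified_at=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
    )
```

Calling `describe_csv("orders.csv")` on an existing file returns one record that a catalog or lineage tool could store alongside the asset.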
What are the 5 W’s of Data Lineage?
The 5 W’s of Data Lineage are described below:
Who is using the Data?
While analyzing data, many questions come to a data analyst's mind. One of them is who is using the data, and where. A visual representation of the data lineage makes these questions easy to answer: we can follow the lineage to see who is using the data.
When is the data created/updated?
Some parameters, such as ownership, need to be defined at the time of data creation. The data owner is responsible for storing the data in the appropriate location and for granting access to it. Knowing the owner is important because it clarifies who maintains the data and whom a user should contact if the data needs correction.
What information does it contain?
Before defining access policies for the data, we need to understand what information it contains. This helps in classifying the data, so that we know which policies to define to protect sensitive data.
How is it being used?
An organization uses its data to create reports, and those reports inform decisions about the organization's growth. The reports draw on several datasets generated within the organization. The data lineage diagram shows which datasets are used, so if a report is wrong we can trace the source of the error.
Why is it stored/used?
One more important question concerns the existence of the data: why does it exist? If data is not needed, it should be deleted, because keeping data that is no longer required costs unnecessary time and money. We should therefore know the purpose of every dataset that is stored.
What are the best practices?
While building a data lineage system, we need to keep track of every process in the system that transforms or processes data. Data elements must be mapped at every stage an asset passes through, which means tracking the tables, views, columns, and reports in databases as well as the ETL jobs. To capture lineage, we collect metadata after each data transformation and store it in a metadata store, which can then be used for lineage representation.
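Collecting metadata after each transformation can be sketched as a thin wrapper around every pipeline step. The example below is a minimal, in-memory illustration: `MetadataStore`, `run_step`, the two transformations, and the dataset names are all hypothetical, and a real system would persist the events rather than keep them in a list.

```python
from datetime import datetime, timezone


class MetadataStore:
    """In-memory stand-in for a metadata store (hypothetical, for illustration)."""

    def __init__(self):
        self.events = []

    def record(self, step, source, target, columns, row_count):
        self.events.append({
            "step": step,
            "source": source,
            "target": target,
            "columns": columns,
            "row_count": row_count,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })


def run_step(store, step, transform, rows, source, target):
    """Apply one transformation, then capture metadata about its output."""
    result = transform(rows)
    columns = sorted(result[0]) if result else []
    store.record(step, source, target, columns, len(result))
    return result


def drop_missing_amount(rows):
    return [row for row in rows if row.get("amount") is not None]


def add_tax(rows):
    # The 20% rate is an arbitrary value for the example.
    return [{**row, "amount_with_tax": round(row["amount"] * 1.2, 2)} for row in rows]


store = MetadataStore()
raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
clean = run_step(store, "drop_missing_amount", drop_missing_amount,
                 raw, "raw.orders", "staging.orders")
final = run_step(store, "add_tax", add_tax,
                 clean, "staging.orders", "warehouse.orders")

for event in store.events:
    print(event["step"], event["source"], "->", event["target"], event["columns"])
```

Because each event names its source and target, the stored events are enough to rebuild the lineage graph later.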
Data Ingestion Lineage
Data Ingestion lineage can be used to track the complete data flow within the Data Ingestion Job. It can also be used for tracing any bug/error within the Data Ingestion job.
Below we discuss data lineage for Apache NiFi using Apache Atlas.
Apache NiFi is a UI-based platform in which we define a source to collect data from, processors to convert the data, and a destination to store it. Apache Atlas is a governance and metadata framework for Hadoop that can be used to capture this lineage. Apache NiFi also provides a reporting task, ReportLineageToAtlas, that can be configured to push metadata about the data flow to Apache Atlas.
Data Processing Lineage
Apache Spark is widely used for distributed data processing. In Spark's core API, lineage is expressed through RDDs: each RDD keeps a reference to its parent RDDs. Consider a simple job:
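- First RDD: created when we read a text file.
- Second RDD: created when we apply a map operation to the first RDD.
- Third RDD: created when we apply a filter operation to the second RDD.
- Result: produced when we apply a count operation to the third RDD. count is an action, so it returns a number rather than a new RDD, and it triggers evaluation of the chain above.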
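The parent-pointer idea behind this job can be shown with a toy model in plain Python. This is not the Spark API: `ToyRDD` is a hypothetical class written only to illustrate how each dataset remembers its parent, how transformations stay lazy, and how an action such as count walks the chain and returns a plain number.

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores how to compute itself and who its parent is."""

    def __init__(self, name, compute, parent=None):
        self.name = name
        self._compute = compute
        self.parent = parent

    def map(self, func):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD("map", lambda: [func(x) for x in self.collect()], parent=self)

    def filter(self, predicate):
        # Transformation: also lazy, also keeps a pointer to its parent.
        return ToyRDD("filter", lambda: [x for x in self.collect() if predicate(x)], parent=self)

    def collect(self):
        # Action: evaluates this dataset, which recursively evaluates its parents.
        return self._compute()

    def count(self):
        # Action: returns a number, not another ToyRDD.
        return len(self.collect())

    def lineage(self):
        """Follow the parent pointers from this dataset back to the source."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# Made-up log lines stand in for the contents of a text file.
lines = ToyRDD("textFile", lambda: ["error: disk full", "info: ok", "error: timeout"])
upper = lines.map(str.upper)
errors = upper.filter(lambda line: line.startswith("ERROR"))

print(errors.lineage())
print(errors.count())
```

In Spark itself, `RDD.toDebugString()` prints the lineage that the engine has recorded for an RDD.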
Query History Lineage
When users query a data warehouse, they repeatedly apply filters, join tables, and so on. Query lineage lets data engineers see which filters and joins are used most frequently, so they can optimize partitioning keys, denormalize tables, or apply other optimizations accordingly. Example: Uber Query Parser
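Finding the most frequent filters and joins can be sketched by scanning a query log and counting what appears in JOIN and WHERE clauses. The example below is a deliberately naive illustration: it uses regular expressions rather than a real SQL parser, so it only handles simple queries, and the table names, column names, and queries are made up.

```python
import re
from collections import Counter

# Naive patterns; a production system would use a proper SQL parser instead.
JOIN_RE = re.compile(r"\bJOIN\s+([\w.]+)", re.IGNORECASE)
WHERE_RE = re.compile(
    r"\bWHERE\b(.*?)(?:\bGROUP\s+BY\b|\bORDER\s+BY\b|\bLIMIT\b|$)",
    re.IGNORECASE | re.DOTALL,
)
FILTER_COL_RE = re.compile(
    r"([\w.]+)\s*(?:=|<>|!=|<=|>=|<|>|\bIN\b|\bLIKE\b|\bBETWEEN\b)",
    re.IGNORECASE,
)


def summarize_queries(queries):
    """Count joined tables and filtered columns across a list of SQL strings."""
    joins, filters = Counter(), Counter()
    for sql in queries:
        joins.update(JOIN_RE.findall(sql))
        where = WHERE_RE.search(sql)
        if where:
            filters.update(FILTER_COL_RE.findall(where.group(1)))
    return joins, filters


# Hypothetical query history, for illustration only.
history = [
    "SELECT * FROM trips t JOIN cities c ON t.city_id = c.id WHERE c.name = 'Paris'",
    "SELECT count(*) FROM trips WHERE city_id = 7 AND status = 'completed'",
    "SELECT * FROM trips JOIN drivers ON trips.driver_id = drivers.id "
    "WHERE status = 'completed' ORDER BY fare",
]
joins, filters = summarize_queries(history)
print(joins.most_common())
print(filters.most_common())
```

A column that dominates the filter counts is a candidate partitioning key, and a table that dominates the join counts is a candidate for denormalization.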
Data Lake and Warehouse Access Lineage
When proper data governance, such as RBAC and row-level and column-level permissions, is applied to a data lake or data warehouse, query lineage together with metastore logs can help reveal whether a user is trying to access data they are not authorized to see, so that the administration team can take action. Examples: Apache Atlas, Cloudera Navigator.
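The check described here can be sketched as a comparison between what each query touched and what the user's role is allowed to read. The sketch below assumes the query log has already been reduced to user, role, table, and column entries; the grant structure, role names, and log records are hypothetical and do not reflect the format used by Apache Atlas or Cloudera Navigator.

```python
# Hypothetical column-level grants: role -> table -> allowed columns.
GRANTS = {
    "analyst": {"sales.orders": {"order_id", "amount", "order_date"}},
    "support": {"sales.customers": {"customer_id", "email"}},
}


def find_unauthorized_access(query_log, grants):
    """Return one finding per log entry that touched columns outside the role's grants."""
    findings = []
    for entry in query_log:
        allowed = grants.get(entry["role"], {}).get(entry["table"], set())
        denied = sorted(set(entry["columns"]) - allowed)
        if denied:
            findings.append({
                "user": entry["user"],
                "table": entry["table"],
                "columns": denied,
            })
    return findings


# Hypothetical log entries, for illustration only.
query_log = [
    {"user": "asha", "role": "analyst", "table": "sales.orders",
     "columns": ["order_id", "amount"]},
    {"user": "ben", "role": "analyst", "table": "sales.customers",
     "columns": ["email"]},
    {"user": "cy", "role": "support", "table": "sales.customers",
     "columns": ["email", "card_number"]},
]
findings = find_unauthorized_access(query_log, GRANTS)
for finding in findings:
    print(finding)
```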
What are the best tools for Data Lineage?
The best tools for Data Lineage are listed below:
Data Lineage helps the user confirm that data comes from a reliable source, that transformations are applied appropriately, and that the data is loaded correctly into the designated location. It plays an important role wherever key decisions rely on accurate information. Without appropriate technology and processes in place, tracking data can be virtually impossible or, at the very least, costly and time-consuming. Data lineage enables the data stream to be tracked from both endpoints to ensure that the data is accurate and consistent.