Data Lineage: Ensuring Data Quality

DP6 Team · Published in DP6 US · Sep 13, 2022

As the demand for data for reporting, analysis and insights has grown, the collection, storage and availability of data for analytical teams has become essential to generating value for the business. However, if it is to be used as an input for studies, the data collected and made available must meet a pre-established quality standard. Adopting processes that validate data quality brings benefits such as greater agility in the analytical process, since the analytical team needs to spend less effort on validating and treating the data. Ensuring data reliability helps to avoid biases and errors in the resulting analyses, and Data Lineage is an important factor in this data quality assurance.

Data Lineage

Data Lineage is a term that refers to monitoring and understanding the entire data lifecycle. From the collection or acquisition of data through processing, analysis and insight generation, it is common for data to undergo several transformations. Its type or format may change: for example, a customer’s phone number may need to be standardized to a numeric format, eliminating characters such as hyphens and parentheses. The data can also be joined with other external or internal data sources so that it is enriched before analysis.
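
As a rough illustration, this kind of standardization might look like the following minimal Python sketch (the function name and sample number are illustrative only, not taken from any specific project):

```python
import re

def standardize_phone(raw: str) -> str:
    """Keep only the digits of a phone number, dropping hyphens, parentheses and spaces."""
    return re.sub(r"\D", "", raw)

print(standardize_phone("(11) 98765-4321"))  # -> "11987654321"
```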

Data processing, transformation and joining operations tend to be sequential, so errors in one step lead to errors in later ones. For example, if a process removes one of the numeric digits of a phone number instead of the special characters, the subsequent joining and analysis steps are very likely to be affected, interfering with the final result of the analysis. For situations like this, Data Lineage is an effective solution.

The main objective of the Data Lineage process is to monitor data throughout its life cycle, enabling the tracking and identification of errors that occur along the way. This gives teams greater freedom and security to perform system migrations and implement changes: if the data is being tracked and errors are monitored, any errors introduced by updates or new implementations can easily be identified and corrected.

Data lineage techniques

As detailed in this text by Imperva, a cybersecurity company that works with data protection, there are a number of techniques that can be used in Data Lineage:

  • Pattern-Based Lineage: this technique evaluates metadata and searches for patterns that identify the same data at different points in its life cycle. The advantage of this approach is that it does not depend on the code or technology used in the process. However, because it relies on finding the patterns that relate the same data across different stages of the life cycle, it may be less accurate and can miss some connections between the data.
  • Lineage by Data Tagging: this technique uses the tags left by the data processing tool to track the data through every stage. It is useful when the processing tool marks the data with tags that allow it to be identified throughout the process (a minimal sketch of this idea follows the list).
  • Self-Contained Lineage: it is common for organizations to have a centralized environment for storing, processing and managing data. Inside these environments, data monitoring is available natively; however, it is restricted to what happens within the environment itself.
  • Lineage by Parsing: according to the text, this is the most advanced form of data lineage and the most complex solution to implement, as it depends on understanding all the technologies used during the process. It automatically reads and interprets the logic used in the data transformations.
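
As a rough sketch of the data tagging idea (all names here are illustrative and not tied to any specific tool), each record can carry a tag that is appended at every stage, so that its processing history can be reconstructed later:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Record:
    value: dict
    lineage: list = field(default_factory=list)  # ordered list of stage tags

def tag(record: Record, stage: str) -> Record:
    """Append a stage tag (stage name plus a unique id) to the record's lineage."""
    record.lineage.append({"stage": stage, "tag": uuid.uuid4().hex})
    return record

record = tag(Record({"customer_id": 42}), "ingestion")
record = tag(record, "enrichment")
print([t["stage"] for t in record.lineage])  # ['ingestion', 'enrichment']
```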

Data Lineage during the data lifecycle

While executing the data pipeline, there are several steps where Data Lineage implementation can bring good results.

  • Ingestion: during data ingestion, the monitoring process can be useful for identifying errors in data transfer or in the workflows responsible for the load.
  • Processing: in this stage, Data Lineage can be used to track and validate the operations and transformations performed on the data, making sure that the results produced by those operations are reliable (see the validation sketch after this list).
  • Data query: during the data query process, users can perform operations and cross-references on databases that generate new data. In this scenario, it is important to validate the new results obtained from the data, to ensure that any analyses or reports created are reliable.
  • Data Lake: here, Data Lineage can act as a tool to help with data governance and security, tracking and monitoring user access to different types and sets of data.
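
To make the processing-stage idea more concrete, here is a minimal sketch (a hypothetical helper, not any real library) that wraps each transformation, validates its output and records step metadata for later tracing, reusing the phone number example from earlier:

```python
lineage_log = []

def run_step(name, transform, rows, check):
    """Run one pipeline step, validate its output and record lineage metadata."""
    result = transform(rows)
    passed = check(result)
    lineage_log.append(
        {"step": name, "rows_in": len(rows), "rows_out": len(result), "passed": passed}
    )
    if not passed:
        raise ValueError(f"validation failed after step '{name}'")
    return result

phones = ["(11) 98765-4321", "(21) 91234-5678"]
cleaned = run_step(
    "strip_special_chars",
    lambda rows: ["".join(c for c in r if c.isdigit()) for r in rows],
    phones,
    # Brazilian mobile numbers have 11 digits; a different count would expose
    # the bug described earlier (a digit removed instead of punctuation).
    lambda rows: all(len(r) == 11 for r in rows),
)
print(lineage_log)  # [{'step': 'strip_special_chars', 'rows_in': 2, 'rows_out': 2, 'passed': True}]
```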

Commercial Value

As mentioned in this article by Dremio, the concept of data lineage may seem a bit abstract, but it can bring great commercial value to your business, for example:

  • Improved business performance: the reliability and quality of the data has a direct impact on the analysis delivered by the business team.
  • Compliance management: reduced cost of compliance with current regulations.
  • Better handling of evolving data sources: monitoring the entire data lifecycle allows for agile evolution and adaptation of data sources.
  • Reduced IT costs and risks: lower costs and risks when implementing changes to existing processes.

Data Lineage Tools

OpenLineage

OpenLineage is a platform for data lineage analysis that can track metadata on datasets and provide insights that identify the root cause of problems arising from processing errors.

The tool has standard connectors for the main platforms on the market, such as Spark, Airflow and dbt. Once in place, the connectors call the OpenLineage API to automatically capture information about datasets, jobs and runs.
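
As a rough illustration of what a manually emitted lineage event looks like, here is a minimal sketch using the openlineage-python client (the backend URL, namespace and job name are placeholders; in practice the connectors above emit these events automatically):

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Client pointed at a lineage backend (e.g. Marquez); the URL is a placeholder.
client = OpenLineageClient(url="http://localhost:5000")

job = Job(namespace="example-namespace", name="daily_load")
run = Run(runId=str(uuid4()))

# Mark the start of a job run; a matching COMPLETE (or FAIL) event would
# normally be emitted when the run finishes.
client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/my-pipeline",  # identifies the event emitter
    )
)
```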

TrueDat

TrueDat offers complete data lineage and governance solutions. The tool provides an end-to-end view of the data, with a business glossary, semantic mapping and lineage impact analysis. In addition, there are governance dashboards, and you can configure workflows and define and run data quality controls.

Pachyderm

Pachyderm is a Data Lineage solution whose main features are automated data version control, immutable data lineage (keeping a record of all activity during the data lifecycle) and graphical visualization to assist in designing and debugging workflows and processing. The solution also has a JupyterLab extension that provides a point-and-click interface for versioned data.

Pipeline Penguin

DP6’s innovation center maintains a free Data Lineage solution for use in projects. Still in development, Pipeline Penguin offers built-in support for connecting to BigQuery and allows connections to other tools to be added. In the data quality validation process, you can register assumptions and conditions that the data must meet to be considered correct, and schedule the validation of these conditions at the end of each transformation or ingestion step.

Conclusion

Creating ingestion, transformation and analysis processes requires data to be treated and joined. As such processes are normally sequential, errors at any stage directly affect the subsequent results and, consequently, the final business analyses. To address problems with data reliability and monitoring, Data Lineage tools are being adopted in projects, increasing the reliability and quality of the processes and making both the maintenance of existing data pipelines and the implementation of new ones safer.

Profile of the Author: Lucas Tonetto Firmo | A Computer Engineering graduate of Universidade São Judas Tadeu, Lucas is passionate about technology and its ability to transform our way of life. He worked for two years developing websites and web applications and is currently a Data Engineer at DP6.
