CSI Data: Solving the Mysteries of Your Data’s Past and Present with Data Lineage!

Published in Data And Beyond · Apr 25, 2023 · 6 min read

If you work on a data science or data analytics team, you need a mix of hard and soft skills to get your work done. The most common hard skills are SQL, Python, Tableau, and the like. On the soft-skill side, stakeholder management, presentation, and cross-functional collaboration are the most in demand. Surprisingly, no one pays attention to the detective skills you need!

In every CSI TV show, the episode starts with a new murder case. The case is always unique, offers no ready clues, and must be solved urgently. The chief officer assigns the case to police officers and detectives who start with almost no knowledge of it. These professionals gather extra information to shed light on the situation, reach out to various people for leads, spend countless hours and days, and finally solve the case.

If we swap the chief officer for the data team lead, the detectives and police officers for data scientists and data analysts, and the murder case for a broken pipeline or dashboard, everything else plays out the same way in data teams. This is why every data scientist and data analyst needs proper detective skills! Or we can just implement a data lineage solution and save our team members' time!

In this article, I will dive into data lineage, its column-level and table-level forms, and its similarities to and differences from data modeling.

What is Genealogy?

Before we dive into data lineage and the technical topics, let’s look at what genealogy is, so that the dynamics of data lineage are easier to understand later.

Genealogy is the study of someone’s family history, usually going back multiple generations. It involves researching and documenting information about a person’s ancestors, such as their names, birth and death dates, marriages, and children. By tracing family relationships over time, genealogy can help people learn more about their roots, cultural heritage, and familial connections. It can also provide insights into the historical events and social contexts that shaped their ancestors’ lives.

What is Data Lineage?

Data lineage is the process of tracking the complete life cycle of data, from its origin to its final destination. It involves identifying the different sources of data, understanding how it flows through various systems, and documenting all the transformations and changes that occur along the way. The goal of data lineage is to provide a clear and complete picture of the data’s journey, enabling organizations to ensure its accuracy, compliance, and quality. In simpler terms, data lineage is the application of genealogy to our data flows.
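
To make the genealogy analogy concrete, here is a minimal sketch in Python. The asset names (`raw_orders`, `fct_orders`, `revenue_dashboard`, and so on) and the `ancestors` helper are hypothetical and exist only for illustration. A real lineage tool would typically derive this graph from query logs or pipeline metadata rather than a hand-written dictionary.

```python
# Hypothetical lineage: each asset maps to the assets it is built from (its "parents").
PARENTS = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
    "raw_orders": [],
    "raw_customers": [],
}


def ancestors(asset, parents=PARENTS):
    """Return every upstream asset of `asset`, like walking up a family tree."""
    seen = set()
    stack = list(parents.get(asset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("revenue_dashboard")))
# ['fct_orders', 'raw_customers', 'raw_orders', 'stg_customers', 'stg_orders']
```

When the dashboard breaks, this list of "ancestors" is the set of places a data detective would have to inspect.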

The Differences Between Data Lineage and Data Modeling

Data lineage and data modeling are often confused with each other. The terms are sometimes used interchangeably, which may be harmless in some situations, but they are distinct concepts that address different needs in data management.

Data modeling, on the other hand, is the process of creating a conceptual or logical representation of data structures and relationships. It involves identifying the data entities, attributes, and relationships, and defining them in a formal language or notation, such as ER (Entity-Relationship) diagrams or UML (Unified Modeling Language) diagrams. The goal of data modeling is to create a common understanding of the data among stakeholders and to provide a blueprint for designing and implementing databases or data systems.

While data lineage and data modeling are related in that they both deal with data structures and relationships, they serve different purposes. Data lineage focuses on tracking the flow of data over time, whereas data modeling focuses on creating a formal representation of data structures and relationships.

What are the Components of Data Lineage?

The components of data lineage typically include:

Data source: This is the original location where the data is created or first captured.
Data flow: This refers to the movement of data through different systems and processes, such as ETL (extract, transform, load) tools, data warehouses, or BI (business intelligence) applications.
Transformation: This involves any changes or modifications made to the data as it moves through the different systems. Transformations may include data cleansing, data aggregation, or data enrichment.
Mapping: This refers to the association between source and target data elements, documenting how data is transformed from one form to another.
Metadata: This is information about the data, such as data types, field lengths, or other data characteristics, that helps to describe and define the data.
Dependencies: This refers to the relationships between different data elements, including how they are used in different systems and applications.
Lineage visualization: This is the graphical representation of the data lineage, which can help users to better understand the flow of data and its relationships.
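
One way to picture how the components above fit together is as a small data structure. The sketch below is an assumption on my part, not the schema of any particular lineage tool: the `ColumnMapping` and `LineageStep` classes, their fields, and the sample table and column names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMapping:
    """Mapping: which source column feeds which target column, and how."""
    source_column: str
    target_column: str
    transformation: str  # free-text description of the change applied


@dataclass
class LineageStep:
    """One hop of data flow from a source dataset to a target dataset."""
    source: str  # data source this hop reads from
    target: str  # dataset this hop writes to
    mappings: list[ColumnMapping] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)  # e.g. target data types


step = LineageStep(
    source="raw_orders",
    target="stg_orders",
    mappings=[
        ColumnMapping("order_ts", "order_date", "cast timestamp to date"),
        ColumnMapping("amount_cents", "amount", "divide by 100"),
    ],
    metadata={"order_date": "DATE", "amount": "DECIMAL(10, 2)"},
)

# Dependencies: the target columns that rely on each source column.
dependencies = {}
for m in step.mappings:
    dependencies.setdefault(m.source_column, []).append(m.target_column)

# A very plain text "visualization" of this single step.
for m in step.mappings:
    print(f"{step.source}.{m.source_column} -> {step.target}.{m.target_column} [{m.transformation}]")
```

Chaining many such steps together gives the full data flow, and a graphical tool would draw the same information as a diagram instead of printed lines.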

Column vs Table Data Lineage

Directed acyclic graphs (DAGs) are the most used and intuitive way to visualize the full lineage between data sources. Depending on the problem statement, we can use either column-based or table-based data lineage. The most important point is that the data lineage solution should support both. But what are the pros and cons of each method?

Column-based data lineage:

Pros:

Provides more granular detail on individual columns and their transformations.
Enables easier tracking of specific data elements or attributes.
Helps identify data quality issues at a more granular level.
Can help identify data lineage dependencies between specific columns and transformations.

Cons:

Can be more complex and time-consuming to create and maintain.
May require more detailed documentation and metadata.
Can be more difficult to visualize and interpret for non-technical stakeholders.

Table-based data lineage:

Pros:

Provides a higher-level view of the data, making it easier to understand for non-technical stakeholders.
Can be easier and quicker to create and maintain.
Provides a more holistic view of the data and its relationships between tables.
Enables easier tracking of table-level data quality issues.

Cons:

May not provide enough detail for a more granular analysis.
May not capture specific transformations or dependencies between columns.
May require additional metadata and documentation to fully understand the data lineage.

Ultimately, the choice between column-based and table-based data lineage will depend on the specific needs and priorities of the organization. Both approaches have their strengths and weaknesses, and organizations may choose to use a combination of both depending on the specific use case.
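
The relationship between the two levels can be sketched in a few lines of Python. This is an illustrative example under my own assumptions: the edge list, the `table.column` naming convention, and the helper functions are hypothetical and not taken from any specific lineage product.

```python
# Hypothetical column-level lineage edges: (source "table.column", target "table.column").
COLUMN_EDGES = [
    ("raw_orders.amount_cents", "stg_orders.amount"),
    ("raw_orders.order_ts", "stg_orders.order_date"),
    ("stg_orders.amount", "fct_revenue.total_amount"),
    ("stg_orders.order_date", "fct_revenue.order_date"),
    ("stg_customers.country", "fct_revenue.country"),
]


def table_of(column_ref):
    """Extract the table name from a 'table.column' reference."""
    return column_ref.split(".", 1)[0]


def to_table_edges(column_edges):
    """Collapse column-level edges into a deduplicated set of table-level edges."""
    return {(table_of(src), table_of(dst)) for src, dst in column_edges}


def upstream_columns(column_ref, column_edges):
    """Return the columns that directly feed `column_ref`."""
    return [src for src, dst in column_edges if dst == column_ref]


print(sorted(to_table_edges(COLUMN_EDGES)))
# [('raw_orders', 'stg_orders'), ('stg_customers', 'fct_revenue'), ('stg_orders', 'fct_revenue')]
print(upstream_columns("fct_revenue.total_amount", COLUMN_EDGES))
# ['stg_orders.amount']
```

Note that the roll-up only works in one direction. Five column-level edges shrink to three table-level edges that are easier to read, but the table-level view alone can no longer tell you which column feeds `fct_revenue.total_amount`.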

Conclusion

Data lineage is an essential aspect of data science and data analytics that enables organizations to understand the journey of data from its origin to its destination. It tracks the origin, movement, and transformation of data through the stages of its lifecycle, and it plays a critical role in enhancing the performance and productivity of data science and analytics teams.

By establishing a clear data lineage, data scientists and analysts can easily identify the source of data, its quality, and its dependencies. This information is crucial in ensuring data accuracy, reliability, and consistency. Additionally, data lineage helps teams quickly identify errors or issues in the data processing pipeline, enabling them to take corrective action promptly.

Moreover, data lineage promotes better collaboration among team members by providing a shared understanding of the data ecosystem. This shared understanding ensures that everyone involved in the data analysis process is on the same page, reducing confusion and errors caused by miscommunication.