Data Threads: Automated Data Lineage for a Complete Picture

Published in

IBM Data Science in Practice

4 min read · Aug 22, 2022

It is an executive’s worst, and far too common, nightmare: days before quarterly earnings are submitted, the revenue analyses from the sales team and the product team are at odds. The analyst on each team insists that their dashboard paints the accurate picture because they made no mistakes when compiling the earnings numbers into their BI software of choice. If that is true, where did the data diverge?

Organizations have more data, more data tools, and more people involved in data consumption than ever before. As data proliferates throughout an organization, it becomes more challenging to trust the data, govern it, and protect privacy, all while meeting the demands of increasingly stringent regulations. Organizations need a map that traces a dataset’s journey from its origin to its end use, with specific detail on how it was transformed, and by whom, along the way.

Data Lineage in Watson Knowledge Catalog

Data lineage explained

Data lineage is a visual representation of a dataset’s journey from its origin to its end use. It has evolved into the main enterprise tool for understanding the flow of data and the contribution of each person and program along the data’s lifecycle.

Prior to automated scanners, some enterprises spent hours creating a manual data lineage. A manual lineage usually involves documenting the knowledge of application owners, data stewards, and integration specialists. This approach is time-intensive, can contain contradictory or missing information, and leads teams to rely on unsound lineage when making critical decisions. Moreover, examining code, comparing column names, and reviewing tables by hand is tedious. Given code volumes, complexity, and the rate of change, this method quickly becomes unsustainable. Sooner or later, a manually managed lineage falls out of sync with the actual state of the data, and the enterprise is left with a lineage that cannot be trusted.

As a product of IBM’s partnership with MANTA, data engineers can scan data connections into IBM Cloud Pak for Data to automatically retrieve a complete technical lineage and a summarized view that includes information on data quality and business terms. IBM’s collaboration with industry leader MANTA enables users to capture data lineage consistently, accurately, and often in a matter of seconds. Simply put, every operation, whether in a database, an integration, or an analysis tool, is reverse engineered to create a map that tells the story from data source to end application. Data lineage in Cloud Pak for Data provides the deep technical lineage that data engineers need, historical versioning to view lineage changes over time, and indirect lineage to record even the most specific data operations.
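To make the idea of a "map from data source to end application" concrete, here is a minimal sketch in Python. It is not MANTA's or Cloud Pak for Data's API. The asset names, the operations, and the `trace_to_origins` helper are all hypothetical, chosen only to show how recorded operations can be chained back from a report to its origins.

```python
# Hypothetical lineage edges: (source asset, target asset, operation).
# In a real scanner these would be derived from SQL, ETL jobs, and BI
# metadata; here they are written by hand purely for illustration.
EDGES = [
    ("crm.orders", "staging.orders_clean", "dedupe"),
    ("erp.invoices", "staging.invoices_clean", "currency_convert"),
    ("staging.orders_clean", "warehouse.revenue", "join"),
    ("staging.invoices_clean", "warehouse.revenue", "join"),
    ("warehouse.revenue", "bi.sales_dashboard", "aggregate"),
]


def build_parents(edges):
    """Index the edges by target so each asset knows what produced it."""
    parents = {}
    for source, target, operation in edges:
        parents.setdefault(target, []).append((source, operation))
    return parents


def trace_to_origins(asset, parents):
    """Return every path from an origin asset to `asset`.

    Each path alternates asset names and operation labels. Assumes the
    lineage graph has no cycles.
    """
    if asset not in parents:
        return [[asset]]  # nothing produced it, so it is an origin
    paths = []
    for source, operation in parents[asset]:
        for path in trace_to_origins(source, parents):
            paths.append(path + [f"--{operation}-->", asset])
    return paths


parents = build_parents(EDGES)
for path in trace_to_origins("bi.sales_dashboard", parents):
    print(" ".join(path))
```

Each printed line is one route the data took to reach the dashboard, including the operation applied at every hop. An automated scanner's job is essentially to produce the edge list so that nobody has to maintain it by hand.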

Use cases

As mentioned above, data lineage is an excellent tool for pinpointing the root cause of a data issue in an end-use application like a BI report. However, users also rely on data lineage for compliance, impact analysis, and architecture migrations.

Data privacy regulations like GDPR continue to expand in scope, and new regulations are on the rise. Many require data lineage as a first step in compliance reports.

Enterprises are constantly implementing changes to their data architecture and pipelines. Without a data lineage, it can be difficult or impossible to assess the impact of planned changes. Research from IBM shows that fixing a bug in production is 15x more expensive than fixing it during the implementation phase. Data lineage gives teams insight into the downstream impacts of these changes before potentially costly bugs are introduced.

Customers also use data lineage to speed up the migration process while undergoing a digital transformation. Data lineage gives engineers visibility into which architectural components must be migrated at once and which need not be migrated at all.

To summarize, data lineage builds trust in data when it is needed most and increases the efficiency of data engineers.
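Two of these use cases, impact analysis and root-cause analysis, reduce to simple graph traversals once a lineage exists. The sketch below is illustrative only: the `DOWNSTREAM` graph, the asset names, and the helper functions are hypothetical and do not reflect any IBM or MANTA interface.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets in its list.
DOWNSTREAM = {
    "crm.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["warehouse.revenue", "warehouse.product_revenue"],
    "finance.fx_rates": ["warehouse.revenue"],
    "warehouse.revenue": ["bi.sales_dashboard"],
    "warehouse.product_revenue": ["bi.product_dashboard"],
}


def reachable(start, graph):
    """Breadth-first search: every asset reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(graph):
    """Flip edge direction so the graph can be walked upstream."""
    upstream = {}
    for source, targets in graph.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def impacted_by(asset):
    """Impact analysis: everything downstream of a planned change."""
    return reachable(asset, DOWNSTREAM)


def divergent_inputs(report_a, report_b):
    """Root-cause hint: upstream assets that only one report depends on."""
    upstream = invert(DOWNSTREAM)
    inputs_a = reachable(report_a, upstream)
    inputs_b = reachable(report_b, upstream)
    return inputs_a - inputs_b, inputs_b - inputs_a


print(sorted(impacted_by("staging.orders_clean")))
only_sales, only_product = divergent_inputs("bi.sales_dashboard", "bi.product_dashboard")
print(sorted(only_sales), sorted(only_product))
```

In this toy graph, the sales dashboard depends on `finance.fx_rates` while the product dashboard does not, which is the kind of divergence an engineer would investigate first when two revenue reports disagree.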

If data lineage is ignored or mapped inaccurately, decision makers lose faith in reports and models. Analysts and data scientists deserve data that supports accurate, timely, and confident decision making. People can rely on data only when they have a complete understanding of it.

Data lineage and a data fabric architecture

Although an automated data lineage is crucial for understanding and building trust in an organization’s data, it must be used in combination with other approaches to protect, integrate, and democratize data across the enterprise. A data fabric architecture automates data discovery, governance, and consumption by simplifying data access in an organization to facilitate self-service data consumption. Recognized as a leader in enterprise data fabric solutions by Forrester, IBM’s approach to designing and implementing a data fabric is agnostic to data environments, processes, utility, and geography.

A data fabric ensures end users have access to the data they need at the exact moment it is needed, regardless of where the data lives and what tool is being used. Data lineage allows organizations to understand how their data moves, transforms, and is consumed across multiple sources and tools. For this reason, data lineage is a key tool in a data fabric architecture. With this approach, the next time an executive receives conflicting reports and an engineer is tasked with tracking down the root cause of the discrepancy, both can use the data lineage provided with a data fabric architecture to quickly understand the data and regain trust in it.

Learn more about IBM’s approach to creating a data fabric solution or how IBM Cloud Pak for Data makes this a reality.


Written by Jacob Stellon