Augmented Data Lineage for Data Scientists and Beyond

Nazar Labunets
May 28 · 2 min read

Data lineage is a highly sought-after capability for modern data management and data governance teams. By now, it has become a critical feature of data catalogs and metadata management solutions, offering a wide range of benefits and applications. These include regulatory compliance, impact analysis, and a faster understanding of the enterprise data landscape.

Typically, data lineage is associated with technical roles, such as ETL developers and data engineers. However, when data lineage is enriched with business metadata, it can become a particularly useful and practical capability for business users and analytical roles, such as data scientists.

In this post, we’ll introduce the concept of augmented data lineage as a tool for business users. We will explore how business and analytical roles within enterprises can use it to find data and perform root cause analyses faster while avoiding corporate red tape.

What is Augmented Data Lineage?

Augmented data lineage is “regular” data lineage enriched with information from a data catalog: metadata such as real-time data quality, business terms and categories, and anomalies detected in data loads.

Enhanced with this information, data lineage can speed up the process of locating the right data or support analytical activities, such as root cause analysis or data quality analysis. The visual presentation of augmented data lineage alone makes a big difference in a user’s ability to draw conclusions, as opposed to just viewing a list of data sets on the catalog’s search results page.
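
To make the idea concrete, here is a minimal Python sketch of lineage enriched with catalog metadata. It is an illustration only, not how any particular data catalog works: the `Dataset` fields, the data set and system names, and the helper functions are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    """A lineage node carrying hypothetical catalog metadata."""
    name: str
    system: str
    business_terms: set = field(default_factory=set)
    quality_score: Optional[float] = None  # assumed scale: 0.0 (worst) to 1.0 (best)
    load_anomaly: bool = False             # anomaly flagged in the latest data load


# Catalog entries; every value here is made up for illustration.
catalog = {
    d.name: d
    for d in [
        Dataset("crm.customers", "CRM", {"Customer", "PII"}, 0.98),
        Dataset("web.orders", "E-shop", {"Order"}, 0.91, load_anomaly=True),
        Dataset("dwh.customer_orders", "DWH", {"Customer", "Order", "PII"}, 0.95),
        Dataset("bi.revenue_report", "BI", {"Revenue"}, 0.99),
    ]
}

# "Regular" lineage: each data set mapped to the data sets it is derived from.
upstream = {
    "dwh.customer_orders": ["crm.customers", "web.orders"],
    "bi.revenue_report": ["dwh.customer_orders"],
}


def ancestors(name):
    """Return every data set that feeds `name`, directly or indirectly."""
    seen = []
    stack = list(upstream.get(name, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(upstream.get(current, []))
    return seen


def upstream_anomalies(name):
    """Upstream data sets with a flagged load anomaly (root cause candidates)."""
    return sorted(a for a in ancestors(name) if catalog[a].load_anomaly)


def pii_origins():
    """Systems of PII-tagged data sets that have no upstream, i.e. the sources."""
    return sorted(
        {d.system for d in catalog.values()
         if "PII" in d.business_terms and d.name not in upstream}
    )


print(upstream_anomalies("bi.revenue_report"))
print(pii_origins())
```

Plain lineage supplies only the `upstream` mapping. The metadata attached to each node is what lets a single graph traversal point to a suspicious load behind a report, or trace PII back to the system where it originates.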

Data lineage enhanced with business terms

This enriched data lineage can help answer many questions that are typically addressed with a data catalog search query or by consulting standard data lineage:

  • Is this the best data I can use for my data science project or analytic assignment?
  • Has this report been generated from valid and timely data?
  • Why does a metric in a report contain an unexpectedly large or small value?
  • Which data sets contain PII, and in which systems do they originate?

Let’s examine how these questions can be answered by using augmented data lineage.


To learn about specific use cases of augmented data lineage for data scientists and data stewards, read the rest of this article at ataccama.com.

Ataccama

Self-Driving Data Management
