The Future of Data Lineage — Beyond a Diagram
For most, the term data lineage conjures up thoughts of a lineage diagram, with nodes and connections between those nodes. You know, a bit like a family tree, if you had relatives called SALES_QUANTITY and PRODUCT_KEY. This is basically how we saw lineage, until countless conversations with data engineers made us realise this is just the jumping-off point.
Initially, we were surprised by just how triggering the term ‘data lineage’ was. Most data engineers had thought about the topic, and plenty were actively exploring how they could implement it. This was great validation for Alvin; we’d identified a near universal pain felt in the data community — hurrah!
If only it were that simple. The more we talked, the more pains we uncovered. These were experienced in varying degrees depending on a myriad of factors including role, company, regulatory environment and tech stack.
It dawned on us that data lineage itself is not the pain at all; it’s the technology. Asking ‘is data lineage a problem for you?’ is really no different to asking ‘are graph databases a problem for you?’ To us, data lineage is a real-time data structure that maps dependencies between columns, tables, dashboards, jobs and people. It’s a dataset like any other, and a diagram is only one way to represent it.
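To make the ‘dataset like any other’ idea concrete, here is a minimal sketch in Python. It is an illustration only, not how Alvin stores lineage: the `Edge` record, the asset names and the email address are all made up for the example. Each row is one dependency, and the same rows can be filtered or aggregated without ever drawing a diagram.

```python
from collections import Counter
from typing import NamedTuple


class Edge(NamedTuple):
    """One dependency: `downstream` reads from (or relies on) `upstream`."""
    upstream: str
    downstream: str
    downstream_kind: str  # "column", "table", "dashboard", "job" or "person"


# Hypothetical lineage rows, invented for illustration.
LINEAGE = [
    Edge("sales.SALES_QUANTITY", "reporting.revenue.TOTAL_UNITS", "column"),
    Edge("sales.PRODUCT_KEY", "reporting.revenue.PRODUCT_KEY", "column"),
    Edge("reporting.revenue.TOTAL_UNITS", "Weekly Revenue", "dashboard"),
    Edge("Weekly Revenue", "cfo@example.com", "person"),
]


def direct_dependents(asset):
    """Everything that depends directly on `asset`."""
    return [edge.downstream for edge in LINEAGE if edge.upstream == asset]


def dependents_by_kind():
    """Count dependencies by the kind of thing on the downstream side."""
    return Counter(edge.downstream_kind for edge in LINEAGE)
```

Calling `direct_dependents("sales.SALES_QUANTITY")` is a lookup and `dependents_by_kind()` is a summary; a diagram would simply be a third view over the same rows.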
Part of why current tooling has only scratched the surface when it comes to data lineage is that it has been part of a top-down approach to data governance. Technology has been pushed on data engineers, at times creating more headaches than it solves. We’ve found little love for enterprise data catalogs in the data community.
In reality, it is the data engineers who face the challenges of ensuring good governance every day. By truly understanding the technical use cases, data governance best practices can be integrated into their workflows, sparking joy instead of dread. Let’s now discuss some of the top use cases for data lineage that come up in our conversations with data engineers, before revisiting the central question of whether a diagram is the best way to address them.
Without leveraging lineage data, making data infrastructure changes (even in a relatively simple environment) was cited as a huge pain. Something as innocent as adding a column to a table can have significant downstream implications. In one company we spoke to, this had caused a metric used in financial reporting to be wrong for months (yikes!). To try to avoid this, engineers commonly inspect SQL logs, Airflow DAGs, LookML files and GitHub files by hand. Not only is this incredibly tedious and a huge waste of a highly skilled engineer’s time, it’s also imperfect. An anxious wait for some Slack backlash after making changes is the norm. Or even worse, the CFO showing up at your desk!
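With lineage available as data, the manual inspection described above becomes a graph traversal. The sketch below is one simple way to do it, assuming a hypothetical `DOWNSTREAM` mapping from each asset to the assets that read from it; the asset names are invented, and a real environment would build this mapping from query logs and tool metadata rather than by hand.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "orders.QUANTITY": ["sales.SALES_QUANTITY"],
    "sales.SALES_QUANTITY": [
        "finance.revenue.UNITS_SOLD",
        "marketing.campaign_stats.UNITS",
    ],
    "finance.revenue.UNITS_SOLD": ["dashboard:Monthly Financial Report"],
    "marketing.campaign_stats.UNITS": [],
    "dashboard:Monthly Financial Report": [],
}


def impacted_assets(changed, downstream=DOWNSTREAM):
    """Return every asset downstream of `changed`, nearest dependents first."""
    seen = {changed}          # guards against cycles and repeated visits
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

For `impacted_assets("sales.SALES_QUANTITY")` the result includes the financial reporting dashboard two hops away, which is exactly the kind of dependency that is easy to miss when reading logs by eye.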
From our conversations, we even began to categorise different personas amongst data engineers. The ‘Mapper’ meticulously inspects logs and maps out downstream dependencies before making any changes. The ‘Dodger’ avoids making changes unless absolutely necessary. The ‘Hoper’ is so loaded up with other tasks that they go on what’s in their heads, and hope for the best. As you can see from our ratings, none are ideal:
Most data-related questions (and complaints) are directed at data engineers. ‘Why is this dashboard broken?’ ‘Where is the latest data?’ ‘Can you check if this metric is correct?’ Even after performing impact analysis, data pipeline errors can happen. The data consumer will often spot them, and it’s down to the data engineer to investigate.
Another layer to this is that, after investigating, data engineers often found no error at all. For example, a marketer thought a metric was out when a campaign had simply underperformed, or an analyst asked for the most recent data before the scheduled job had run. With many data consumers comes great responsibility.
Prioritising these ad-hoc tasks on the fly, alongside their core work, without compromising on security and privacy, is no walk in the park for data engineers. Without lineage data, tracing the problem is extremely manual, and is mainly achieved through a combination of prior knowledge and guesswork, a particular nightmare for new joiners.
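Problem tracing is the same traversal run in the opposite direction. As a rough sketch, assume a hypothetical `UPSTREAM` mapping from each asset to its sources and a hypothetical `LAST_LOAD_OK` flag per table (in practice this would come from a scheduler or job logs). Walking upstream from the reported dashboard narrows the search to the sources that actually failed.

```python
# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "dashboard:Campaign Performance": ["marketing.campaign_stats"],
    "marketing.campaign_stats": ["raw.ad_clicks", "raw.orders"],
    "raw.ad_clicks": [],
    "raw.orders": [],
}

# Hypothetical job status: did the most recent load of each table succeed?
LAST_LOAD_OK = {
    "marketing.campaign_stats": True,
    "raw.ad_clicks": False,
    "raw.orders": True,
}


def suspect_sources(asset):
    """Walk upstream from `asset` and return ancestors whose last load failed."""
    suspects = []
    seen = {asset}
    stack = [asset]
    while stack:
        current = stack.pop()
        for parent in UPSTREAM.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            # Assets with no recorded status are assumed healthy in this sketch.
            if not LAST_LOAD_OK.get(parent, True):
                suspects.append(parent)
            stack.append(parent)
    return suspects
```

An empty result is useful too: it suggests the pipeline is healthy and the question is more likely about the numbers themselves, such as a campaign that simply underperformed.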
Data asset clean-up
When it comes to their analytics environment, data engineers consistently expressed a lack of visibility into what data assets were in use. Without regular attention, unused tables and dashboards can pile up fast.
‘Single use’ tables for ad-hoc analysis pop up faster than they can be spotted and removed, a bit like playing Whac-A-Mole blindfolded. Dashboards created for campaigns will often have a limited lifespan, but who is going to think to delete them when they cease to be useful?
This is why people need to be included in any lineage graph. Before deleting a column or table, data engineers need to know not only which data assets depend on it, but also which people. In the data warehouse this could be ad hoc queriers; in the BI tool, dashboard viewers. This context helps data engineers declutter the analytics environment, and feel totally zen.
Within this category we identified a couple of distinct use cases:
- Periodic clean-up: like a spring clean, data engineers want to get rid of ‘stale’ assets in one go.
- Asset lookup: a data engineer wants to know if a specific data asset is in use, and how that usage has changed over time.
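Both use cases can be sketched over a simple usage log. Everything below is an assumption made for illustration: the `USAGE` rows, the asset names, the email addresses and the 90-day idle threshold. Note that an asset that has never been used would not appear in such a log at all, so a real clean-up would also need the full list of assets.

```python
from datetime import date, timedelta

# Hypothetical usage log: (asset, person, day the asset was queried or viewed).
USAGE = [
    ("tmp.adhoc_churn_2020", "analyst@example.com", date(2020, 3, 2)),
    ("reporting.revenue", "cfo@example.com", date(2021, 1, 18)),
    ("reporting.revenue", "analyst@example.com", date(2021, 1, 20)),
    ("dashboard:Spring Campaign", "marketer@example.com", date(2020, 6, 30)),
]


def last_used(usage):
    """Map each asset to the most recent day anyone used it."""
    latest = {}
    for asset, _person, day in usage:
        if asset not in latest or day > latest[asset]:
            latest[asset] = day
    return latest


def stale_assets(usage, today, max_idle_days=90):
    """Periodic clean-up: assets nobody has touched within `max_idle_days`."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(a for a, day in last_used(usage).items() if day < cutoff)


def users_of(usage, asset):
    """Asset lookup: the people who have used a specific asset."""
    return sorted({person for a, person, _day in usage if a == asset})
```

Here `stale_assets(USAGE, today=date(2021, 2, 1))` flags the ad hoc table and the old campaign dashboard, while `users_of(USAGE, "reporting.revenue")` shows who to talk to before touching the revenue table.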
Helping data consumers understand whether they can trust data has generally been handled on an ad hoc basis by data engineers. Common questions here may include ‘How is this field calculated?’, ‘Should I use the VALUE or the PRICE field?’ and ‘Where does this data come from?’ These sorts of questions are most common when working with new datasets.
Open source data catalogs such as Amundsen (Lyft) and DataHub (LinkedIn) were driven by the fact that there was a ton of useful data that wasn’t getting used because people had no way of finding it. But new datasets aren’t so useful unless they can be trusted.
Data catalogs solve this through metadata (owners, descriptions, ratings etc.). But the source of the data, how fields are calculated and how they are used are regularly cited as the most important factors for trust. Lineage data can certainly help trust become more self-serve, and reduce the burden on data engineers to perform this role.
Data engineers describe privacy as a nagging, stress-inducing concern that they have to learn to live with. Much of the work needed to comply with recent, more stringent privacy regulations, such as GDPR in Europe and CCPA in California, falls on their shoulders.
In a modern data environment, sensitive data will (by design) trickle from source to many different tables, dashboards, notebooks, spreadsheets, machine learning models and more. Manually keeping track of where PII ends up with 100% accuracy is close to impossible.
To take a concrete example, if a customer exercises their right to erasure under GDPR, how would you go about tracing and deleting all of their data? With the fines and reputation risk at play, there isn’t any room for error. Lineage technology can be hugely valuable here, tracking sensitive data wherever it flows, even to the darkest corners of the data warehouse.
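One way to picture this tracking is as tag propagation over column-level lineage: mark the source columns known to hold PII, then follow the edges until nothing new is reached. The sketch below uses made-up column names and a hypothetical `PII_SOURCES` set. It deliberately over-approximates, since a hashed or aggregated column may no longer count as personal data, and that judgement still needs a human.

```python
# Hypothetical column-level lineage as (upstream, downstream) pairs.
COLUMN_EDGES = [
    ("crm.customers.EMAIL", "warehouse.dim_customer.EMAIL"),
    ("warehouse.dim_customer.EMAIL", "marketing.audience.EMAIL_HASH"),
    ("warehouse.dim_customer.EMAIL", "notebook:churn_model.features"),
    ("crm.orders.AMOUNT", "finance.revenue.TOTAL"),
]

# Hypothetical set of source columns already known to contain PII.
PII_SOURCES = {"crm.customers.EMAIL"}


def pii_locations(edges, sources):
    """Return every column that receives data from a PII source, plus the sources."""
    tagged = set(sources)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for upstream, downstream in edges:
            if upstream in tagged and downstream not in tagged:
                tagged.add(downstream)
                changed = True
    return tagged
```

For an erasure request, the returned set is the list of places to check for that customer's data, including the notebook that would be easy to forget.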
The future of data lineage
To us, it’s pretty clear that when applying data lineage technology to each of these use cases separately, the solution is going to be different. Attempts to solve them all at once in a diagram tend to look like a handful of spaghetti thrown at the wall, which of course solves nothing.
Let’s zoom into the first use case we looked at: impact analysis. Generally, what data engineers want to know is, what are the downstream consequences of executing this SQL statement? Now, does switching away from your query editor, searching through a data catalog, and trying to work out the impact from a complex lineage diagram seem like the optimal solution? You probably figured from the framing of the question that we don’t think so!
As automation is becoming more prevalent in all aspects of software engineering, it’s natural that data engineering picks up the same practices, moving from DevOps to DataOps. High-quality lineage data will enable similar use cases in data engineering. Companies already run tests for data pipelines, as well as QA testing, and we see some exciting possibilities here when it comes to impact analysis.
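As a rough illustration of what an automated impact check could look like, the sketch below inspects a proposed statement before it runs and warns if a dropped column still has dependents. It is a toy under stated assumptions: the `DOWNSTREAM` mapping and column names are invented, and the regular expression only recognises a plain `ALTER TABLE ... DROP COLUMN ...` statement, where a real check would need a proper SQL parser.

```python
import re

# Hypothetical lineage: each column maps to the assets that read from it.
DOWNSTREAM = {
    "sales.SALES_QUANTITY": ["finance.revenue.UNITS_SOLD"],
    "sales.LEGACY_FLAG": [],
}

# Deliberately naive pattern; it ignores schemas, quoting and multi-column drops.
DROP_COLUMN = re.compile(
    r"ALTER\s+TABLE\s+(\w+)\s+DROP\s+COLUMN\s+(\w+)", re.IGNORECASE
)


def check_statement(sql):
    """Return warnings for a proposed statement; an empty list means no known impact."""
    warnings = []
    for table, column in DROP_COLUMN.findall(sql):
        asset = f"{table}.{column}"
        for dependent in DOWNSTREAM.get(asset, []):
            warnings.append(f"{asset} is read by {dependent}")
    return warnings
```

A check like this could run as a pre-merge step alongside existing pipeline tests, failing the build whenever `check_statement` returns any warnings, so the engineer gets the answer without leaving their workflow.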
By isolating downstream dependencies for a given column, we’ve gone some way to simplifying impact analysis in our beta. We’re pleased that this is proving useful, although we still have some way to go to fulfil the ambitions laid out in this article. But with a community of data engineers constantly pushing our thinking forward, we’re optimistic we can get there.
In the vast majority of cases, the human component of data governance is the data engineer. Data lineage technology has the potential to power a whole host of features that augment and even replace parts of their workflow, allowing them to focus their time on value-creating activities. Governing data is becoming increasingly complex, but it’s our belief that use-case driven tools powered by lineage technology will help data engineers meet the challenge.
Alvin has recently launched a six-month private beta focused on impact analysis, problem tracing and data asset clean-up. It’s the starting point of applying our data lineage technology to the use cases outlined in this article. We’re committed to building Alvin with the input of the data engineering community. Beta members will be invited to our Slack group, and we’ll decide together what to build next.
 Further reading: We need to talk about data governance