The many layers of data lineage

Borja Vazquez
Data @ Monzo
10 min read · Aug 30, 2022

Having a map showing how data evolves from its sources to its destination is the dream of any organisation. Like the gold rush, everyone is after the tool that connects columns, tables and dashboards within the warehouse. But like gold, this visualisation has always been considered a privilege in the data ecosystem.

Defining lineage has been a manual task not accessible to everyone. Usually, only the people working daily with the data transformation processes are aware of the actual flow of data — and typically this lineage is a mix of what’s in their heads, documented information and digging into different tools’ metadata. Basically, you end up needing help from someone else to know what’s blocking the pipeline, when the data is available, where a dashboard gets its data… and sometimes it’s not straightforward to get these answers.

Fortunately, we’ve slowly started to move away from this manual process and towards extracting lineage automatically from the metadata in our data stack, for example by parsing SQL files or dbt’s manifest files (see the sketch after this list). This has made the tools that rely on data lineage (data cataloguing, data quality, data management…) join the gold rush, as it opens the door to new levels of reliability and observability within the warehouse. Basically, this automation allows users to build their own visualisations to answer questions like:

  • Which models have errors and what’s the downstream impact?
  • Who’s using this data? And this dashboard?
  • What is the impact of removing a column or model on the wider analytics ecosystem?
  • What’s delaying the execution of my models? Are there any bottlenecks? Are there any upstream blockers?
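
As a minimal sketch of what that automated extraction can look like (assuming Python with networkx, a dbt project whose compiled metadata sits at the default target/manifest.json path, and a made-up model id for the impact example):

```python
import json

import networkx as nx


def build_lineage_graph(manifest_path: str = "target/manifest.json") -> nx.DiGraph:
    """Build a directed lineage graph with an edge from each parent node to its child."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    graph = nx.DiGraph()
    # parent_map links every dbt node (model, snapshot, test...) to its upstream dependencies.
    for child, parents in manifest["parent_map"].items():
        for parent in parents:
            graph.add_edge(parent, child)
    return graph


graph = build_lineage_graph()

# Example: downstream impact of a single model. The unique_id below is invented;
# real ids look like "model.<project>.<model_name>" in the manifest.
impacted = nx.descendants(graph, "model.analytics.stg_payments")
print(f"{len(impacted)} downstream assets depend on stg_payments")
```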

Unfortunately, data lineage can become overwhelming if you don’t know what you are looking at — there are many layers affecting the different abstractions of data, and it can be difficult to understand as data (and the organisation) scales. Moreover, each of the previous questions might need a different tool (or person!) offering a very specific answer — and the so-called modern data stack toolbox starts to get overloaded. If we want to increase data literacy and provide a safe self-serve environment, we need to move away from so many dependencies on people or tools when trying to understand how data evolves.

How can we create a single interface that everyone can consume and understand? A map where we can navigate through our data assets without first having to navigate between loads of tools and colleagues? In this post we’ll discuss how we can learn from the field of cartography and from Google Maps to unlock the untapped potential of data lineage, and build this ideal interface to improve data literacy and observability.

🗺 Data cartography — using maps to navigate through data

Cartography is the art and science of making and using maps to communicate spatial information effectively, but maps aren’t one-size-fits-all. Different map types are better suited to displaying different types of information. Depending on our needs, we may be interested in maps depicting either the elevation, the rivers, or the roads of a particular region.

However, most of the time maps are not just one layer of data. When we look at a map, there are multiple layers of information on top of each other. There is an overlay where two or more different thematic maps of the same area are placed on top of one another to form a new map. Questions like Where should we build our next shopping centre? can involve overlaying pertinent data layers, such as available development zones, population density, adequacy of transportation systems, elevation of the land and so forth.

Left image by source, right image by Borja Vazquez (author).

When we talk about the different layers of observability in data lineage, we are in a very similar scenario. It seems that there’s no single visualisation giving us a holistic view in an accessible way. But there’s one common pattern: all the data lineage layers are just different themes of the same area. So if we want to analyse the impact of changing a column, we can overlay information about the downstream columns that are going to change, whether we are affecting any fact tables, and whether we need to inform any business areas impacted by the change based on the ownership of the data assets.
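
As a rough illustration of that overlay (assuming a column-level lineage graph is already available, and treating the is_fact and owner node attributes as hypothetical annotations rather than something any tool provides out of the box):

```python
import networkx as nx

# Hypothetical column-level lineage graph: node names are "model.column" strings,
# and the is_fact / owner attributes are assumed annotations for this sketch.
lineage = nx.DiGraph()
lineage.add_node("stg_payments.amount", is_fact=False, owner="payments")
lineage.add_node("fct_payments.amount_gbp", is_fact=True, owner="payments")
lineage.add_node("finance_dashboard.total_revenue", is_fact=False, owner="finance")
lineage.add_edge("stg_payments.amount", "fct_payments.amount_gbp")
lineage.add_edge("fct_payments.amount_gbp", "finance_dashboard.total_revenue")

changed = "stg_payments.amount"
downstream = nx.descendants(lineage, changed)

# Overlay the fact-table and ownership layers on the affected columns.
affected_facts = [col for col in downstream if lineage.nodes[col]["is_fact"]]
teams_to_inform = {lineage.nodes[col]["owner"] for col in downstream}

print(f"Changing {changed} affects {len(downstream)} downstream columns")
print(f"Fact tables impacted: {affected_facts}")
print(f"Teams to inform: {teams_to_inform}")
```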

If we stack all the previous layers together, we end up with a representation very similar to something we are familiar with: Entity Relationship Diagrams. We’ve been using them for years to design and model our warehouses: how columns and tables are connected, what the flow of data is, the business logic. They are the Rosetta Stone of warehouse design, but unfortunately they are really hard to maintain. Even though we’ve started to automate their generation with tools like dbt, these diagrams are now more complex than ever: hundreds of models and connections have left us falling short of fully understanding and maintaining them. There’s no easy way to visualise the different roads data can take within the warehouse.

Like in the old days without Google Maps, if we wanted to travel between cities we had to rely on a mental representation of multiple layers stacked together: paper maps, landmarks, radio reports on accidents, or friends suggesting shortcuts. This is the situation we are in right now when trying to understand all data connections. We still need to mentally merge multiple sources together to fully picture how data moves and evolves: Slack notifications and alerts, dbt docs, that last commit, dependencies, dashboards, metrics…

If we want to reduce all this overhead, we need to start treating Data Lineage Maps as first-class citizens in our day-to-day work: a single source of truth everyone can consume, analyse and understand without the burden of having to learn new tools or ask colleagues. This is where we can learn a lot from Google Maps and graph theory.

🔖 Learning from Google Maps — Custom layers

During the pandemic, Google released a COVID layer in Maps, showing critical information about positive cases in an area so you could make more informed decisions about where to go. But this was not a new kind of functionality for Maps. Over the years, Google also released a layer to track the movement of wildfires, a layer to track air pollution levels, and a platform where anyone could add custom layers to Google Maps, and let’s not forget about classics like the traffic layer, or seeing which areas or shops are busier than usual. The common thread across all these features? All of them are a new layer on top of the good old Google Maps. There’s nothing new for us to learn about how to use it: just a few new clicks, and the info overlays on top of what we already know.

COVID layer showing positive cases in Europe. In the bottom left corner, a pop-up with the different layers Google offers.

In the data world, we’ve been relying on Directed Acyclic Graphs (DAGs) as paper maps to understand the different roads data can take within the warehouse. However, current visualisations are usually a static representation of data dependencies, and even though we’ve got all the data that could bring these maps alive (test results, execution times, resource consumption, access history…), we still need to mentally work out how the DAG and the metadata tie back together. What if we could visualise traffic jams between models the same way Google Maps does between cities? It feels natural that it should be possible to overlay all the metadata we collect on top of a DAG.

Performance layer

What’s making this reporting model finish so late?

What’s the delay introduced to reporting models if this model is merged in production?

Execution times, task durations, costs and resource consumption are key to understanding how data pipelines perform. Using this data, it’s possible to extract metrics pointing at the slowest models, or the ones consuming the most resources. However, these metrics are usually reported in isolation: if a model takes longer to run, they don’t give a view of the actual impact on the whole pipeline.

Overlaying performance data on a DAG can provide a much richer understanding of how single tasks affect the performance of the whole data pipeline. The next image shows an example of how a new visualisation layer can help everyone understand what’s impacting the delivery time of a model: if any task within the highlighted path changes duration, the final model is going to be affected. To build this new layer we only need the execution times of each task, and a little bit of graph theory. In future blog posts we’ll share how to build this type of visualisation.

Each task shows the amount of time it adds to the final execution of fct_trading_pnl. In this case, even though manual_book2 only takes a minute to run, it starts later than the rest of the models, adding a lot of overhead to the landing time of the final model. The Y axis represents a timeline to visualise the execution order.
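
As a first, hedged sketch of that graph-theory step (the model names and durations below are invented, and scheduler delays are ignored; real inputs would come from your orchestrator or dbt run artifacts), the chain of tasks driving a model’s landing time can be found with a single topological pass over the DAG:

```python
import networkx as nx

# Toy DAG echoing the example above; durations (in minutes) and edges are invented.
dag = nx.DiGraph([
    ("stg_books", "int_positions"),
    ("int_positions", "fct_trading_pnl"),
    ("manual_book2", "fct_trading_pnl"),
])
duration = {"stg_books": 5, "int_positions": 10, "manual_book2": 1, "fct_trading_pnl": 7}

# Earliest finish time of each task, assuming a task starts as soon as all of its
# parents have finished (scheduler delays and concurrency limits are ignored here).
finish = {}
for node in nx.topological_sort(dag):
    start = max((finish[parent] for parent in dag.predecessors(node)), default=0)
    finish[node] = start + duration[node]

# Walk back from the final model through its latest-finishing parent: this chain is
# the path whose durations directly drive the landing time of fct_trading_pnl.
path, node = [], "fct_trading_pnl"
while node is not None:
    path.append(node)
    parents = list(dag.predecessors(node))
    node = max(parents, key=finish.get) if parents else None

print(f"fct_trading_pnl lands after {finish['fct_trading_pnl']} minutes")
print("Critical path:", " -> ".join(reversed(path)))
```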

Usage layer

Are there any unused models in the warehouse?

Is this column being used at all?

Having a clear understanding of who is using data, and how, is one of the first steps to keeping a healthy and fit-for-purpose warehouse. Unfortunately, with data and teams constantly growing, this becomes a really difficult task, with one-off experiments creeping into production or models no longer being fit for purpose.

However, by using audit logs, it’s possible to extract all the access and usage information and overlay it on the lineage graph. This could bring that holistic view back to the team by highlighting which data assets are no longer being used.

Model usage at column level — each node represents a model, with information about how many of its columns were queried in the last 90 days. The colour gradient represents how many columns are unused, with green meaning 100% usage and red meaning less than 60%.
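
A hedged sketch of how that usage layer could be attached to the graph (assuming the audit logs have already been aggregated into per-model column counts; the model names and numbers are invented):

```python
import networkx as nx

# Hypothetical result of aggregating warehouse audit logs: per model, how many of
# its columns were queried in the last 90 days versus how many it exposes.
usage = {
    "fct_payments": {"columns_total": 40, "columns_queried": 38},
    "dim_legacy_customers": {"columns_total": 25, "columns_queried": 0},
}

lineage = nx.DiGraph([("fct_payments", "finance_dashboard")])
lineage.add_node("dim_legacy_customers")

for model, stats in usage.items():
    pct_used = stats["columns_queried"] / stats["columns_total"]
    # Store the usage layer as node attributes so it can be rendered as a colour
    # gradient on top of the existing lineage map.
    lineage.nodes[model]["usage_pct"] = pct_used
    lineage.nodes[model]["unused"] = stats["columns_queried"] == 0

unused_models = [m for m, attrs in lineage.nodes(data=True) if attrs.get("unused")]
print("Candidates for deprecation:", unused_models)
```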

Data quality layer, and many more!

What’s blocking the execution of my models?

Which models are lacking tests? And documentation?

In general, model tests and failures are reported in isolation, but the reality is that a model failing or not passing a test will have an effect on the whole warehouse. For example, if a model is blocked by upstream dependencies, stakeholders should be able to trace the blockers to better understand the impact without having to ask engineers or dive into technical tools like Airflow.

Data quality layer showing a failed model (in red) and which downstream models (in orange) are blocked because of it.
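
A minimal sketch of that colouring (assuming the set of failed models has already been extracted, for example from dbt’s run_results.json; the model names are made up):

```python
import networkx as nx

lineage = nx.DiGraph([
    ("stg_payments", "fct_payments"),
    ("fct_payments", "finance_report"),
    ("stg_accounts", "dim_accounts"),
])

# Hypothetical set of failed models, e.g. parsed from dbt's run_results.json.
failed = {"stg_payments"}
blocked = set().union(*(nx.descendants(lineage, model) for model in failed))

# Colour-code the quality layer: failed models in red, everything blocked
# downstream of a failure in orange, the rest in green.
for node in lineage.nodes:
    if node in failed:
        lineage.nodes[node]["status"] = "red"
    elif node in blocked:
        lineage.nodes[node]["status"] = "orange"
    else:
        lineage.nodes[node]["status"] = "green"

print("Blocked by upstream failures:", sorted(blocked))
```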

But data quality is not only about how many tests a model has passed. It’s a measure of how well suited a dataset is to serve its specific purpose. We should think of the data quality layer as a set of layers portraying how good a model is: are the tables and columns documented? Are the data models meeting their service level agreements? Do they have a defined owner? All these metrics can be colour-coded and reported as part of a new layer, giving us a proper understanding of the true quality of the warehouse.
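
As a sketch of how such a composite layer could be derived from dbt’s manifest.json (treating a description, at least one attached test, and an owner recorded under meta as the quality criteria; the scoring rule itself is an assumption, not a standard):

```python
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Models that have at least one dbt test attached to them.
tested = set()
for node in manifest["nodes"].values():
    if node["resource_type"] == "test":
        tested.update(d for d in node["depends_on"]["nodes"] if d.startswith("model."))

quality = {}
for unique_id, node in manifest["nodes"].items():
    if node["resource_type"] != "model":
        continue
    checks = {
        "documented": bool(node.get("description")),
        "tested": unique_id in tested,
        "owned": bool(node.get("meta", {}).get("owner")),  # owner via meta is a convention, not built in
    }
    passed = sum(checks.values())
    # One colour per model for the quality layer: green, orange or red.
    quality[unique_id] = "green" if passed == len(checks) else "orange" if passed else "red"

print(sum(colour == "green" for colour in quality.values()), "models fully meet the quality checks")
```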

🔎 Zooming into the right granularity

One of the main limitations of data lineage is the visualisation of very dense graphs involving hundreds of nodes. We usually try to reduce this granularity by creating sub-DAGs or using tag filters in dbt docs, but even with these solutions it’s really difficult to get a full understanding of how everything is connected. It feels like we are still unfolding an A1 road atlas, trying to figure out how cities are connected. To be effective, data lineage maps should allow different levels of granularity.

Again, Google Maps has solved this problem by helping users zoom in and out. At lower zoom levels the map shows entire continents, while at higher zoom levels it can show the details of a city. With one simple action we can easily change the granularity of the map. If we borrow this concept for data lineage maps, lower zoom levels mean the lineage map shows the different business areas defined in the warehouse, while higher zoom levels show table or even column lineage.
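
As a rough sketch of that zoomed-out view (assuming each model can be mapped to a business area, for example via dbt tags or folder structure; the mapping below is invented), the model-level graph can be collapsed into an area-level one:

```python
import networkx as nx

# Model-level lineage with a hypothetical business-area mapping.
models = nx.DiGraph([
    ("stg_payments", "fct_payments"),
    ("fct_payments", "finance_report"),
])
area = {"stg_payments": "payments", "fct_payments": "payments", "finance_report": "finance"}

# "Zoomed-out" layer: one node per business area, with an edge whenever any model
# in one area feeds a model in another.
zoomed_out = nx.DiGraph()
for src, dst in models.edges:
    if area[src] == area[dst]:
        zoomed_out.add_node(area[src])
    else:
        zoomed_out.add_edge(area[src], area[dst])

print(list(zoomed_out.edges))  # [('payments', 'finance')]
```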

Following this idea, the next image shows an ideal lineage map where we can browse through the different granularity levels of data lineage, while keeping the quality layer when drilling down. In this example we don’t go for fancy zoom in/out, but for click-and-drill-through functionality. With this visualisation we can navigate from a business layer that has some models failing, down through the models that are blocked, to the actual columns that have failed their tests. This example also shows how the different informational layers we’ve been describing in this post can also be aggregated at different granularities, not only at the table level.

Zooming through different granularity levels, from a more abstract business layer to column lineage.

In this post we’ve shown how data lineage can help us remove plenty of data accessibility barriers through a single interface: a map with endless extension opportunities for self-serving.

However, there is one caveat. If we want to remove all barriers, we need to think first and foremost about data modelling, and neither data lineage nor any of the new fancy tools in the modern data stack can help us here. If we don’t model and catalogue the data correctly, we are basically encrypting the warehouse: data lineage becomes a treasure map full of puzzles and dead ends, and believe me, the only treasure you’ll find at the end is despair 😰.
