Mapping our data journey with column lineage

Borja Vazquez
Data @ Monzo
Nov 24, 2021 · 9 min read

Data at Monzo has grown a lot in the last couple of years, not only in the number of people, but also in the number of data assets that we maintain. At the time of writing, we have over 4700 data models in our production dbt project, and over 800 views defined in Looker 🤯.

This acceleration has become challenging for us, creating some growing pains that we’ve previously discussed. In this post, I’d like to give you a flavour of the things we’re exploring within the Analytics Engineering team to keep our warehouse healthy and up to date. Specifically, we’ll look at how we’re trying to answer questions like:

  • What’s the impact of removing a column or model on the wider analytics ecosystem?
  • Is a column being used in a Looker Dashboard?
  • Can we deprecate columns or models that are no longer being used?
  • Do we have broken fields in Looker that no longer link to BigQuery?

Tools like dbt lineage, spectacles.dev, Datahub and Google Data Catalog can help us answer these questions from different perspectives, but lineage at the model level isn’t quite enough for us. We want to go one step further and try to get information about how our columns evolve through our models, and understand which data ends up in the hands of end-users.

In this post we’ll cover how we’re testing a data lineage solution at column level using audit logs and ZetaSQL to help us become Data Cartographers, mapping out the different roads data can take within our warehouse.

🤏 Data workflow at Monzo — quick overview

First we need to explain our data stack. We’ve built our data infrastructure so that it takes as few steps as possible to go from the data model you’re looking at to the raw events (or source of truth) behind it. We want to keep our warehouse simple and accessible for everyone — an easy self-service process is our main priority. If you want a more detailed view of our data stack, check out Luke’s post.

Simplified version of the data journey at Monzo — from the events produced by our microservices, through to their consumption in Looker

We work with events, loads of them, and we define abstractions on top of them which we call data models. These data models are SQL queries which transform the events that we get from our microservices into reusable multipurpose tables that describe products, business processes and entities. We store all of this data, raw and transformed, in BigQuery.

To manage these transformations we use dbt. We rely heavily on it to transform, document and test all the data models that we produce. At Monzo we build data models, not queries, and dbt helps us to make sure our work is repeatable, sustainable and accessible. For context, we use the term data models to describe the output of the transformation process from analytics events or logs to higher level abstractions.

Finally, Looker works as our data visualisation tool and sits at the end of our pipeline, hooked up to the transformed data in BigQuery. Using Looker means our data is easily accessible by everyone, enabling Monzo to sustain a culture of data-driven decision making.

🧟 Our data models are alive

Schema modification of existing data assets, like deprecation and replacement of old data models, is common even in the best-defined and curated data warehouses. One of our best allies in understanding the impact of such changes is dbt — it works as our main data lineage tool, showing us all the roads data can take within our warehouse.

However, with data models at Monzo steadily growing in complexity, we’re starting to get questions that are tricky to answer solely with the documentation and information that we hold in dbt. Questions like:

  • Backend engineer: We’re planning to stop generating some events from our services — do we know who is using them?
  • Data Analyst: We’re changing some definitions in our data models and we’d like to know who is going to be affected. Are any models or dashboards in Looker using them?
  • Analytics Engineer: Which models are no longer being used? Can we archive them? And what about all these windowed columns in this model, are they being used?

🗺️ Mapping the data journey

Following our principles of data self-service, we want to produce the right maps that everyone can consume and analyse — something that sits between all the data we capture and what eventually ends up in the hands of end-users. To help us achieve this, we’re currently experimenting with BigQuery Audit Logs and ZetaSQL.

📒 Using BigQuery Audit Logs to understand how our models are being used

BigQuery stores the audit logs of each and every query run against our data models — whether this query is run from Looker, BigQuery UI, GSheets or Python scripts, we store a bunch of metadata about its execution. Among all this data, one thing stands out: we can get information on which tables and columns were scanned during the execution of a query.

Snippet of the information we can get from Google audit logs.

When a query runs, BigQuery produces one log entry for each table the query needs to access in the warehouse. In the previous example, we see two entries in our audit log, one for each table scanned, and within each entry the different columns that were read. One interesting thing about these logs is that BigQuery prunes away all the columns that weren’t needed to execute the query. In the example above, even though we ran the query with a SELECT * FROM accounts, the query effectively only needed two columns from accounts.

In order to prepare all these logs for consumption, we’ve built a data model with a query very similar to the one below. With this data, we have the basics to understand how our main data models are being used. Now we can start to answer juicy questions like: which models aren’t being used, which columns can be deprecated, who’s been accessing the data models, etc.

Example of the information we can extract from the audit logs
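To give a flavour of what building that model could look like, here’s a minimal sketch of a query over the exported data-access audit logs, assuming they are routed to a BigQuery dataset through a Cloud Logging sink. The project, dataset and table names (and the choice to run it from Python) are illustrative, not our actual setup.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table created by a Cloud Logging sink for BigQuery
# data-access audit logs -- the project/dataset names are made up.
AUDIT_LOGS = "my-project.audit_logs.cloudaudit_googleapis_com_data_access"

query = f"""
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  -- resourceName looks like projects/<p>/datasets/<d>/tables/<t>
  protopayload_auditlog.resourceName AS table_read,
  -- tableDataRead.fields lists the columns BigQuery actually scanned
  JSON_EXTRACT_ARRAY(
    protopayload_auditlog.metadataJson, '$.tableDataRead.fields'
  ) AS columns_read,
  JSON_EXTRACT_SCALAR(
    protopayload_auditlog.metadataJson, '$.tableDataRead.jobName'
  ) AS job_name
FROM `{AUDIT_LOGS}`
WHERE JSON_EXTRACT(protopayload_auditlog.metadataJson, '$.tableDataRead') IS NOT NULL
"""

for row in client.query(query).result():
    print(row.user_email, row.table_read, row.columns_read)
```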

🔗 Column lineage with ZetaSQL

Through the audit logs we can get which tables and columns were scanned by a query, but we don’t get information about any transformations data has gone through — for example, enriching data by combining different columns or changing data through mathematical operators. This can only be extracted with a SQL parser, and we’ve decided to make use of ZetaSQL, an open source tool from Google.

ZetaSQL defines a language (grammar, types, data model, semantics…) as well as a parser and analyser for BigQuery SQL syntax. It is not a database or a query engine — simply put, it allows us to programmatically generate and analyse BigQuery SQL. This gives us the semantics needed to understand how columns are being used and transformed within a query.

Example of what information we can get thanks to column level lineage. This can help us understand how data evolves through models. For example, how the full_name in user_accounts is actually derived from two columns defined in the users model.

Using our compiled models from dbt (i.e. the executable SQL files), we parsed over 4500 models with ZetaSQL, generating a JSON file with a structure similar to the one below. For each data model, we capture which columns were scanned to produce each output column. This can involve transformations like CONCAT, or complex analytical functions where the fields in a PARTITION BY and ORDER BY can influence the generation of a new column. We also capture scanned columns that were used as part of the creation of the data model but were not involved in any transformation to generate a new column. For example, in our queries from the previous image, we used the accounts.type field to only get the Monzo accounts, but this field was never used to generate any new columns — these are recorded as auxiliar_columns. Finally, we store all this information in BigQuery so we can easily access and analyse it.
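As an illustration (not our exact schema), a made-up fragment of that per-model metadata for the user_accounts example above might look something like this:

```python
# Illustrative only: a hypothetical fragment of the lineage metadata we
# capture per model, showing the shape rather than the exact schema.
user_accounts_lineage = {
    "model": "user_accounts",
    "columns": {
        # output column -> columns scanned to produce it
        "account_id": ["accounts.account_id"],
        "full_name": ["users.first_name", "users.last_name"],  # e.g. via CONCAT
    },
    # columns used by the query (filters, joins, PARTITION BY, ...) that
    # don't feed directly into any output column
    "auxiliar_columns": ["accounts.type"],
}
```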

The next step is all about graphs. We used a Python package called networkx to connect all the columns of our models from the metadata that we previously generated, building a graph that we can query for upstream (ancestors) and downstream (descendants) dependencies — not only at model level like dbt does, but also at column level. For example, by running lineage -m accounts+ -c account_id we can now get all the downstream dependencies of the account_id column from accounts. A minimal sketch of the idea, with made-up column names, is shown below.
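This isn’t our actual tooling, just a toy example of how networkx lets us walk the column graph in both directions once the edges have been extracted from the parsed models:

```python
import networkx as nx

# Directed edges of the form (source column, derived column), built from the
# parsed lineage metadata. The node names here are illustrative.
edges = [
    ("users.first_name", "user_accounts.full_name"),
    ("users.last_name", "user_accounts.full_name"),
    ("accounts.account_id", "user_accounts.account_id"),
    ("user_accounts.full_name", "looker.customer_dashboard.full_name"),
]

graph = nx.DiGraph(edges)

# Downstream dependencies (descendants) of a single column...
downstream = nx.descendants(graph, "accounts.account_id")

# ...and upstream dependencies (ancestors) of a Looker field.
upstream = nx.ancestors(graph, "looker.customer_dashboard.full_name")

print(downstream)
print(upstream)
```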

We’re currently using graphviz to render the whole data lineage. The next image shows the output from a case where we wanted to validate the impact of updating a definition in one of our models. The lineage of transformations runs from left to right: on the left we have the original column, and on the right the Looker dashboards potentially using columns derived from it. In between we have all the models, as well as the Looker views and explores, that might have transformed or used the original column or any transformation derived from it.

Each box represents a model and the coloured ellipses within these boxes represent the different stages of the downstream lineage of the columns.
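For illustration, a small subgraph could be rendered along these lines with the graphviz Python package — the model and column names are made up, and this isn’t our actual rendering code:

```python
import graphviz

# Group column nodes into a box (cluster) per model, reading left to right.
dot = graphviz.Digraph("column_lineage", graph_attr={"rankdir": "LR"})

with dot.subgraph(name="cluster_users") as users:
    users.attr(label="users")
    users.node("users.first_name", "first_name", shape="ellipse", style="filled", color="lightblue")
    users.node("users.last_name", "last_name", shape="ellipse", style="filled", color="lightblue")

with dot.subgraph(name="cluster_user_accounts") as ua:
    ua.attr(label="user_accounts")
    ua.node("user_accounts.full_name", "full_name", shape="ellipse", style="filled", color="lightgreen")

dot.edge("users.first_name", "user_accounts.full_name")
dot.edge("users.last_name", "user_accounts.full_name")

dot.render("lineage", format="png")  # writes lineage.png (needs the graphviz binary installed)
```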

🎯 How has this helped us so far?

Even though there is still a lot of work ahead of us, mapping our data at column level has already helped us answer some of the questions we outlined at the beginning of this post. By allowing everyone to produce and analyse the different data maps, we are helping teams understand the impact of their changes across different levels of our warehouse. This means we are now in a position where teams can be more proactive, understanding which models, dashboards and users are going to be affected by a change.

🕵️ Scoping changes across our warehouse

We recently had to rename a few of our definitions to be consistent across all our systems, i.e. we had to rename different columns across our warehouse. By having our warehouse mapped at column level, we were able to easily identify and isolate all the data models and Looker dashboards where we had to make a change — no need to do a regexp search across the whole warehouse and Looker. This has saved our analysts loads of time, helping them be proactive about changes rather than reactive to downstream errors caused by the changes they’ve made.

🧹Automatically detect unused content

Having no gatekeepers for our ETL process has given us the agility to be where we are right now, but this also comes at a price: we’re not a small team anymore, and fully understanding which data assets are really being used and which can be deprecated is not always straightforward.

With this new information at hand, we ran a preliminary analysis and were able to deprecate more than 300 columns across some of our core models whose ownership and maintenance, because of how we’ve grown, is no longer all that clear. Based on these results we want to automate this process, so we can periodically detect which data assets, either columns or models, can be flagged as stale and deprecated from our production warehouse.
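A sketch of what such a periodic check might look like, assuming a hypothetical column_usage model built from the audit logs (the table names and the read_at column are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names: a usage model derived from the audit logs, and the
# dataset holding our transformed models.
USAGE_MODEL = "my-project.monitoring.column_usage"
MODELS_DATASET = "my-project.analytics"

query = f"""
WITH recent_usage AS (
  SELECT DISTINCT table_name, column_name
  FROM `{USAGE_MODEL}`
  WHERE read_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
SELECT c.table_name, c.column_name
FROM `{MODELS_DATASET}`.INFORMATION_SCHEMA.COLUMNS AS c
LEFT JOIN recent_usage AS u
  ON u.table_name = c.table_name AND u.column_name = c.column_name
WHERE u.column_name IS NULL  -- never read in the last 90 days
ORDER BY c.table_name, c.column_name
"""

for row in client.query(query).result():
    print(f"candidate for deprecation: {row.table_name}.{row.column_name}")
```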

🔁 Keeping Looker in sync with our warehouse

Additionally, keeping our Looker views and BigQuery models in sync is not an easy task, especially when the changes we make in a model are not correctly reflected in the connected Looker views. With this new tooling, we can now detect broken links between Looker fields and BigQuery columns. This has helped us delete more than 2000 fields that, unfortunately, were not correctly removed when their direct link in BigQuery was deprecated.
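Conceptually, the check boils down to a set difference between the columns a Looker view references and the columns its BigQuery table actually has. A toy sketch with made-up data:

```python
# All data here is invented, purely to show the shape of the comparison.
looker_view_columns = {
    "user_accounts": {"account_id", "full_name", "signup_channel"},
}
bigquery_table_columns = {
    "user_accounts": {"account_id", "full_name"},
}

for view, referenced in looker_view_columns.items():
    # Columns the Looker view still points at but BigQuery no longer has.
    missing = referenced - bigquery_table_columns.get(view, set())
    for column in sorted(missing):
        print(f"Broken link: {view}.{column} no longer exists in BigQuery")
```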

Through this post we’ve covered some of the tools we’re testing to help us become Data Cartographers. We want to understand how data evolves and how it’s being used throughout our warehouse, to increase its quality and keep our data assets up to date. We are still exploring options and trying different solutions, but we think ZetaSQL and the audit logs have a lot of potential to bring more insights about our warehouse, especially when we need to understand which data ends up in the hands of end-users.

If you’re interested in working for the data team at Monzo, we’re hiring!
