Testing & Monitoring the Data Platform at Scale

Jacob Holland
Checkout.com-techblog
6 min read · Sep 22, 2022

Running a data platform at scale for an organisation like Checkout.com can be full of challenges, not least of which is how to be confident that all the data you’ve carefully ingested, modelled & delivered adheres to the standards stakeholders rightfully demand. The answer, dear reader, is two-fold: testing & monitoring.

To understand the testing component of Checkout.com’s data platform, you might benefit from reading a previous article on how data is modelled (please see here). TL;DR: we use Airflow to run various workflows & most of these DAGs make use of dbt commands (run & test being quite common) to model & present data to stakeholders.

dbt is a data transformation tool used heavily at Checkout.com. One of its standout advantages is the built-in testing framework, which allows assertions to be run over data in a highly customizable way while remaining very simple to read & understand.

A dbt run command (as the name suggests) actually executes your models & stages the data for consumption. But how can you be sure that the results of those model executions are correct? The answer lies in the second common dbt command issued: test.

Like almost everything in dbt, tests are SQL queries. In particular, they are select statements that seek to grab “failing” records, ones that disprove your assertion.
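To make that concrete, a minimal, illustrative singular test might look like the following (the model & column names are placeholders, not our actual schema); dbt treats any rows the query returns as failures:

```sql
-- tests/assert_no_negative_amounts.sql (illustrative singular test)
-- dbt marks the test as failed if this query returns any rows
select
    payment_id,
    amount
from {{ ref('fct_payments') }}
where amount < 0
```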

Being able to run these assertions across your models with each subsequent DAG run guarantees that the state of your production data models adheres to the standards you’ve defined. But what about standards you can’t define in neat SQL statements? What about edge cases around not-yet-seen scenarios? The answer is in a third-party tool called Monte Carlo.

Testing & Alerting with dbt & Datadog

First up, what are dbt tests, and how are they defined? They can be simple SQL statements (as mentioned above) or defined in YAML, and they’re invoked with a simple dbt test command. For the latest & greatest on dbt testing, I’d recommend reading the dbt documentation on the matter.

The dbt test command is invoked in our Airflow DAGs on a regular interval, providing a heartbeat for our various data models and confidence that data quality remains at an acceptable level.

Figure 1: Example of a simple dbt test checking a table for duplicates using a primary key
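The figure itself isn’t reproduced here, but a duplicate check of that kind is typically expressed as generic unique & not_null tests on the key column, along these lines (model & column names are placeholders):

```yaml
# models/schema.yml (illustrative)
version: 2

models:
  - name: fct_payments
    columns:
      - name: payment_id          # the table's primary key
        tests:
          - unique                # fails if any duplicate keys exist
          - not_null
```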

We have also monitored for freshness in the past using regular Airflow invocations of the dbt source freshness command, which reports when data was last loaded into key tables. This gives us additional confidence not only that data is of acceptable quality, but also that our data pipelines are working as expected & no silent errors are impacting throughput.

Figure 2: Example of a freshness test defined in a .yaml
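Again, the original figure isn’t reproduced here, but a source freshness block of that shape generally looks something like this (source, table & column names are placeholders):

```yaml
# models/sources.yml (illustrative)
version: 2

sources:
  - name: payments
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_transactions
```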

The next question becomes: what do we do when an assertion is breached? It helps to understand the structure of a DAG that makes use of dbt tests (please see below for an example).

Figure 3: Example of a DAG with dbt test commands being issued

Each of these blocks represents a task within a DAG, and the result of each task is sent to our monitoring tool of choice: Datadog.
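As a rough sketch of the pattern (not our production definition; the DAG, task & path names are made up), dbt run & dbt test can be wired into an Airflow DAG as simple Bash tasks:

```python
# illustrative Airflow 2.x DAG; names & paths are placeholders
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="payments_models",
    schedule_interval="@hourly",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/payments",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt/payments",
    )

    # assertions only run once the models have been rebuilt
    dbt_run >> dbt_test
```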

While Airflow emits a set of standard metrics that are sent to Datadog, we opted to include additional metrics (see epoch8 for more details) for more fine-grained, task-level reporting (please note, some of the benefits of these additional metrics may be redundant since the recent release of Airflow v2.3). These metrics report on the state of your data in a highly extensible & repeatable manner.
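As a minimal sketch of the idea (this illustrates the pattern rather than the epoch8 integration itself, and the metric name is hypothetical), a task-level metric can be pushed to Datadog via DogStatsD from an Airflow failure callback:

```python
# minimal sketch: push a task-level metric to Datadog via DogStatsD
from datadog import statsd


def report_task_failure(context):
    """Airflow on_failure_callback: tag the failing DAG & task."""
    ti = context["task_instance"]
    statsd.increment(
        "data_platform.airflow.task_failed",   # hypothetical metric name
        tags=[f"dag_id:{ti.dag_id}", f"task_id:{ti.task_id}"],
    )
```

A callback like this can be attached to every task via a DAG’s default_args, so each failure lands in Datadog with enough tags to route it.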

This data is pushed to Datadog as often as the DAGs run (via the Datadog agent that accompanies our Airflow worker in a sidecar container). Once these metrics land in Datadog, we set up monitors & alerts that direct traffic to the appropriate endpoints for action by stakeholders. For example, we send non-critical test failures to non-critical Slack channels for eventual action.

Figure 4: Example of a Datadog monitor reporting results to Slack
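The monitor in the figure isn’t reproduced here, but such a monitor boils down to a query over the custom metric plus a message that @-mentions the relevant channel, roughly along these lines (metric, DAG & channel names are the made-up ones from the sketches above):

```
type:    metric alert
query:   sum(last_1h):sum:data_platform.airflow.task_failed{dag_id:payments_models}.as_count() > 0
message: dbt tests failed for payments_models @slack-data-platform-noncritical
```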

Testing & Observability with Monte Carlo

Monte Carlo is a tool that uses machine learning to understand “normal” conditions for a pipeline. “Normal” can be quite difficult to define, as two perfectly healthy pipelines might deliver data at vastly different cadences, in different formats, and so on.

Unlike dbt tests, Monte Carlo can raise alerts based on behaviour it measures independently of anything defined by users. And whereas our dbt tests are paired with Airflow & Datadog to run assertions on a schedule & report the results to the relevant channels, Monte Carlo pairs with PagerDuty to monitor that “normal” state (derived entirely independently of individual input) & alert users to issues.

But how does this work? Well, let’s take a quick tour of the tools at our disposal within Monte Carlo. As with dbt, you can define assertions as SQL statements (known as SQL Rule Breaches) to proactively define a “normal” state for your pipeline. In addition, however, there are event-driven alerts that run outside of any specific user-defined rules:

Figure 5: Table of Monte Carlo Monitors
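Taking the SQL rules first: much like a dbt test, a rule is essentially a query that surfaces breaching rows. A minimal, illustrative sketch (table & column names are placeholders):

```sql
-- illustrative Monte Carlo SQL rule: raise a breach if any of today's
-- rows arrive without a merchant_id
select *
from analytics.payments
where merchant_id is null
  and loaded_at >= current_date
```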

For more details on these types of alerts, I’d refer the reader to the proper documentation here. But with all of these alerts in our toolkit, we can be sure that upstream problems are raised & sent to the appropriate alert channel, whether that’s a pipeline drying up completely, a potentially damaging change propagating downstream, or a change that pipeline architects can design for.

Figure 6: Example of a Monte Carlo volume alert

These alerts can be directed to multiple endpoints, but for critical visibility events we direct them to PagerDuty & Slack.

For our most critical & time-sensitive pipelines, alerts sent to PagerDuty can wake the appropriate engineer at any time of day or night to be actioned. Slack alerts are sent to the appropriate Slack channel to be investigated during normal working hours & actioned accordingly.

Figure 7: Example of a Slack alert received via PagerDuty in response to a Monte Carlo freshness alert

With dbt running in tandem with our pipeline transformations & the flexibility offered by Monte Carlo, we have the tools to respond to scenarios we’ve thought of in advance as well as scenarios we hadn’t considered. We can also gain additional insight into the behaviour of upstream services when they introduce “trickle-down” changes to our domain. With the appropriate processes guarding the data platform, we can be sure that the data we ingest adheres to our quality, freshness & latency standards, and that our stakeholders can analyse the data we provide with confidence that it is correct & timely.

We hope that you’ve enjoyed this jaunt through the world of data at Checkout. Seeing this high-level implementation of data observability might leave the more curious audience wanting more detail. Well, watch this space for the next release from the Data Platform team for an exciting in-depth exploration of data observability with Monte Carlo.
