dbt at Super Part 3: Observability

Jonathan Talmi
Super.com
Jan 31, 2023 · 7 min read

In previous posts, we wrote about how we orchestrate and run continuous integration pipelines on our dbt project at Super. In our post about orchestration, we explained how the needs of our business required us to invoke dbt in response to specific triggers or events, often within complex data pipelines, and on a schedule. In our post on CI, we laid out the checks and balances in our dbt project repo that ensure every new model is thoroughly tested, documented, and consistent with our style guide and data model.

Coalesce 2022 took place this past October, and I want to highlight a few projects and talks from it that touch on these two subjects:

  • Dylan Hughes from Prefect gave a talk about how to set up event-driven orchestration for dbt using Prefect
  • Kshitij Aranke from Vouch (now dbt Labs) gave a talk on Data Change Management, which covered topics like Slim CI, linting, and blue/green deployment
  • The Montreal Analytics team took ownership of the dbt-gloss repo, which offers a plethora of pre-commit hooks for checks like test and documentation coverage

Today’s post is the spiritual sequel to the Coalesce 2021 talk by Kevin Chan and Jonathan Talmi, which has a companion Discourse post here. In that talk, we showed how we gained observability over dbt and the underlying data by logging rich metadata from dbt runs into the warehouse and creating alerts and visualizations for individual model and test failures, cost swings, and overall job performance.

Our dbt alerts originally ran in Looker and tagged Slack user groups linked to business domains

Over the last two years, our observability system evolved substantially to meet the needs of a fast-growing company. This post will cover these changes and attempt to provide an overview of how we think about observability for dbt.

Motivation

Since 2021, Super has launched two new verticals — SuperShop and SuperCash — and company headcount has grown to support them. The number of people interfacing with the data platform as developers or consumers grew by 3x as a result. With more pipelines, models, and tests across the stack, the number of alerts grew. Important alerts were sometimes drowned out by noisy ones, and engineers frequently cited “alert fatigue” as a pain point. Furthermore, teams began organizing themselves by domain and vertical, which necessitated a reimagining of our domain-based notification system (sending alerts to “growth” or “finance” was no longer enough!). Finally, we saw an opportunity to both address these shortcomings and improve observability with the help of a few new open source projects.

Bearing this in mind, we embarked on a project to revamp alerting for the data team. We set up new Slack channels for team and vertical combinations (e.g. data-ecommerce-alerts), and began funneling alerts from across the stack into them. We added new features like individual user tagging, threading for repetitive alerts, and daily SLA reporting for most data pipelines. The rest of this post covers the updates we made to dbt alerting and observability.

Repeat alerts from Airflow DAGs are added in-thread on Slack if they occur on the same day

What is “dbt observability”?

dbt at its core enables data teams to define and manage data resources (tables, views, etc.) and execute observability workloads (tests, source freshness, etc.) on top of those resources.

dbt commands like run, build, and test intelligently string together resource management and observability workloads into pipelines, which we call jobs.

There is a distinction between observability for data and observability for dbt. Numerous tools, including dbt itself, improve data observability by helping you validate the expectations you have over the data in your warehouse. dbt tests do this by definition, and numerous open source packages have extended the core test suite to support more complex data testing like anomaly detection. This is beyond the scope of this post, but we recommend “The Four Pillars of Data Observability” by Kevin Hu of Metaplane for a primer on data observability.
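
As a quick illustration of the data side, a dbt data test is just SQL that encodes an expectation and fails if any rows violate it. Here is a minimal, hypothetical singular test (the model and column names are made up for illustration):

-- tests/assert_no_negative_order_totals.sql
-- Singular dbt test: the test fails if this query returns any rows
select
    order_id,
    order_total
from {{ ref('fct_orders') }}
where order_total < 0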

Observability for dbt, by contrast, involves monitoring, alerting, and visibility over dbt jobs themselves. Engineers, analysts, and even business stakeholders need to know when data quality tests fail, models break in production, jobs hit bottlenecks, or costs begin to skyrocket.

We break this down into two rough categories: Job outcomes and job performance. Job outcomes are the results of jobs, like failures in tests, models, snapshots, and source freshness checks. Job performance refers to metrics like execution time and cost, at both the job and individual resource levels. This post will show how we used dbt artifacts, warehouse logs, and open-source packages to gain visibility into both.

Job outcomes

The most important concern for dbt observability is ensuring the right people get notified when things go wrong. When the number of models in your dbt project is small, you can live with a single notification per job and a pair of watchful eyes, but as your project grows to thousands of models and tests, it becomes hard for any single individual to supervise. At that point, it may become desirable to get more granular, and trigger an alert for every model or test failure within a job rather than a single one for the whole job. This way, alerts can be distributed to specific subscribers.

Granular alerting can be accomplished by storing the run results and manifest artifacts in your warehouse and querying them using an alerting library or BI tool. The run results artifact includes detailed execution metadata for all models, tests, and snapshots run in a dbt job (source freshness results are only available in the sources artifact, for now). The manifest includes metadata for each resource in your project, like our custom-defined model owners in the meta config:

models:
  - name: dim_travel_user
    description: '{{ doc("dim_travel_user") }}'
    meta:
      owner: "@jonathan"
      channel: '{{ var("data_travel") }}'

These artifacts are available through dbt’s Jinja context and can be uploaded to your warehouse using a post-hook at the end of a dbt job. There are a few open source packages like Elementary’s dbt_data_reliability and Brooklyn Data’s dbt_artifacts which have this functionality (we use Elementary).
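
Once the artifacts are in the warehouse, routing a failure to its owner is essentially a join between run results and model metadata. The sketch below is illustrative rather than any package’s exact schema: the table and column names (dbt_run_results, dbt_models, owner, and so on) are assumptions, so check your package’s documentation before relying on them.

-- Sketch: recent model failures joined to their owners
-- Table and column names are illustrative, not a specific package's schema
select
    rr.unique_id,
    rr.status,
    rr.message,
    m.owner
from dbt_run_results as rr
join dbt_models as m
    on rr.unique_id = m.unique_id
where rr.status in ('error', 'fail')
    and rr.generated_at >= dateadd('day', -1, current_timestamp)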

After we upload artifacts to the warehouse, we use Elementary’s open source Python package to send alerts to specific Slack channels, tagging model owners. The Elementary alerts contain queries, sample data, and relevant metadata — much richer information than we had before. Here are a few examples:

Source freshness failure alert triggered by Elementary
Test failures alert triggered by Elementary

Job performance

It’s also important for the data team to collect data on dbt job executions so that we can visualize performance over time. In our original post, we showcased the visualizations we created in Looker that let us investigate a model’s execution over time and drill down into specific jobs.

The model execution history helped us identify when models need their warehouses resized, or if we should explore new clustering strategies.

Model execution history
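
A chart like this can be driven by a simple aggregation over the uploaded run results. The sketch below uses illustrative table and column names (dbt_run_results, execution_time, generated_at); the exact schema depends on the package you use.

-- Sketch: daily average execution time per model over the last 30 days,
-- useful for spotting models that are gradually slowing down
select
    unique_id,
    date_trunc('day', generated_at) as run_date,
    avg(execution_time) as avg_execution_seconds,
    count(*) as num_runs
from dbt_run_results
where generated_at >= dateadd('day', -30, current_timestamp)
group by 1, 2
order by 1, 2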

The job execution view helped us identify pipeline bottlenecks in long-running jobs.

Pipeline bottleneck visualizations for hourly and nightly dbt runs

With Elementary, we gained access to a rich report with model and test execution history that we host on a static site in AWS. We use this report to investigate ongoing issues at the model and test level.

Model history view in the Elementary observability report

Cost per model

Back in 2021, we created a fairly primitive cost-per-model metric to help us identify our most expensive dbt models. There is now a much better version available in the open source dbt package dbt-snowflake-monitoring by Select. This package uses the Snowflake query history and warehouse metering history views to construct an accurate picture of cost-per-model and cost-per-workload, which we’ve used to build Looker dashboards like the one below:

Most expensive dbt models using the dbt-snowflake-monitoring package
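
A dashboard like the one above boils down to a short query over the package’s models. The sketch below assumes dbt-snowflake-monitoring exposes a dbt_queries model with per-query cost attribution; verify the exact model and column names against the package’s documentation before using it.

-- Sketch: ten most expensive dbt models over the last 30 days
-- Assumes a dbt_queries model with dbt_node_id and query_cost columns
select
    dbt_node_id,
    sum(query_cost) as cost_last_30_days
from dbt_queries
where start_time > dateadd('day', -30, current_date)
group by 1
order by cost_last_30_days desc
limit 10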

Conclusion

When you’re running dbt at scale, it is vital to gain observability over your jobs so you can identify and fix data issues before they become large-scale problems. Observability gives your team powerful tools to respond to ongoing data quality issues and optimize jobs. We achieved granular observability over dbt using dbt artifacts and open source packages, which allows us to drill down to the level of an individual model or test. This fit our needs as a centralized data team at a multi-vertical company with hundreds of models and thousands of tests in a monorepo. Other teams may see more success with different approaches, like a multi-project setup with job-level alerting, which will become substantially easier in the near future. No matter what model you adopt, adding observability to dbt has never been easier, and your team will gain valuable visibility into your project as it grows in complexity.

If you would like to learn more about employment opportunities at Super — check out our careers page.
