Stories by Jonathan Talmi on Medium

dbt at Super Part 3: Observability

Jonathan Talmi — Tue, 31 Jan 2023 15:45:42 GMT

In previous posts, we wrote about how we orchestrate and run continuous integration pipelines on our dbt project at Super. In our post about orchestration, we explained how the needs of our business required us to invoke dbt in response to specific triggers or events, often within complex data pipelines, and on a schedule. In our post on CI, we laid out the checks and balances in our dbt project repo that ensure all new models are thoroughly tested, documented, and conform to our style guide and data model.

With Coalesce 2022 happening in October 2022, I want to highlight a few projects and talks that touch on these two subjects:

Dylan Hughes from Prefect gave a talk about how to set up event-driven orchestration for dbt using Prefect
Kshitij Aranke from Vouch (now dbt Labs) gave a talk on Data Change Management, which covered topics like Slim CI, linting, blue/green deployment
The Montreal Analytics team took ownership of the dbt-gloss repo, which has a plethora of pre-commit hooks, like test and documentation coverage

Today’s post is the spiritual sequel to the Coalesce 2021 talk by Kevin Chan and Jonathan Talmi, which has a a companion Discourse post here. In that talk, we showed how we gained observability over dbt and the underlying data by logging rich metadata from dbt runs into the warehouse and creating alerts and visualizations for individual model and test failures, cost swings, and overall job performance.

Our dbt alerts originally ran in Looker and tagged Slack user groups linked to business domains

Over the last two years, our observability system evolved substantially to meet the needs of a fast-growing company. This post will cover these changes and attempt to provide an overview of how we think about observability for dbt.

Motivation

Since 2021, Super launched two new verticals — SuperShop and SuperCash — and company headcount grew to support them. The number of people interfacing with the data platform as developers or consumers grew by 3x as a result. With more pipelines, models, and tests across the stack, the number of alerts grew. Important alerts were sometimes drowned out by noisy ones, and engineers frequently cited “alert fatigue” as a pain point. Furthermore, teams began organizing themselves by domain and vertical, which necessitated a reimagination of our domain-based notification system (sending alerts to “growth” or “finance” was no longer enough!). Finally, we saw an opportunity to both address these shortcomings and improve observability with the help of a few new open source projects

Bearing this in mind, we embarked on a project to revamp alerting for the data team. We set up new Slack channels for team and vertical combinations (e.g. data-ecommerce-alerts), and began funneling alerts from across the stack into them. We added new features like individual user tagging, threading for repetitive alerts, and daily SLA reporting for most data pipelines. This post will be about the updates we made to dbt.

Repeat alerts from Airflow DAGs are added in-thread on Slack if they occur on the same day

What is “dbt observability”?

dbt at its core enables data teams to define and manage data resources (tables, views, etc.) and execute observability workloads (tests, source freshness etc.) on top of those resources.

dbt commands like run, build, and test intelligently string together resource management and observability workloads into pipelines, which we call jobs.

There is a distinction between observability for data and observability for dbt. Numerous tools, including dbt itself, improve data observability by helping you validate the expectations you have over the data in your warehouse. dbt tests do this by definition, and numerous open source packages have extended the core test suite to support more complex data testing like anomaly detection. This is beyond the scope of this post, but we recommend “The Four Pillars of Data Observability” by Kevin Hu of Metaplane for a primer on data observability.

Observability for dbt, by contrast, involves monitoring, alerting, and visibility over dbt jobs themselves. Engineers, analysts, and even business stakeholders need to know when data quality tests fail, models break in production, jobs hit bottlenecks, or costs begin to skyrocket.

We break this down into two rough categories: Job outcomes and job performance. Job outcomes are the results of jobs, like failures in tests, models, snapshots, and source freshness checks. Job performance refers to metrics like execution time and cost, at both the job and individual resource levels. This post will show how we used dbt artifacts, warehouse logs, and open-source packages to gain visibility into both.

Job outcomes

The most important concern for dbt observability is ensuring the right people get notified when things go wrong. When the number of models in your dbt project is small, you can live with a single notification per job and a pair of watchful eyes, but as your project grows to thousands of models and tests, it becomes hard for any single individual to supervise. At that point, it may become desirable to get more granular, and trigger an alert for every model or test failure within a job rather than a single one for the whole job. This way, alerts can be distributed to specific subscribers.

Granular alerting can be accomplished by storing the run results and manifest artifacts in your warehouse and querying them using an alerting library or BI tool. The run results artifact includes detailed execution metadata for all models, tests, and snapshots run in a dbt job (source freshness results are only available in the sources artifact, for now). The manifest includes metadata for each resource in your project, like our custom-defined model owners in the meta config:

models:
- name: dim_travel_user
  description: '{{ doc("dim_travel_user") }}'
  meta:
    owner: "@jonathan"
    channel: '{{ var("data_travel") }}'

These artifacts are available through dbt’s Jinja context and can be uploaded to your warehouse using a post-hook at the end of a dbt job. There are a few open source packages like Elementary’s dbt_data_reliability and Brooklyn Data’s dbt_artifacts which have this functionality (we use Elementary).

After we upload artifacts to the warehouse, we use Elementary’s open source python package to send alerts to specific Slack channels tagging model owners. The Elementary alerts contain queries, sample data, and relevant metadata — much richer information than we had before. Here are a few examples:

Source freshness failure alert triggered by Elementary

Test failures alert triggered by Elementary

Performance

It’s also important for the data team to collect data on dbt job executions so that we can visualize performance over time. In our original post, we showcased the visualizations we created in Looker that let us investigate a model’s execution over time and drill-down into specific jobs.

The model execution history helped us identify when models need their warehouses resized, or if we should explore new clustering strategies.

Model execution history

The job execution view helped us identify pipeline bottlenecks in long-running jobs.

Pipeline bottleneck visualizations for hourly and nightly dbt runs

With Elementary, we gained access to a rich report with model and test execution history that we host on a static site in AWS. We use this report to investigate ongoing issues at the model and test level.

Model history view in the Elementary observability report

Cost per model

Back in 2021, we created a fairly primitive cost-per-model metric to help us identify our most expensive dbt models. There is now a much better version available in the open source dbt package dbt-snowflake-monitoring by Select. This package uses the Snowflake query history and warehouse metering history views to construct an accurate picture of cost-per-model and cost-per-workload, which we’ve used to build Looker dashboards like the one below:

Most expensive dbt models using the dbt-snowflake-monitoring package

Conclusion

When you’re running dbt at scale, it is vital to gain observability over your jobs so you can identify and fix data issues before they become large-scale problems. Observability gives your team powerful tools respond to ongoing data quality issues and optimize jobs. We achieved granular observability over dbt using dbt artifacts and open source packages, which allows us drill down to the level of an individual model or test. This fit our needs as a centralized data team at a multi-vertical company with hundreds of models and thousands of tests in a monorepo. Other teams may see more success with different approaches, like a multi-project setup with job-level alerting, which will become substantially easier in the near future. No matter what model you adopt, adding observability to dbt has never been easier, and your team will gain valuable visibility into your project as it grows in complexity.

If you would like to learn more about employement opportunities at Super — check our careers page.

dbt at Super Part 3: Observability was originally published in Super.com on Medium, where people are continuing the conversation by highlighting and responding to this story.

dbt at Snapcommerce Part 2: Continuous Integration

Jonathan Talmi — Wed, 12 Oct 2022 14:10:29 GMT

dbt at Super Part 2: Continuous Integration

Introduction

The first post in this series covered orchestration for dbt projects, including the concepts of proactive and reactive orchestration. I explained our use cases for both orchestration modes and how we do deploy both as efficiently as possible. It’s been exciting to see orchestration become a conversation topic in the data community over the past few months. Just to highlight a few examples:

Dagster introduced Software-Defined Assets and reconciliation-based orchestration, which facilitates the rematerialization of assets like dbt models based on observed changes in the asset’s metadata and upstream landscape
dbt introduced the source_status selection method to dbt Core, which enables users to trigger proactive dbt pipelines based on status of upstream data
Benn Stancil wrote a blog post about how data assets, and their desired freshness, should drive orchestration — and not the other way around

It’s increasingly looking like many of the complexities and redundancies I covered in the first post are being actively worked on and may not be long for this world — which I welcome with open arms!

This post will “shift-left” to the time before the data asset — in this case, a dbt model — gets materialized in production. It will cover Continuous Integration (CI), code linting, and dbt project organization, showcasing the system built at Super (FKA Snapcommerce) to manage the complexity of our dbt project as we scaled to hundreds of models and dozens of users.

What is continuous integration?

If you’re reading this blog post, there’s a good chance you’ve encountered Continuous Integration (CI) before even if you don’t know it. CI encapsulates a set of practices engineering teams use to facilitate collaboration and rapid iteration for software projects. These practices promote frequent contributions to a shared code repository with automated testing in place to detect errors before they enter production.

In the context of dbt projects, many standard elements of CI have already been widely adopted by the dbt community, like maintaining a single source code repository, and automated builds using virtual environments and Docker. CI can also be used to automatically run tests and linting checks whenever dbt code is pushed to the repository and before it gets merged into main.

For example, let’s say your company wants to start incorporating revenue from commissions into its net revenue calculation. You make a change in the mart_financial_metricsdbt model, which shows up on Github like this:

- sale_price - supplier_cost                        AS net_revenue,
+ sale_price - supplier_cost + (0.05 * sales_price) AS net_revenue,

Your code change is reviewed but you don’t have a CI process set up, so you merge it in after approval. The new codebase gets picked up automatically by an orchestrator like dbt Cloud, which starts running your updated model. Suddenly, you see alerts firing in the #dbt-alerts channel. You check the logs, and see the mart_financial_metrics model failing due to a compilation error. You double-check the code and discover that you accidentally wrote sales_price instead of sale_price when adding commission revenue, leading to the error. You quickly submit a fix.

This scenario, and the associated stress, can be prevented with a relatively simple CI process that runs and/or tests the changed dbt model. Other small misconfigurations, like a YAML input error, can be detected even earlier by simply compiling the dbt project. Finally, more advanced tests, tools, and techniques can be used to assess regressions which are harder to detect, like downstream impacts of a relatively innocuous change in business

Continuous Integration for dbt projects

What are we trying to do in CI for dbt projects? The answer is not as obvious as you might think. I’ve seen a multitude of approaches across to running and testing dbt code before it enters production across the industry. Here are a few examples:

Some simply compile their dbt project, and if successful, deploy it to a staging environment for manual testing and QA.
Some build and test modified models into a temporary environment using an approach called slim CI.
Some (like the Shopify data team), don’t test data in CI, but test the model logic alone through unit testing.

For this post, I will focus on the dbt jobs, python scripts, and data warehouse configurations we use to accomplish our version of a robust CI process. I will first talk about testing, and cover the different types of tests we use in our dbt project (data tests and unit tests) and the ones we enforce on every model. I’ll discuss linting, and how to enforce a SQL style guide in your repo. Finally, I’ll discuss project level checks which ensure our data model and project structure kept intact with every new addition.

Compile

Even the most simplistic CI process must fulfill this basic requirement: Ensure the dbt project compiles! Projects fail to compile when there is misconfigured YAML or Jinja, like references to a model, source, or macro that doesn’t exist. If your project fails to compile, it won’t run in your orchestrator, so it’s imperative you ensure compilation before any new code gets merged in. This can be checked using:

dbt ls > /dev/null      # option 1
dbt --warn-error parse  # option 2

The warn-error config is optional, and will raise an error anytime dbt would normally warn. Warnings are raised in when there are YAML configs for nodes that doesn’t exist, so including this flag helps keep your dbt project nice and tidy .

Compilation errors are surfaced during run, test, or build,but we run the above commands early in our CI pipelines to fail fast when there is a simple compilation issue. Specifically we run dbt —-warn-error parsein our lint stage, and the following in our test stage:

dbt --warn-error ls --select state:modified+ --resource-type model

This helps developers quickly identify the models that will be built in this CI pipeline in case they forgot to include new models in their commits.

Run and test dbt models

A central tenet of continuous integration is tests should run in a clone of the production environment. This ensures all possible bugs that could occur in production are uncovered and fixed in the test environment. Luckily for us, cloud data warehouses make it easy to replicate production into an infinite number of CI pipelines, and dbt makes it even easier with Slim CI.

The most important part of your CI process after checking compilation is building and testing your new and modified models in a test environment in your warehouse. Building models ensure that whatever SQL you’ve added or augmented can compile and is free of syntax errors. Testing these models ensures that your new and modified models all satisfy the data quality or business logic constraints you’ve set for them. We run the following dbt job in CI:

dbt build -s state:modified+ --full-refresh --defer

This command builds and tests all modified seeds, models and snapshots in order of their position in the dbt DAG, which has the nice feature of failing the pipeline early when there is a test failure.

When a model has a lot of children, CI pipelines can take a long time. We use a few techniques to speed up development pipelines, but some organizations limit the scope of the state:modified+ selector to only include immediate children.

Test coverage

Testing for dbt projects can be roughly bucketed into two categories:

Data testing: Tests on top of dbt models built using production data, typically investigating data quality like uniqueness, nullability, and consistency.
Unit testing: Tests evaluating model logic by running mock input data through the model and comparing it against expected output.

Both of these types of tests leverage dbt’s built-in dbt testing framework, but only data testing is natively supported by dbt-core, for now.

Over the last year or so, there has been an explosion in unit testing frameworks for dbt, signalling the desire of the community to adopt a more conventional and robust approach to testing dbt models. At Super, we adopted unit testing using the dbt-datamocktool package by Michael Irvine when our models became complex enough to render manual testing tedious and error-prone.

To ensure the models in our dbt project are well-tested, we require each model to possess at least one unique test and one not null test, but developers are encouraged to add more tests, including unit tests, to fit their needs. We enforce this using a custom python script, but can also be accomplished using the dbt-project-evaluator package, which will be covered later.

Model configs

When making updates to lots of models, or sometimes even full directories, it can be difficult to ensure that all configs remain the same. We run a small python script to print out the diffs in configs for changed models, which makes it easy to identify when materializations, tags, or other configs have been changed errantly:

Code linting

Linting for software projects is a process that uses static program analysis to flag stylistic errors, improper conventions, and bugs. For dbt projects, it’s used to ensure your SQL (and now python) code follows a consistent style guide. We use SQLFluff and a custom ruleset to enforce common conventions across our SQL codebase (and in the future, we may adopt something like black for Python models). We execute SQLfluff on new models only, using the following command:

diff-quality --violations sqlfluff --compare-branch=main --fail-under=100

Documentation and YAML

CI processes can also be used to apply dbt project standards as defined by the project owners. At Super, we value robust documentation, so we use CI to ensure that all new models, snapshots, sources, and columns are documented

All new models, sources, and snapshots must have the following:

Node description
Column descriptions
Unique and not null tests (models and snapshots only)
Freshness condition (sources only)
Owner (slack handle)
Channel (slack channel)

The last two are added as meta fields and are used by our observability system to distribute alerts:

meta:
    owner: "@jonathan"
    channel: "ops-alerts

Project and DAG linting

Many organizations that have implemented dbt without strongly enforcing their project structure have come to know the pain of the “spaghetti DAG”:

Image taken from dbt Labs’ “Data modeling techniques for more modularity”

In the spaghetti DAG, models are layered on top of each other without consideration or consolidation, leading to extremely wide DAGs with incoherent model relationships. Not only is this bad for analytics engineers’ sanity, but it’s bad for performance and consistency, as the same logic is duplicated across the project leading to unnecessary compute and business logic drift. A great primer on how to adopt a modular approach to your dbt project can be found here and here.

CI can play an important role in preventing this spaghetti DAG from forming in the first place. Rules can be defined and applied in CI to block violations from making their way into the codebase, keeping project structure intact. For example, it’s best practice to create a single staging model for every source — a script can see if this relationship holds by parsing the manifest associated with the proposed code change in CI.

At Super, we run a few relationship checks on every commit to make sure the pull request conforms to our project standards. These include:

1:1 relationship of source to stg model
fct/dim models only depend on stg models
mart models only depend on fct/dim models

While a few early projects attempted to create tools for this, dbt Labs’ recently released dbt-project-evaluator is the most advanced open source package yet, with a plethora of rules to cover even complex use cases like rejoins and model fanouts.

Auditing and validation

The above checks help ensure that code is clean, readable, well-tested and well-documented before it gets merged in, but the developer still needs to validate the raw business logic they are adding or modifying. QA can be a tedious process especially when an existing model is being changed. Unit tests can be added to affected models, but they are fairly cumbersome to write and don’t capture downstream implications!

A thread recently cropped up on Twitter detailing this exact problem, which led to a bunch of interesting responses:

pedram growth hacker on Twitter: "What checks do you do after building/changing a model? Apart from your standard dbt PK / not null tests, how do you validate what you built? We cant all just be manually going through the table and scrolling like hmm...yes...looks right to me..can we? / Twitter"

What checks do you do after building/changing a model? Apart from your standard dbt PK / not null tests, how do you validate what you built? We cant all just be manually going through the table and scrolling like hmm...yes...looks right to me..can we?

Many repliers talked about auditing the results with varying levels of sophistication. Some check primary keys and row counts, while Sarah Krasnik wrote:

Any column that’s changed, I do before/after stats if it’s numeric like mean/median/max/min and make sure the difference is warranted. If it’s a string but has <10 unique values, do counts for all the values and again make sure the change is warranted. Document all in the PR!

At Super, we use a paid tool called Datafold to automate much of the QA process. Datafold runs a detailed diff for every dbt model (and child model) touched in a pull request. It appends a merge comment to the PR with a preview of the full results, like the one seen below:

Datafold makes it easy for developers to investigate the final output of their models and metrics, and can be a powerful tool for change management since it compares new versions of models with the existing production ones. However, if you are looking for free or open-source options, the dbt audit helper package is a great alternative. This package contains macros which perform diffs between models and tables, and it can be run manually or hacked into CI using some creativity.

Lineage

One of the most important aspects of change management for dbt is managing all the downstream implications outside of dbt. When someone makes a pull request to our dbt repo, they should be able to easily identify all the downstream assets that might be affected, like dashboards, visualizations, and reverse ETL syncs. Exposures were introduced into dbt for this exact reason, but creating and maintaining them requires a significant amount of manual work.

At Super, we don’t have an automated solution for this. We maintain a list of repos and services developers should check so they don’t break anything with a dbt model update. There are signs vendors are beginning to take note, as shown by the recent partnership between Datafold and Hightouch, which surfaces downstream Hightouch syncs in Datafold’s pull request comments — but we would welcome an open source solution to this problem too.

Conclusion

In this post, we broke down our CI process for dbt projects. In order to keep our codebase clean and bug-free, we run and test all new or modified models, enforce test and documentation coverage, lint the SQL code, check that the new code is consistent with our project structure, and finally surface data quality checks and diffs to facilitate QA.

If you want to join the Super team — check out our careers page.

dbt at Snapcommerce Part 2: Continuous Integration was originally published in Super.com on Medium, where people are continuing the conversation by highlighting and responding to this story.

dbt at Super Part 1: Orchestration

Jonathan Talmi — Wed, 20 Apr 2022 19:05:47 GMT

Introduction

dbt has been an integral part of the data platform at Super since August 2018. In that time, roughly 500 models, 2500 tests, and 200 sources have been added to our dbt project, mostly by the data team, although folks outside the data team, like software engineers and technical account managers, routinely contribute as well. The goal of the data platform team, which maintains dbt as a service, is to make it as easy as possible to develop and deploy dbt models that get refreshed at the desired frequency.

Orchestration for dbt

Data orchestration is an “automated process that takes data from multiple storage locations and programmatically allows you to author, schedule, and monitor data pipelines.” In the context of dbt, orchestration refers to the system or process that automatically builds, snapshots, and tests dbt objects like sources and models. Orchestration for dbt typically occurs in one of two ways: proactive and reactive (thanks to Benn Stancil for inspiring this terminology).

Proactive orchestration

Proactive dbt jobs are scheduled at fixed intervals, like once an hour or every night at 2:00 AM. You will use proactive jobs when you want models and snapshots to update at predictable times regardless of the state of the upstream data. Tools like dbt Cloud and Airflow facilitate the creation of proactive dbt jobs by allowing users to define schedules in UI or in code.

Proactive orchestration is simple and effective. Most models do not need to be run more than once per day, and those that do (to include same-day data) can be included in hourly (or every X hour) runs. Occasionally, however, you may want to build models outside of a fixed schedule, which leads one to reactive orchestration.

Reactive orchestration

Reactive dbt orchestration refers to when dbt jobs are triggered immediately after new data gets loaded into the warehouse. This may be required when a dbt job is part of a larger data pipeline, or when stakeholders want to consume the latest data available.

Let’s look at an example. Let’s say there is a marketing report from an external partner that gets updated once per day. The report gets copied from source and loaded into a staging area in the data warehouse in some ELT process. dbt then plays a vital role in transforming this marketing report into something usable before it gets consumed by business stakeholders or ML models. Timeliness is important, as both stakeholders and algorithms both perform best with the latest data available.

In the above case, it is desirable to run downstream dbt models as soon as data gets loaded into the warehouse. The dbt job is part of a larger ELT(L) pipeline where data is ingested, transformed, and then immediately activated at the end of the pipeline (e.g. using reverse ETL). Proactive orchestration, in this case, isn’t very appealing. The marketing report dbt models could be part of the hourly dbt job, but that would necessitate building them 24 times a day even though new data only arrives once. Furthermore, schedules would have to be synced across tools (e.g. the data loader, reverse ETL process, etc.) to minimize the lag between receiving data and activating it.

Reactive orchestration is possible using both dbt Core and dbt Cloud. In both cases, EL pipelines can be paired with dbt jobs using orchestration tools like Airflow, Prefect, or Dagster, through API calls to dbt Cloud, and in the future, through an API call directly to a self-hosted dbt server. Fivetran’s Transformations support reactive orchestration natively. Fivetran lets users define key models to keep fresh, and these models are rebuilt (and optionally tested) whenever an upstream data source loads new data (although this i limited to Fivetran-enabled data sources).

In sum, there are two main modes of dbt orchestration:

Proactive: Run a set of models at a fixed interval, like once an hour, e.g. :

dbt run -s tag:hourly

Reactive: Run a set of models after a data source is updated, e.g.:

dbt run -s source:mysource.table+

Hybrid approaches

Many organizations combine proactive and reactive modes of orchestration, and increasingly, dbt Core and dbt Cloud are adding new features to support hybrid orchestration.

After initially posting this article, Austin Lin Gibbons replied on the dbt slack community with a use case for combining proactive and reactive jobs for specific models. Organizations may want their daily and intraday dbt jobs to be reactive to include the latest data available, while running proactive full refreshes on a weekly or monthly cadence to account for changing historical data. Austin used the example of GDPR compliance, which may require companies to purge data from derivative tables at least once every 30 days. Frequent proactive full refreshes of incremental models also address the problem of late-arriving facts or dimensions.

Furthermore, dbt Core is adding a new feature in 1.1 which allows users to create proactive jobs that are intrinsically reactive. Using state comparison and a brand new selector, dbt will only run models downstream of fresh data, rather than running models repeatedly using stale data:

dbt source freshness
dbt build --select source_status:fresher+ --state path/to/prod/artifacts

This is supported natively in dbt Cloud, which automatically handles state comparison and artifacts storage.

Early dbt orchestration at Super

In the early days of dbt at Super, we did not require advanced orchestration, so all our models, tests, and snapshots were run using hourly or nightly dbt jobs. We used dbt tags to delineate hourly and nightly models, and we ran checks in CI to ensure every model has only one of these tags. We set up jobs in dbt Cloud like the following:

Let’s look at growth analytics for Snaptravel, our hotels product, as an example. At any given point in the company’s recent history, we’ve used 5–10 marketing channels to drive customer acquisition. Granular data on hotel and itinerary-level marketing performance is made available to us by our partners daily at different times depending on the channel. Originally, we had 10–15 pipelines defined in Airflow that fetched data, copied it into S3 and loaded it into our data warehouse, Snowflake. These pipelines looked something like this:

Before dbt, this data was combined with our internal transactional data directly in the BI layer to calculate important metrics like ROAS. After dbt, these reports were modelled and consolidated into an aggregated fact table. Even though each model only needed to be updated once per day, we included them in our hourly dbt runs since the exact timing was often unpredictable — reports could come in any time between 4–7 AM, for example — and the size of the data at the time was small enough to not substantially elongate our dbt runs.

Most organizations, like Super in the early days, can probably survive using proactive orchestration and small set of scheduled jobs like hourly and nightly runs. However, as the use cases for data pile up and the data itself grows in size and surface area, it’s almost inevitable that you will have to consider alternative approaches.

Adding reactive orchestration

As head count at Super grew, our growth analytics function became more sophisticated. The team’s responsibilities grew to include:

Reporting and analytics. Tracking ROAS at the channel level on a daily basis. There was significant drop-off for some of our properties — let’s find out why!
Automated bidding. Updating bids and bid multipliers for channels using the latest available performance data. We’re getting outbid on long-term stays (which are high AOV) in New York because our CPCs are too low — let’s raise them!
Dynamic pricing: Raising or lowering prices based on our price competitiveness compared to other OTAs. Our price for weekend stays at the Bellagio in Las Vegas is 5% higher than competitors — let’s lower it!

All three require combining multiple data sources together and defining advanced metrics dbt. Timeliness became particularly important especially in the case where data finds its way back into the product through automated bidding or dynamic pricing.

You can probably tell where this is going. Reactive dbt orchestration was a natural fit for activating marketing data as soon as it arrived. Triggering dbt jobs inside the data pipelines that manage extraction, loading, and reverse ETL, meant that new bids, prices, and reports could be published immediately upon ingesting new data. A data pipeline might look something like this:

The dbt models selected in steps 3 and 4 are the downstream models from that source:

source:mysource.table+

We also exclude models tagged hourly or nightly to avoid race conditions or data issues, such as in the case where incremental models are not properly defined. We added the “external” deployment tag for models exclusively managed through reactive dbt jobs. We then limited node selection to the intersection between the source’s children and the external tag, like so:

dbt run -s source:mysource.table+,tag:external

Now, analytics engineers and developers have a few orchestration options to choose from when adding new dbt models. The simplest option is using proactive orchestration, where they tag their models hourly or nightly, depending on how important it is to look at same-day data. For more complex use cases, they use reactive orchestration, by tagging models as external and adding dbt tasks directly into an Airflow DAG, like so:

dbt_run = KubernetesPodOperator(
    **super_defaults,
    **dbt_defaults,
    task_id=’dbt_run_task’,
    name=”dbt-run-task”,
    arguments=[“dbt run -s source:mysource+,tag:external”]
    )

Conclusion

In summary, Super uses both proactive and reactive orchestration in its dbt deployment. We used proactive orchestration exclusively at first since all our requirements could be satisfied by hourly and nightly dbt jobs. As the data grew in size and surface area, and operationalizing data became an important use case, we introduced reactive orchestration which permitted timely refreshes of dbt models before loading the results to new destinations. Contributors to our data platform now have several options for orchestrating dbt, and we ensure there is no overlap between proactive and reactive dbt models by checking for tag collisions in CI.

If you want to read more — check out and follow our Super publication here on Medium.

dbt at Super Part 1: Orchestration was originally published in Super.com on Medium, where people are continuing the conversation by highlighting and responding to this story.