dbt at Super Part 1: Orchestration

Jonathan Talmi
Super.com
Published in Super
8 min read · Apr 20, 2022

Introduction

dbt has been an integral part of the data platform at Super since August 2018. In that time, roughly 500 models, 2500 tests, and 200 sources have been added to our dbt project, mostly by the data team, although folks outside the data team, like software engineers and technical account managers, routinely contribute as well. The goal of the data platform team, which maintains dbt as a service, is to make it as easy as possible to develop and deploy dbt models that get refreshed at the desired frequency.

Orchestration for dbt

Data orchestration is an “automated process that takes data from multiple storage locations and programmatically allows you to author, schedule, and monitor data pipelines.” In the context of dbt, orchestration refers to the system or process that automatically builds, snapshots, and tests dbt objects like sources and models. Orchestration for dbt typically occurs in one of two modes: proactive or reactive (thanks to Benn Stancil for inspiring this terminology).

Proactive orchestration

Proactive dbt jobs are scheduled at fixed intervals, like once an hour or every night at 2:00 AM. You will use proactive jobs when you want models and snapshots to update at predictable times regardless of the state of the upstream data. Tools like dbt Cloud and Airflow facilitate the creation of proactive dbt jobs by allowing users to define schedules in a UI or in code.
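For instance, a minimal proactive job in Airflow is just a scheduled task that shells out to dbt. A sketch, with illustrative DAG and task names:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Proactive orchestration: rebuild the hourly-tagged models on a fixed
# schedule, regardless of whether new data has actually landed upstream.
with DAG(
    dag_id="dbt_hourly",  # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    dbt_run_hourly = BashOperator(
        task_id="dbt_run_hourly",
        bash_command="dbt run -s tag:hourly",
    )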

Proactive orchestration is simple and effective. Most models do not need to be run more than once per day, and those that do (to include same-day data) can be included in hourly (or every X hours) runs. Occasionally, however, you may want to build models outside of a fixed schedule, which leads us to reactive orchestration.

Reactive orchestration

In reactive dbt orchestration, dbt jobs are triggered immediately after new data is loaded into the warehouse. This may be required when a dbt job is part of a larger data pipeline, or when stakeholders want to consume the latest data available.

Let’s look at an example. Say there is a marketing report from an external partner that gets updated once per day. The report gets copied from the source and loaded into a staging area in the data warehouse as part of an ELT process. dbt then plays a vital role in transforming this marketing report into something usable before it gets consumed by business stakeholders or ML models. Timeliness is important, as stakeholders and algorithms both perform best with the latest data available.

In the above case, it is desirable to run downstream dbt models as soon as data gets loaded into the warehouse. The dbt job is part of a larger ELT(L) pipeline where data is ingested, transformed, and then immediately activated at the end of the pipeline (e.g. using reverse ETL). Proactive orchestration, in this case, isn’t very appealing. The marketing report dbt models could be part of the hourly dbt job, but that would necessitate building them 24 times a day even though new data only arrives once. Furthermore, schedules would have to be synced across tools (e.g. the data loader, reverse ETL process, etc.) to minimize the lag between receiving data and activating it.

Reactive orchestration is possible using both dbt Core and dbt Cloud. In both cases, EL pipelines can be paired with dbt jobs using orchestration tools like Airflow, Prefect, or Dagster, through API calls to dbt Cloud, and in the future, through an API call directly to a self-hosted dbt server. Fivetran’s Transformations support reactive orchestration natively: users define key models to keep fresh, and these models are rebuilt (and optionally tested) whenever an upstream data source loads new data (although this is limited to Fivetran-enabled data sources).
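For example, an EL pipeline can trigger a dbt Cloud job on completion through the dbt Cloud API’s trigger-run endpoint. A minimal sketch, where the account and job IDs are placeholders and the API token is read from the environment:

import os

import requests

ACCOUNT_ID = 12345  # placeholder dbt Cloud account ID
JOB_ID = 67890  # placeholder dbt Cloud job ID

# Reactive orchestration: kick off a dbt Cloud job as soon as the loader
# finishes, rather than waiting for the next scheduled run.
response = requests.post(
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers={"Authorization": f"Token {os.environ['DBT_CLOUD_API_TOKEN']}"},
    json={"cause": "Triggered by EL pipeline after source load"},
)
response.raise_for_status()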

In sum, there are two main modes of dbt orchestration:

Proactive: Run a set of models at a fixed interval, like once an hour, e.g.:

dbt run -s tag:hourly

Reactive: Run a set of models after a data source is updated, e.g.:

dbt run -s source:mysource.table+

Hybrid approaches

Many organizations combine proactive and reactive modes of orchestration, and increasingly, dbt Core and dbt Cloud are adding new features to support hybrid orchestration.

After this article was initially posted, Austin Lin Gibbons replied in the dbt Slack community with a use case for combining proactive and reactive jobs for specific models. Organizations may want their daily and intraday dbt jobs to be reactive, to include the latest data available, while running proactive full refreshes on a weekly or monthly cadence to account for changing historical data. Austin used the example of GDPR compliance, which may require companies to purge data from derivative tables at least once every 30 days. Frequent proactive full refreshes of incremental models also address the problem of late-arriving facts or dimensions.
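A proactive weekly job along these lines might rebuild all incremental models from scratch, e.g.:

dbt run --full-refresh -s config.materialized:incremental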

Furthermore, dbt Core is adding a new feature in 1.1 which allows users to create proactive jobs that are intrinsically reactive. Using state comparison and a brand new selector, dbt will only run models downstream of fresh data, rather than running models repeatedly using stale data:

dbt source freshness
dbt build --select source_status:fresher+ --state path/to/prod/artifacts

This is supported natively in dbt Cloud, which automatically handles state comparison and artifacts storage.

Early dbt orchestration at Super

In the early days of dbt at Super, we did not require advanced orchestration, so all our models, tests, and snapshots were run using hourly or nightly dbt jobs. We used dbt tags to delineate hourly and nightly models, and we ran checks in CI to ensure every model had exactly one of these tags. We then set up corresponding hourly and nightly jobs in dbt Cloud.

Let’s look at growth analytics for Snaptravel, our hotels product, as an example. At any given point in the company’s recent history, we’ve used 5–10 marketing channels to drive customer acquisition. Granular data on hotel- and itinerary-level marketing performance is made available to us by our partners daily, at different times depending on the channel. Originally, we had 10–15 pipelines defined in Airflow that fetched data, copied it into S3, and loaded it into our data warehouse, Snowflake. These pipelines looked something like the sketch below (task names and callables are illustrative):
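from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_report():
    """Hypothetical task: download the partner's marketing report."""

def copy_to_s3():
    """Hypothetical task: stage the report file in S3."""

def load_to_snowflake():
    """Hypothetical task: COPY the staged file into Snowflake."""

with DAG(
    dag_id="marketing_report_el",  # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_report", python_callable=fetch_report)
    stage = PythonOperator(task_id="copy_to_s3", python_callable=copy_to_s3)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)

    fetch >> stage >> load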

Before dbt, this data was combined with our internal transactional data directly in the BI layer to calculate important metrics like ROAS. After dbt, these reports were modelled and consolidated into an aggregated fact table. Even though each model only needed to be updated once per day, we included them in our hourly dbt runs, since the exact timing was often unpredictable (reports could arrive any time between 4 and 7 AM, for example) and the data at the time was small enough not to substantially elongate our dbt runs.

Most organizations, like Super in the early days, can probably survive using proactive orchestration and a small set of scheduled jobs like hourly and nightly runs. However, as the use cases for data pile up and the data itself grows in size and surface area, it’s almost inevitable that you will have to consider alternative approaches.

Adding reactive orchestration

As head count at Super grew, our growth analytics function became more sophisticated. The team’s responsibilities grew to include:

  • Reporting and analytics: Tracking ROAS at the channel level on a daily basis. There was significant drop-off for some of our properties — let’s find out why!
  • Automated bidding: Updating bids and bid multipliers for channels using the latest available performance data. We’re getting outbid on long-term stays (which are high AOV) in New York because our CPCs are too low — let’s raise them!
  • Dynamic pricing: Raising or lowering prices based on our price competitiveness compared to other OTAs. Our price for weekend stays at the Bellagio in Las Vegas is 5% higher than competitors — let’s lower it!

All three require combining multiple data sources and defining advanced metrics in dbt. Timeliness became especially important where data finds its way back into the product through automated bidding or dynamic pricing.

You can probably tell where this is going. Reactive dbt orchestration was a natural fit for activating marketing data as soon as it arrived. Triggering dbt jobs inside the data pipelines that manage extraction, loading, and reverse ETL meant that new bids, prices, and reports could be published immediately upon ingesting new data. A data pipeline might look something like this:
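  1. Extract the marketing report from the partner.
  2. Load it into the warehouse.
  3. Run the downstream dbt models.
  4. Test those models.
  5. Activate the results via reverse ETL.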

The dbt models selected in steps 3 and 4 are the downstream models from that source:

source:mysource.table+

We added the “external” deployment tag for models exclusively managed through reactive dbt jobs, and we excluded models tagged hourly or nightly to avoid race conditions or data issues, such as in the case where incremental models are not properly defined. We then limited node selection to the intersection of the source’s children and the external tag, like so:

dbt run -s source:mysource.table+,tag:external

Now, analytics engineers and developers have a few orchestration options to choose from when adding new dbt models. The simplest option is using proactive orchestration, where they tag their models hourly or nightly, depending on how important it is to look at same-day data. For more complex use cases, they use reactive orchestration, by tagging models as external and adding dbt tasks directly into an Airflow DAG, like so:

dbt_run = KubernetesPodOperator(
    **super_defaults,  # company-wide pod defaults (defined elsewhere)
    **dbt_defaults,  # dbt-specific defaults, e.g. image and env (defined elsewhere)
    task_id="dbt_run_task",
    name="dbt-run-task",
    arguments=["dbt run -s source:mysource+,tag:external"],
)

Conclusion

In summary, Super uses both proactive and reactive orchestration in its dbt deployment. We used proactive orchestration exclusively at first since all our requirements could be satisfied by hourly and nightly dbt jobs. As the data grew in size and surface area, and operationalizing data became an important use case, we introduced reactive orchestration which permitted timely refreshes of dbt models before loading the results to new destinations. Contributors to our data platform now have several options for orchestrating dbt, and we ensure there is no overlap between proactive and reactive dbt models by checking for tag collisions in CI.

If you want to read more, check out and follow our Super publication here on Medium.
