Automating DBT + Airflow
Deploying DBT Models at Hootsuite 🚀
As Hootsuite’s data platform team, we are always looking for ways to better support our analytics teams that are creating and maintaining data models. This is why we adopted DBT, an industry standard data modeling tool.
As a team that uses Airflow to orchestrate over 300 data models every day, we ran into a few challenges around how to automate dependency management and deployment at scale. While migrating to DBT, our main goal was to automate as much of the DBT deployment process as possible, in order to help our analytics teams #GoFastBeAgile.
DBT Automated Deployment Requirements
- Run DBT models in groups — We can’t run all of our DBT models at one time, as we have too many models & dependencies and it would take too long.
- Automatically find dependencies between groups of DBT models — DBT can already handle the order of running models within a group, but we want to handle dependencies across groups.
- Automatically find dependencies on non-DBT Airflow DAGs — If our Salesforce DAG in Airflow updates a table used in a DBT model, ensure that the DBT model runs after the Salesforce DAG completes.
- Convert groups of DBT models into Airflow DAGs — We orchestrate everything in Airflow, including data models.
Why Split up DBT models into Groups?
With over 300 data models, we wanted to split up our DBT models into groups to run in separate Airflow DAGs for a few reasons:
- Avoid a single point of failure — If there is an error that causes one model to fail, we want to stop the relevant downstream models & data pipelines from executing, while not impacting unrelated downstream models and pipelines.
- Avoid the overhead of running models individually — While running all models at once is problematic, running models individually is unnecessarily verbose. And maintaining an extra 300 Airflow DAGs sounds like a headache.
- Different inbound data pipelines finish at different times — We want to get fresh data out as quickly as possible and not have models waiting on unrelated data pipelines to finish.
Building the Manifest Parser
In order to automate deployment in our CI/CD pipelines, we built a tool to parse the DBT manifest.
The DBT Manifest
DBT has a manifest file (manifest.json) that is a “single file that contains a full representation of your DBT project’s resources (models, tests, macros, etc.), including all node configurations and resource properties”. This file can be parsed to get all the information from a DBT project and can be generated at build time.
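As a concrete sketch, the manifest can be generated with dbt compile (or dbt build) and then loaded from the project’s target directory. The helper name and default path below are illustrative, not the exact ones we use:

```python
import json
from pathlib import Path


def load_manifest(target_dir: str = "target") -> dict:
    """Load the manifest that dbt writes out after `dbt compile` / `dbt build`.

    Assumes the default output location, <project>/target/manifest.json.
    """
    manifest_path = Path(target_dir) / "manifest.json"
    with manifest_path.open() as f:
        return json.load(f)
```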
High Level Overview
At a high level, the logic for the manifest parser looks like the following:
manifest = load_manifest()
groups = manifest_to_groups(manifest)
groups = add_cross_group_dependencies(groups)
groups = add_external_dependencies(groups)
write_groups_to_airflow_dags(groups)
Let’s take a look at how we tackled each of the requirements we had for deploying DBT models and how they fit into the manifest parser.
Grouping DBT Models
We grouped DBT models by directory. Models in the same directory are run together. This has the additional benefit of following the same structure as our legacy data modeling repo. There are other possible ways to group DBT models, but grouping by directory worked best for us.
models/
staging/
salesforce/ # These models run together
stg_campaigns.sql
stg_leads.sql
zendesk/ # These models run together
stg_tickets.sql
stg_users.sql
Each DBT model is a node in the manifest. By iterating through nodes, we can group models by path.
groups = defaultdict(list)
for _, node in manifest['nodes'].items():
    if node['resource_type'] in ['model', 'snapshot']:
        directory_group = Path(node['path']).parent
        groups[directory_group].append(node)
Additionally, we can find the upstream and downstream tasks of a model using the parent_map and child_map in the manifest.
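Putting the two pieces together, here is a self-contained sketch of grouping and dependency lookup against a toy manifest (the node names and paths are made up for illustration):

```python
from collections import defaultdict
from pathlib import Path

# A toy manifest containing only the fields the parser relies on.
manifest = {
    "nodes": {
        "model.proj.stg_leads": {
            "resource_type": "model",
            "path": "staging/salesforce/stg_leads.sql",
        },
        "model.proj.stg_tickets": {
            "resource_type": "model",
            "path": "staging/zendesk/stg_tickets.sql",
        },
        "test.proj.not_null_stg_leads": {
            "resource_type": "test",
            "path": "staging/salesforce/stg_leads.sql",
        },
    },
    "parent_map": {
        "model.proj.stg_tickets": ["model.proj.stg_leads"],
    },
}

# Group models (and snapshots) by parent directory; tests are skipped.
groups = defaultdict(list)
for name, node in manifest["nodes"].items():
    if node["resource_type"] in ("model", "snapshot"):
        groups[Path(node["path"]).parent].append(name)

# parent_map gives each node's upstream nodes directly.
upstream = manifest["parent_map"]["model.proj.stg_tickets"]
```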
Dependencies Between Groups of DBT models
Now that we have the DBT models partitioned into groups, we can iterate through and add dependencies between groups.
for obj in group:
    for parent in manifest['parent_map'][obj.dbt_node_name]:
        if obj.directory_group != parent.directory_group:
            add_cross_group_dep(obj, parent)
This ensures that groups of DBT models run in the correct order. This will eventually be converted into an ExternalTaskSensor in Airflow.
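One way to sketch this step is as a pure function over the manifest’s parent_map: given an assignment of nodes to groups, collect every parent/child edge whose endpoints live in different groups. The function and data below are illustrative, not our exact implementation:

```python
def cross_group_deps(parent_map: dict, node_to_group: dict) -> set:
    """Return (upstream_group, downstream_group) pairs that span groups."""
    deps = set()
    for node, parents in parent_map.items():
        for parent in parents:
            # Only consider nodes we grouped (models/snapshots).
            if node in node_to_group and parent in node_to_group:
                if node_to_group[node] != node_to_group[parent]:
                    deps.add((node_to_group[parent], node_to_group[node]))
    return deps


# Hypothetical example: a mart model depends on a staging model
# that lives in a different directory group.
node_to_group = {
    "model.proj.stg_leads": "staging/salesforce",
    "model.proj.fct_leads": "marts/sales",
}
parent_map = {"model.proj.fct_leads": ["model.proj.stg_leads"]}
```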
Dependencies on Non-DBT Airflow DAGs
Before implementing DBT, one of the biggest pain points for our team was manually connecting the Airflow DAGs that ran data models to their upstream dependencies (i.e. Inbound data pipelines). This was time consuming and error prone.
With DBT, there is only one manual step in the process. When a new inbound data pipeline is created, the tables outside of DBT (i.e. sources) are mapped to the Airflow DAGs that update those tables.
We did this by adding a meta argument to the source. What’s great about meta arguments is that they show up in the manifest in the source_meta field for a node.
- name: salesforce
  tables:
    - name: campaigns
    - name: leads
      meta:
        airflow_dags: ["salesforce_pipeline"] # This is the dag_id in Airflow
The salesforce_pipeline Airflow DAG must complete before a DBT model that references {{ source("salesforce", "leads") }} can run. An ExternalTaskSensor is used to do this.
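At parse time, these external dependencies can be collected by walking the manifest’s sources and reading the metadata field. Exact field locations vary across dbt versions, so treat the shape below as an assumption; the source names and DAG IDs are made up:

```python
# Toy slice of a manifest's sources section (illustrative field layout).
manifest = {
    "sources": {
        "source.proj.salesforce.leads": {
            "meta": {"airflow_dags": ["salesforce_pipeline"]},
        },
        "source.proj.zendesk.tickets": {
            "meta": {},  # no external DAG mapped yet
        },
    }
}

# Collect only the sources that declare upstream Airflow DAGs.
external_dags = {
    name: src["meta"]["airflow_dags"]
    for name, src in manifest["sources"].items()
    if src.get("meta", {}).get("airflow_dags")
}
```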
This has an additional benefit of making migration easier, as legacy data model DAGs in Airflow could be treated like inbound data pipelines and this setup ensures that models are run in the correct order.
Converting DBT models into Airflow DAGs
We took advantage of dag-factory, a library for writing Airflow DAGs as YAML. We used a base template to fill in fields that should be the same across all DAGs (e.g. callback methods, timeout duration). After mapping some fields, we can write the dictionary to YAML.
DBT_test_dag:
  default_args:
    on_failure_callback: cb.failure
    on_success_callback: cb.success
  tasks:
    dbt_node:
      command: run --select test_model other_model --target dev
      dependencies:
        - upstream_example_dag
      operator: operators.DBTOperator
    upstream_example_dag:
      operator: operators.RequireUpstream
      upstream_dag_id: example_dag
The YAML files are created at build time and deployed automatically through CI/CD.
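The template-merging step can be sketched as a small function that combines a shared base with per-group fields into a dag-factory-style dictionary; in practice the result is serialized to YAML, but the dict is the interesting part. The operator names and callbacks below echo the example above and are illustrative:

```python
# Shared fields applied to every generated DAG (illustrative values).
BASE_TEMPLATE = {
    "default_args": {
        "on_failure_callback": "cb.failure",
        "on_success_callback": "cb.success",
    },
}


def build_dag_config(dag_id: str, models: list, upstream_dags: list) -> dict:
    """Assemble a dag-factory-style config dict for one group of models."""
    tasks = {
        "dbt_node": {
            "operator": "operators.DBTOperator",
            "command": f"run --select {' '.join(models)} --target dev",
            "dependencies": [f"wait_for_{d}" for d in upstream_dags],
        }
    }
    # One sensor-style task per upstream Airflow DAG.
    for dag in upstream_dags:
        tasks[f"wait_for_{dag}"] = {
            "operator": "operators.RequireUpstream",
            "upstream_dag_id": dag,
        }
    return {dag_id: {**BASE_TEMPLATE, "tasks": tasks}}
```

From here, a call like build_dag_config("DBT_test_dag", ["test_model", "other_model"], ["example_dag"]) yields a dict that can be dumped to YAML for dag-factory to load.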
We run DBT core with AWS Fargate, but this could also trigger DBT cloud jobs instead.
Development Velocity Wins
By automating dependency management between data models, we had some major wins:
- 12x Velocity Increase — Data modeling PRs would be open for an average of 9 days in our legacy system. This dropped down to 18 hours with DBT!
- 50% Time Investment Saved — By automating the deployment of data models, we saved the equivalent of half the time we invested in creating this system.
Careers at Hootsuite
If you enjoy solving data problems, or want to work at a data-driven company, check out our openings at careers.hootsuite.com.