
Declarative scheduling and the dagster-dbt library

Edson Nogueira
Indicium Engineering
Jun 28, 2024

--

dbt has reshaped data transformation, bringing software engineering and DataOps best practices, such as version control, documentation, and data lineage, to analytics workflows.

By integrating dbt with Dagster, you can supercharge these features, getting a single pane of glass for the whole data platform.

This integration, combined with auto-materialization policies, enables cost-optimized workflows by avoiding unnecessary dbt runs.

In this article, we will delve into the unique features that make dbt a standout tool within the Modern Data Stack (MDS), how its integration with Dagster extends these benefits to the ingestion pipelines, and the broader implications of this integration for data platforms.

The “D” in MDS

The Modern Data Stack methodology emerged from the need to balance the I.T. and business sides of data initiatives. In summary, the main benefits of this approach are:

  1. Focus on business outcomes.
  2. Best practices from software engineering.
  3. Scalability by leveraging cloud technologies.

Notice that, so far, we did not mention any specific technology. This is because we can use any combination of technologies and call it a “Modern Data Stack”, as long as the above outcomes are satisfied.

As discussed in a previous article in this series, the open-source ingestion landscape has seen a plethora of technologies that are well suited to this purpose, such as Airbyte, Meltano, and dlt.

However, when it comes to transformation, dbt has established itself as the standard tool, given that it neatly encapsulates the three key benefits sought with the MDS approach, even though tough competitors such as SQLMesh are emerging.

Given this context, a natural question to ask is: what specific features of dbt make it so unique among the MDS tools?

Keeping the benefits above in mind, we certainly have to look for the features that deliver business outcomes on top of I.T. best practices. In particular, the data lineage feature helps achieve meaningful results in an increasingly important area: data governance.

With the Dagster-dbt integration, one can extend this feature to the ingestion pipelines as well, enabling end-to-end observability for the whole data platform, as we discussed in our initial article of this series.

Image showing dagster-dbt integration in ingestion pipelines.

In what follows, we will discuss in more detail the practical steps to achieve these goals, building upon the EL structure we left off with in our public repo in the previous article.

dbt + Dagster

We will set up a simple dbt project based on the Adventure Works data we worked with previously in our ingestion pipelines. For a thorough discussion of what a dbt project is and the main concepts behind it (e.g. models, tests), we recommend consulting the official getting started guide.

Now, we will share our thoughts on how to structure the dbt code location.

  • Code location structure

We organize our code location module as follows:

Image showing dbt code location module.
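As a rough sketch, the layout could look something like this (the package name is an assumption; the file names follow the bullets below):

```
dbt_code_location/
├── __init__.py      # Definitions object for the code location
├── resource.py      # dbt project directory and dbt CLI resource
├── translator.py    # custom dagster-dbt translator
├── assets.py        # dbt models exposed as Dagster assets
└── automation.py    # auto-materialization policies
```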
  • resource.py

This file defines the key links to the dbt project, such as the dbt project directory and a dbt CLI resource for running the project.

Image showing code for dbt project dir and dbt cli.
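A minimal sketch of what resource.py could contain, assuming the dbt project lives in a sibling dbt_project/ directory (the path and variable names are illustrative, not taken from the repo):

```python
# resource.py
# A minimal sketch: the dbt project directory and a dbt CLI resource.
# The relative path "dbt_project" and the variable names are assumptions.
from pathlib import Path

from dagster_dbt import DbtCliResource

# Directory containing dbt_project.yml (assumed to sit next to this package).
DBT_PROJECT_DIR = Path(__file__).joinpath("..", "..", "dbt_project").resolve()

# Path to the manifest produced by `dbt parse` / `dbt compile`,
# used later to generate the Dagster assets.
DBT_MANIFEST_PATH = DBT_PROJECT_DIR.joinpath("target", "manifest.json")

# dbt CLI resource that Dagster uses to invoke dbt commands against the project.
dbt_resource = DbtCliResource(project_dir=str(DBT_PROJECT_DIR))
```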
  • translator.py

This is where the mapping between dbt project resources (e.g. models, tests) and Dagster assets is defined. The integration already ships with a pre-built translator, but the ability to define a custom translator gives you greater flexibility for your specific needs.

Image illustrating code for custom dagster-dbt translator.
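A minimal sketch of a custom translator, assuming we want to prefix the dbt asset keys and group the models in the UI (the prefix and group name are illustrative):

```python
# translator.py
# A minimal sketch of a custom dagster-dbt translator. The asset key prefix
# and group name below are assumptions for illustration.
from typing import Any, Mapping, Optional

from dagster import AssetKey
from dagster_dbt import DagsterDbtTranslator


class CustomDagsterDbtTranslator(DagsterDbtTranslator):
    def get_asset_key(self, dbt_resource_props: Mapping[str, Any]) -> AssetKey:
        # Prefix every dbt asset key so it lines up with the naming
        # convention used by the ingestion (EL) assets.
        return super().get_asset_key(dbt_resource_props).with_prefix("adventure_works")

    def get_group_name(self, dbt_resource_props: Mapping[str, Any]) -> Optional[str]:
        # Show all dbt-derived assets under a single group in the Dagster UI.
        return "dbt_models"
```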
  • assets.py

This is where the magic of the integration happens: we use the translator and the multi-asset decorator provided by the integration to actually generate the Dagster assets.

Image showing code of integration that generates dagster assets.
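A minimal sketch of assets.py, using the names assumed in the previous sketches:

```python
# assets.py
# A minimal sketch: generate one Dagster asset per dbt model using the
# @dbt_assets multi-asset decorator and the custom translator.
from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

from .resource import DBT_MANIFEST_PATH
from .translator import CustomDagsterDbtTranslator


@dbt_assets(
    manifest=DBT_MANIFEST_PATH,
    dagster_dbt_translator=CustomDagsterDbtTranslator(),
)
def adventure_works_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream the events back to Dagster, which records
    # materializations for the models and results for the tests.
    yield from dbt.cli(["build"], context=context).stream()
```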
  • automation.py

In this case, we leverage the Dagster declarative scheduling capabilities to define auto-materialization policies for our dbt models.

Image showing code of dagster declarative scheduling for auto-materialization for dbt models.
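A minimal sketch of automation.py; attaching the policy through the translator hook is one possible wiring and an assumption here, the actual repo may do it differently:

```python
# automation.py
# A minimal sketch of declarative scheduling for the dbt models via an
# auto-materialization policy.
from typing import Any, Mapping, Optional

from dagster import AutoMaterializePolicy

from .translator import CustomDagsterDbtTranslator

# Materialize a dbt model whenever its upstream dependencies (e.g. the
# ingested raw tables) have a newer materialization. The auto-materialization
# daemon must be enabled in the deployment for the policy to be evaluated.
eager_policy = AutoMaterializePolicy.eager()


class AutoMaterializingDbtTranslator(CustomDagsterDbtTranslator):
    def get_auto_materialize_policy(
        self, dbt_resource_props: Mapping[str, Any]
    ) -> Optional[AutoMaterializePolicy]:
        # Attach the eager policy to every asset generated from the dbt project.
        return eager_policy
```

With this wiring, assets.py would pass AutoMaterializingDbtTranslator() to the @dbt_assets decorator so the policy takes effect.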
  • __init__.py

Finally, we create our Definitions object with the elements discussed so far.

Image showing dagster-dbt code of definitions object.
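A minimal sketch of __init__.py, again using the names assumed above:

```python
# __init__.py
# A minimal sketch of the code location's Definitions object.
from dagster import Definitions

from .assets import adventure_works_dbt_assets
from .resource import dbt_resource

defs = Definitions(
    assets=[adventure_works_dbt_assets],
    resources={"dbt": dbt_resource},
)
```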

Cost-optimization with declarative scheduling

If you compare the automation.py scripts for the EL pipeline and for the dbt models, you will notice that, whereas we used a traditional schedule for the former, the latter makes no mention of a cron expression whatsoever.

Instead, what was defined was a declarative schedule: we define the conditions under which we want the asset materializations to happen and let Dagster monitor and trigger them when those conditions are met.
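To make the contrast concrete, here is a rough sketch of both styles (the job name, asset group, and cron expression are illustrative assumptions):

```python
# Traditional, cron-based scheduling (EL pipeline): the job runs at fixed
# times, whether or not there is new data to process.
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

el_job = define_asset_job(
    "el_job", selection=AssetSelection.groups("raw_adventure_works")
)
el_schedule = ScheduleDefinition(job=el_job, cron_schedule="0 */6 * * *")

# Declarative scheduling (dbt models): no cron expression at all. We only
# declare a condition, e.g. AutoMaterializePolicy.eager(), and Dagster
# triggers a materialization whenever that condition is met.
```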

This is another example of how Dagster can take the MDS to the next level with its data-aware approach.

In particular, this example of declarative scheduling solves a common problem in orchestrating data pipelines: the dbt models have dependencies between themselves and the ingested raw data in the warehouse.

The declarative scheduling approach, configured in this case using Auto-Materialization Policies (AMPs), can ensure that a dbt model runs when, and only when, its dependencies have been updated. This brings two highly desirable characteristics to the data platform:

  • We do not need to worry about a fixed schedule running the dbt models unnecessarily when there is no new data to process.
  • We ensure that, as soon as there is new data to process, the dbt models will be executed, minimizing the delay between ingestion and serving of new data.

Given the potential for improving cost optimization and data availability, which are cornerstone factors in the design of data platforms, Dagster’s declarative scheduling can be a deciding factor when choosing it as an orchestration tool. Many new features in this area, improving upon the current AMP mechanisms, are expected to become available soon.

Conclusions

Although many technologies can be assembled to implement an MDS methodology for a given data project, it is hard to imagine a stack that does not have dbt as its transformation tool, since it embodies the core principles of the methodology by design.

When used alongside Dagster, dbt’s native data governance capabilities are extended to the whole data platform, bringing data observability and a data catalog for free simply by getting the platform up and running.

We also showed how Dagster’s declarative scheduling features can strengthen a stakeholder-favorite pillar of data projects: cost optimization.

In particular, an appropriate setup ensures that data is processed as soon as it becomes available and that no redundant warehouse compute costs from unnecessary dbt runs show up on the billing report.

What is next?

In this third article of the Dagster Power User series, we discussed the transformation stage of the data platform with dbt and how to leverage Dagster features to strengthen data governance and cost-optimization capabilities.

After we have our business logic properly implemented in the warehouse, it is showtime (a.k.a. deploy time)!

That is what we will explore in the next article: how to deploy what we have built and locally tested so far to the cloud.

Specifically, we will deploy a Dagster instance and our code locations to Amazon ECS using a simple Terraform template.

Stay tuned!

--

Edson Nogueira
Indicium Engineering

I am a Mid-Level Data Engineer @ Indicium and Ph.D. in Physics.