Dagster + dbt: a New Era in the Modern Data Stack

Edson Nogueira
Indicium Engineering

--

It does not matter how good your data platform is if you don’t manage to convince stakeholders of its value.

With Dagster + dbt, both technical and business people can get a clear picture, each from their own perspective, of the value that the data platform at the heart of the Modern Data Stack (MDS) brings to the table.

What is Value?

It is about knowing how to use the potential of the Modern Data Stack precisely in order to make it tangible for the interested parties.

Keep reading and you will understand…

Generating business value with the Modern Data Stack

We hear all the time that the goal of any data team is to generate business value with data, for example by using the Modern Data Stack. Yet it is not always clear whether people share a common notion of value in these contexts, especially when there is no direct link between what is being built and an immediate monetary return.

For the purposes of the present discussion, let's use the following definition, commonly employed by Indicium's data products team:

Image: the definition of value used by Indicium's data products team, value = perceived benefit / perceived effort.

The word perceived here is key: it emphasizes that value is a subjective concept, one that depends on the perspective of the person assessing it.

It is precisely by failing to recognize that the technical and business perspectives on a data platform's value often disagree that many data initiatives fail: they never gain the stakeholder buy-in and resources required to be fully developed.

A necessary paradigm shift for the Modern Data Stack

It is worth mentioning that the trend of data engineering teams needing to move up the value chain was already laid out in the 2022 book Fundamentals of Data Engineering, by Joe Reis and Matt Housley.

In practical settings, however, what we usually see are data platforms with healthy pipelines, good documentation, and solid processes, which help decrease the denominator of the value expression, but without an easy-to-digest, readily accessible view of what the pipelines are actually producing, which leaves the numerator practically unaltered.

With Dagster, we shift our mental model of data pipelines from tasks to be executed to assets to be materialized.

Using the precise definition from the Dagster docs, an asset is an object in persistent storage, such as a table, file, or persisted machine learning model.
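
To make that shift concrete, here is a minimal sketch of what asset-based pipeline code looks like (the asset names and toy data are illustrative assumptions, not code from an actual project):

```python
import pandas as pd
from dagster import Definitions, asset


@asset
def raw_orders() -> pd.DataFrame:
    # Stand-in for an ingestion step; in a real pipeline this could be
    # produced by Sling or dlt instead of an inline DataFrame.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 40.0]})


@asset
def orders_summary(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Dagster infers the dependency on raw_orders from the parameter name,
    # so the lineage graph comes for free in the UI.
    return pd.DataFrame({"total_amount": [raw_orders["amount"].sum()]})


defs = Definitions(assets=[raw_orders, orders_summary])
```

Pointing `dagster dev` at a module containing these definitions serves the UI locally, with both assets, their lineage, and their materialization history visible.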

This might not seem like a big deal at first, but the design of the tool and the observability features that follow from this asset-first philosophy dramatically increase the perceived-benefit numerator of the value equation.

Therefore, applying that in your Modern Data Stack can be a game-changer. Keep reading to understand more.

Perceived benefits

Remember that killer dbt lineage feature?

Can you imagine that feature extended across your entire data pipeline, all the way from the ingested tables to the serving layer?

Beyond that, imagine that, alongside the lineage graph, we also had details about each asset's last execution and easy access to run logs, all centralized in the UI of a single tool.

With Dagster we get all of that, plus what is essentially a data catalog for free, simply by spinning up an instance.

For instance, the following example uses the dagster-embedded-elt library, which provides native integrations with Sling and dlt, together with dbt, to produce exactly this kind of comprehensive lineage and materialization information (a code sketch follows the list):

  • EL assets constructed using Sling:
Image: EL assets constructed using Sling.
  • dbt assets constructed using the dbt integration:
Image: dbt assets constructed using the dbt integration.
  • Global asset lineage, which can be expanded as needed with the use of asset groups:
Image: global asset lineage expanded with asset groups.
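
As a sketch of how those assets could be wired together (the connection names, paths, and replication config below are assumptions for illustration, not the actual project configuration):

```python
from pathlib import Path

from dagster import AssetExecutionContext, Definitions, EnvVar
from dagster_dbt import DbtCliResource, dbt_assets
from dagster_embedded_elt.sling import (
    SlingConnectionResource,
    SlingResource,
    sling_assets,
)

# Assumed project layout and connection names; adjust to your environment.
DBT_PROJECT_DIR = Path("analytics_dbt")
REPLICATION_CONFIG = {
    "source": "MY_POSTGRES",
    "target": "MY_WAREHOUSE",
    "defaults": {"mode": "full-refresh", "object": "raw.{stream_table}"},
    "streams": {"public.orders": None, "public.customers": None},
}


@sling_assets(replication_config=REPLICATION_CONFIG)
def el_assets(context: AssetExecutionContext, sling: SlingResource):
    # Each replicated stream surfaces as its own asset in the lineage graph.
    yield from sling.replicate(context=context)


@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Every dbt model in the manifest becomes an asset; when the dbt source
    # keys line up with the EL asset keys, Dagster stitches ingestion and
    # transformation into one global lineage view.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[el_assets, dbt_models],
    resources={
        "sling": SlingResource(
            connections=[
                SlingConnectionResource(
                    name="MY_POSTGRES",
                    type="postgres",
                    connection_string=EnvVar("POSTGRES_URL"),
                ),
                SlingConnectionResource(
                    name="MY_WAREHOUSE",
                    type="snowflake",
                    connection_string=EnvVar("SNOWFLAKE_URL"),
                ),
            ]
        ),
        "dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR)),
    },
)
```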

Battlefield test

Alright, all that fancy design and the game-changing way of reasoning about data pipelines sound beautiful, but how do they actually hold up in a real-world project of global reach, with a lot at stake?

Well, of course we cannot give a definitive answer, but we can share insights from our recent experience applying all of these concepts in practice with a top-tier client in revenue terms.

As the data platform engineer who executed most of the technical tasks in the project, I would like to start with the development experience: we can essentially emulate our production setup on a local machine and push locally developed pipelines to production with confidence.

That takes development speed to another level, and such a fast, iterative lifecycle lets us address many data pipeline issues orders of magnitude faster than with conventional tools.

Now, bringing the focus back to the stakeholders' perceived benefits: in our first meetings, the client told us that their demands mainly concerned the observability and monitoring of the data pipelines; they wanted to know what went wrong, when, and why.
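
One feature that speaks directly to that demand is Dagster's run-failure sensor. The sketch below (the webhook URL and payload shape are assumptions for illustration) forwards the what, when, and why of every failed run:

```python
import json
import urllib.request

from dagster import RunFailureSensorContext, run_failure_sensor

# Hypothetical alerting endpoint; replace with your team's channel.
ALERT_WEBHOOK_URL = "https://example.com/alerts"


@run_failure_sensor
def alert_on_run_failure(context: RunFailureSensorContext):
    # The sensor fires once per failed run: the job name says what failed,
    # the run record says when, and the failure message says why.
    payload = {
        "job": context.dagster_run.job_name,
        "run_id": context.dagster_run.run_id,
        "error": context.failure_event.message,
    }
    request = urllib.request.Request(
        ALERT_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```

Registering the sensor in the project's Definitions is enough for every failed run to trigger an alert automatically.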

Once we finished our PoC with a couple of ingestions, they were simply mesmerized by the results: in addition to everything they had asked for, they now had a global view of all the assets being produced.

In addition, over the course of the implementation we saw a massive reduction in recovery time objective (RTO), from weeks to hours, as well as in infrastructure spending.

Since then, our data platform team has gained immense buy-in and now takes the lead in basically every major decision regarding the data architecture and software engineering aspects of the platform, something that is unfortunately rare in data initiatives.

I hope we pave the way for that to become more common, with teams adopting the right Modern Data Stack tools for the right outcomes.

Conclusions

In general, analytics engineers are closer to business stakeholders in day-to-day operations, and their deliverables are far more tangible, so they can easily convince stakeholders of the value they bring to the table.

That is rarely an option for data platform engineers, as it is hard for people without a technical background to see the value behind phrases such as:

“The tasks are executing without failure, so those tables that our Analytics Engineers need to build your DW are reliable”.

Instead, let them easily see what you are building and when each asset was materialized (and is therefore reliable) for downstream consumption.

Make it easy for them to understand how what you are building connects to the data warehouse and the ML models.

Stop trying to convince stakeholders of your platform's value from the engineer's perspective: simply increase the value they perceive from their own.

Dagster + dbt in the MDS: What is next?

We hope you enjoyed this first article in our Dagster series, where we emphasized the single most important reason we believe it has the potential to usher in a new era in the Modern Data Stack: making the platform's value tangible for stakeholders.

Next week, we will start giving more details on that journey to generate value with Dagster, beginning with the first step of every data workflow: ingestion. Specifically, we will talk about how to leverage Dagster's native integrations with open-source EL tools to build functional and cost-effective data workflows.

Stay tuned!
