Cover image for this Medium article, with Dagster and dltHub logos, the series title above the logos, and the article title below.

Data Ingestion with the dagster-embedded-elt Library

Edson Nogueira
Indicium Engineering
7 min read · Jun 7, 2024


Open-source data ingestion can be challenging without a clear go-to solution.

But the dagster-embedded-elt library can simplify this process with lightweight integrations.

In this article, you will learn how to handle databases and filesystems with Sling, and unstructured data with dlt, to improve the ingestion stage when building a data platform.

Data Ingestion and the Paradox of Choice in Open-Source EL

When it comes to data ingestion, we are faced with a couple of scenarios:

  1. Proprietary tools, such as Fivetran: amazing capabilities and an easy setup, but a considerable increase in your bill. If you can justify the high spend on EL pipelines, then you are in a dream scenario.
  2. Open-source EL: the company wants all the benefits of proprietary EL tools, such as robustness and scalability, while spending as little as possible. In this scenario, we face a situation similar to the paradox of choice: with so many options and no clear winners such as dbt for transformation or Snowflake/Databricks for data warehousing, picking a subset of tools to define an EL stack can be harder than it might seem.

So, assuming you are in the second scenario, you might be considering tools such as Meltano or Airbyte. Both have their advantages and shortcomings and, depending on the team’s affinity with those tools, either of them might be the right choice.

Nevertheless, when no tool has an overwhelming advantage, ease of integration with the rest of the data stack begins to gain importance.

And here is where the dagster-embedded-elt library provides an opportunity to use data ingestion tools that are not yet mainstream in the EL landscape, but have the potential to become so:

  • Sling: can handle most databases/warehouses and filesystems you are likely to encounter.
  • dlt: shines when we need to build pipelines for unstructured data (e.g. APIs).

This duo covers most, if not all, of the data sources found in a typical data project and considerably reduces the development time of Dagster assets, thanks to lightweight integrations maintained by the Dagster team itself with the help of its vibrant and growing community.

How Dagster Integrations Can Be Beneficial for Data Ingestion

To fully understand the importance of an integration, and the benefits of using one that is maintained by the Dagster team itself, let’s recall some important Dagster concepts:

  • Op: the core computational unit of the framework. In simple terms, it is “what actually runs” when you materialize an asset. This is the analog of the task concept in other orchestrators.
  • Asset: in simple terms, a combo of op + metadata: in addition to what executes, it carries an extra layer of metadata that translates technical execution into business results. As a bonus, this metadata is accessible in an easy-to-understand way in the Dagster UI (see the minimal sketch after this list).
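To make the distinction concrete, here is a minimal, hypothetical asset: the function body is the op (what actually runs), while the decorator arguments are the metadata layer Dagster surfaces in the UI. The group name and metadata values below are illustrative only.

from dagster import asset

@asset(
    group_name="ingestion",          # metadata: shows up in the Dagster UI
    metadata={"source": "example"},  # metadata: translates execution into business context
)
def raw_customers() -> list[dict]:
    # This function body is the "op": what actually runs on materialization.
    return [{"id": 1, "name": "Acme"}]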

In this context, a Dagster integration allows you to only focus on using your tool of choice — in the same way you would without any orchestration tool — and, then, translates what you have done (“what actually executes”) to Dagster assets.

For instance, the dagster-dbt integration is what we would call the gold standard for such libraries, and it makes orchestrating dbt models as seamless as possible.

That being said, it is important to note that an integration is not strictly necessary to use Dagster with any tool: you can always write your own factory-like code to generate Dagster objects (assets, jobs, etc.) for any tool as well.

However, using an integration is strongly recommended, as it greatly simplifies development and makes ingestion pipelines easier to scale.

Better yet, when the integration is maintained by the Dagster team itself (as dagster-embedded-elt is), you are more likely to extract the full potential of Dagster’s features.

Keep reading to find out how to structure projects combining Sling and dlt with Dagster to improve data ingestion in your projects.

Sling + Dagster

Here we share our thoughts on how to structure the project for easier maintenance and faster development, using a replication from S3 to Snowflake as an example.

  • Code Location Structure

We organize our code location module as follows:

Image illustrating code location module for data ingestion.
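As a reference, the layout boils down to something like the following; the module name is hypothetical, but the files mirror the ones discussed below:

ingestion_sling/
├── __init__.py        # Definitions object
├── replication.yaml   # Sling streams (S3 -> Snowflake)
├── resources.py       # SlingResource / connections
├── translator.py      # custom DagsterSlingTranslator
├── assets.py          # @sling_assets definitions
└── automation.py      # job + schedule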
  • replication.yaml

The core object of this setup, which defines the Sling streams that will be converted to assets in Dagster.

Image showing code defining Sling streams that will be converted to assets on Dagster for data ingestion.
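A minimal sketch of what such a replication.yaml might look like for an S3 → Snowflake replication. The connection names, bucket, paths, and target schema are hypothetical; {stream_file_name} is one of Sling’s runtime variables.

source: MY_S3
target: MY_SNOWFLAKE

defaults:
  mode: full-refresh
  object: raw.{stream_file_name}

streams:
  s3://my-bucket/exports/customers.csv:
  s3://my-bucket/exports/orders.csv: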
  • resources.py

Defines the Dagster resources for interacting with the Sling CLI.

Image illustrating code of Dagster resources for interacting with Sling CLI in data ingestion.
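A sketch of what resources.py could contain, assuming credentials come from environment variables; the connection names must match the source/target declared in replication.yaml, and the exact connection properties depend on your setup.

from dagster import EnvVar
from dagster_embedded_elt.sling import SlingConnectionResource, SlingResource

sling_resource = SlingResource(
    connections=[
        # Connection names must match `source`/`target` in replication.yaml
        SlingConnectionResource(
            name="MY_S3",
            type="s3",
            bucket=EnvVar("S3_BUCKET"),
            access_key_id=EnvVar("AWS_ACCESS_KEY_ID"),
            secret_access_key=EnvVar("AWS_SECRET_ACCESS_KEY"),
        ),
        SlingConnectionResource(
            name="MY_SNOWFLAKE",
            type="snowflake",
            host=EnvVar("SNOWFLAKE_ACCOUNT"),
            user=EnvVar("SNOWFLAKE_USER"),
            password=EnvVar("SNOWFLAKE_PASSWORD"),
            database=EnvVar("SNOWFLAKE_DATABASE"),
            warehouse=EnvVar("SNOWFLAKE_WAREHOUSE"),
            role=EnvVar("SNOWFLAKE_ROLE"),
        ),
    ]
)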
  • translator.py

This is where the mapping between the original objects’ features (e.g. Sling streams) and Dagster assets is defined. The integration already comes with a pre-built translator, but the ability to define a custom one gives you greater flexibility for your specific needs.

Image illustrating code of mapping between features for data ingestion.
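For illustration, a custom translator might override how asset keys and group names are derived from each stream. This sketch assumes the get_asset_key / get_group_name hooks exposed by recent versions of the integration, and that the default translator prefixes keys with “target”; the “raw” prefix and group name are our own choices.

from dagster import AssetKey
from dagster_embedded_elt.sling import DagsterSlingTranslator


class CustomSlingTranslator(DagsterSlingTranslator):
    def get_asset_key(self, stream_definition) -> AssetKey:
        # Swap the default "target" prefix for "raw" (assumption: default prefix is "target")
        default_key = super().get_asset_key(stream_definition)
        return AssetKey(["raw", *default_key.path[1:]])

    def get_group_name(self, stream_definition):
        # Group all Sling-produced assets together in the UI
        return "sling_ingestion"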
  • assets.py

Here is where the magic of the integration happens: we use the translator and the multi-asset decorator provided by the integration to actually generate the Dagster assets.

Image illustrating translator codes used to generate Dagster assets for data ingestion.
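A sketch of the assets module, assuming the hypothetical layout above; the decorator reads the replication config and emits one asset per stream, delegating execution to the Sling CLI via the resource.

from pathlib import Path

from dagster import AssetExecutionContext
from dagster_embedded_elt.sling import SlingResource, sling_assets

from .translator import CustomSlingTranslator

replication_config = Path(__file__).parent / "replication.yaml"


@sling_assets(
    replication_config=replication_config,
    dagster_sling_translator=CustomSlingTranslator(),
)
def s3_to_snowflake_assets(context: AssetExecutionContext, sling: SlingResource):
    # Runs the replication and streams Sling's logs/metadata back to Dagster
    yield from sling.replicate(context=context)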
  • automation.py

Here, we define a job with a schedule to actually orchestrate the materializations of our assets.

Image showing code that defines a job with a schedule to actually orchestrate the materializations of Dagster assets.
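A minimal job and schedule for the Sling assets might look like this; the asset selection and cron expression below are illustrative.

from dagster import AssetSelection, ScheduleDefinition, define_asset_job

sling_ingestion_job = define_asset_job(
    name="sling_ingestion_job",
    selection=AssetSelection.groups("sling_ingestion"),
)

sling_ingestion_schedule = ScheduleDefinition(
    job=sling_ingestion_job,
    cron_schedule="0 6 * * *",  # daily at 06:00
)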
  • __init__.py

Finally, we create our Definitions object with the elements discussed so far.

Image showing code creating definitions object for data ingestion.
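Tying it all together, a Definitions object along these lines; the import paths assume the hypothetical module layout shown earlier, and the resource key must match the parameter name used in the @sling_assets function.

from dagster import Definitions

from .assets import s3_to_snowflake_assets
from .automation import sling_ingestion_job, sling_ingestion_schedule
from .resources import sling_resource

defs = Definitions(
    assets=[s3_to_snowflake_assets],
    jobs=[sling_ingestion_job],
    schedules=[sling_ingestion_schedule],
    resources={"sling": sling_resource},
)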

dlt + Dagster

Similar to the previous section, we now explore the ingestion of LATAM countries’ GDP data from the World Bank API into Snowflake as an example, sharing our thoughts on best practices for the dlt integration.

We will skip the automation.py and __init__.py discussion for this case, as it works almost unaltered from the Sling scenario, with only straightforward modifications mapping Sling → dlt.

  • Code Location Structure
Image illustrating data ingestion code location structure.
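The layout mirrors the Sling module, swapping the replication config for a dlt source and destination; the module name is again hypothetical.

ingestion_dlt/
├── __init__.py       # Definitions object
├── source.py         # dlt source/resources (World Bank API)
├── destination.py    # Snowflake destination from env vars
├── translator.py     # custom DagsterDltTranslator
├── assets.py         # @dlt_assets definitions
└── automation.py     # job + schedule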
  • source.py

The core object of this setup, which defines the dlt resources that will be converted to assets in Dagster.

Image illustrating data ingestion code of the dlt resources to be converted to assets in Dagster.
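A sketch of a dlt source for the World Bank GDP indicator (NY.GDP.MKTP.CD), assuming the public v2 API; the country list is an illustrative subset of LATAM and the resource name is our own.

import dlt
import requests

LATAM_COUNTRIES = ["ARG", "BRA", "CHL", "COL", "MEX", "PER"]  # illustrative subset
INDICATOR = "NY.GDP.MKTP.CD"  # GDP, current US$


@dlt.resource(name="gdp", write_disposition="replace")
def gdp(countries=LATAM_COUNTRIES):
    for country in countries:
        url = (
            f"https://api.worldbank.org/v2/country/{country}"
            f"/indicator/{INDICATOR}?format=json&per_page=100"
        )
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # The API returns a two-element array: [pagination metadata, records]
        _, records = response.json()
        yield from records or []


@dlt.source
def world_bank_source():
    return gdp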
  • destination.py

We define the destination explicitly for better management of environment variable credentials.

Image showing code for definition of destination in the data ingestion process.
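One way to build the destination explicitly, assuming dlt’s snowflake destination factory and credentials read from environment variables; the variable names are our convention, not dlt defaults.

import os

from dlt.destinations import snowflake

snowflake_destination = snowflake(
    credentials={
        "host": os.environ["SNOWFLAKE_ACCOUNT"],
        "username": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
        "database": os.environ["SNOWFLAKE_DATABASE"],
        "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
        "role": os.environ["SNOWFLAKE_ROLE"],
    }
)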
  • translator.py
Image showing code for the translator.py for data ingestion.
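As with Sling, the dlt integration ships a default translator; a custom one can control asset keys and grouping. This is a sketch assuming the get_asset_key / get_group_name hooks exposed by recent versions; the key prefix and group name are our own choices.

from dagster import AssetKey
from dagster_embedded_elt.dlt import DagsterDltTranslator


class CustomDltTranslator(DagsterDltTranslator):
    def get_asset_key(self, resource) -> AssetKey:
        # Prefix dlt resources so they sit next to the Sling assets in the catalog
        return AssetKey(["raw", "world_bank", resource.name])

    def get_group_name(self, resource):
        return "dlt_ingestion"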
  • assets.py
Image showing code for the assets.py in the data ingestion process.
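And the assets module, gluing source, destination, and translator together with the @dlt_assets decorator; the pipeline and dataset names are illustrative, and the resource key in Definitions would need to be "dlt" to match the parameter name.

import dlt
from dagster import AssetExecutionContext
from dagster_embedded_elt.dlt import DagsterDltResource, dlt_assets

from .destination import snowflake_destination
from .source import world_bank_source
from .translator import CustomDltTranslator


@dlt_assets(
    dlt_source=world_bank_source(),
    dlt_pipeline=dlt.pipeline(
        pipeline_name="world_bank_gdp",
        destination=snowflake_destination,
        dataset_name="raw_world_bank",
    ),
    dagster_dlt_translator=CustomDltTranslator(),
)
def world_bank_gdp_assets(context: AssetExecutionContext, dlt: DagsterDltResource):
    # DagsterDltResource.run executes the pipeline and yields one result per dlt resource
    yield from dlt.run(context=context)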

After both Sling and dlt assets are defined, the outcome in the UI will look like the following:

Image showing both Sling and dlt assets in the Dagster UI.

You can check the series’ public GitHub repo, where we parallel the discussions in this series with a concrete implementation of a Modern Data Platform with Dagster + embedded-elt + dbt.

Conclusions

Choosing the right stack for a data project always involves many tradeoffs such as costs, development time, team experience, and so on.

Especially when it comes to open-source data ingestion tools, where no player really stands out from the sea of options (unlike dbt at the transformation helm), the ease of integration with the rest of the stack carries more weight.

In particular, when using Dagster, the dagster-embedded-elt library provides seamless integrations that cover most of the data sources found in any data project: databases and filesystems with Sling, and unstructured data with dlt.

It provides a sufficiently good starting point for building pipelines in a cost-effective and scalable way. Once the pipelines are working, stakeholders will surely be happier than if they were paying a non-negligible bill for data ingestion alone, as would happen with proprietary tools.

What is next?

In this second article of the Dagster Power User series, we discussed the first stage in the logical chain of building a data platform: ingestion.

After we have our raw data in the warehouse at our disposal, it is time for the Analytics Engineers to transform it with dbt to implement the business logic.

That’s what we explore in the next article: how Dagster really shines through its powerful integration with the one true certainty in the Modern Data Stack, enabling unparalleled observability and governance. Stay tuned!

Click here to read the first article of this series if you haven’t already.



Edson Nogueira
Indicium Engineering

I am a Mid-Level Data Engineer @ Indicium and hold a Ph.D. in Physics.