Deploying Dagster to AWS
Dagster is an up-and-coming, open-source data orchestration tool. At Dataroots, we saw value in leveraging the power of the public cloud and created a Dagster Terraform module for AWS. This module enables users to run Dagster in the cloud and manage all their data pipelines there. This article starts with a short introduction to Dagster. Subsequently, the AWS architecture created to deploy the Dagster module with Terraform is discussed. Finally, some instructions on how to use the Dagster module are given.
Introduction to Dagster
What is Dagster?
Dagster is a data orchestration tool that aims to ease the flow of data for machine learning, analytics and ETL tasks. To deploy Dagster, two services need to run: a web interface called Dagit and a background process called the Dagster daemon. Dagit is a GUI that acts as the front end of Dagster and displays information about pipelines, their runs and other Dagster services. Dagit can also be used to launch pipeline runs and check their results, as well as all the in-between steps. The Dagster daemon, on the other hand, runs schedulers and sensors and manages run queues.
Why is Dagster used?
Nick Schrock, the author behind Dagster, explains why he felt the need to develop such a tool in his introduction to Dagster. In short, he believes there are several problems with the way data applications are developed and used. The first problem he identifies is the lack of control over input data: as data flows through different applications, an invalid input can break the rest of the flow. Next, Schrock states that current data orchestration tools only focus on linking different data applications together and hence lose massive amounts of metadata and context. Finally, most data applications are under-tested, which reduces maintainability.
Dagster introduces a new fundamental layer of abstraction into the data application development process and aims to solve the problems expressed above. The next section looks at how this layer is built from the components Dagster proposes.
Basic Dagster components
The most basic Dagster object is a solid. Everything is built by combining solids. A solid takes an input, performs an action and outputs the result; solids are therefore the functional unit of work in Dagster. One or more solids connected together form a pipeline, which is Dagster's name for a directed acyclic graph (DAG). The following image shows a pipeline of four solids.
Multiple pipelines can be grouped into a Dagster repository. The concept of a repository allows Dagster tools to target multiple data pipelines at the same time. Finally, schedules and sensors are key Dagster concepts, as they allow pipelines to be launched at fixed intervals or in response to an external event.
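As an illustration of the scheduling concept, here is a minimal sketch using Dagster's @schedule decorator; the cron expression and the pipeline name (team_pipeline, also used in the example that follows) are illustrative, not part of the module.

```python
from dagster import schedule


# Launch team_pipeline every day at 06:00; the returned dict is the run
# configuration passed to each scheduled run (empty here).
@schedule(cron_schedule="0 6 * * *", pipeline_name="team_pipeline")
def daily_team_schedule(context):
    return {}
```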
Dagster example pipeline
Now that Dagster's core concepts are laid out, let's illustrate them with a simple pipeline. First, the Dagster packages need to be installed in the environment:
pip install dagit && pip install dagster
For this example pipeline, the following dataset, stored as a CSV file, will be used:
name,age,is_team_lead
Sander,24,False
Paolo,24,False
Ruben,30,False
Ricardo,26,False
Viktor,24,True
Charlotte,27,True
The pipeline should compute the average age of the people in the dataset and the number of team leads.
First, the dataset is loaded with a load_p solid:
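A minimal sketch of such a solid, assuming pandas is used for reading the CSV file (the file name team.csv is illustrative):

```python
import pandas as pd
from dagster import solid


@solid
def load_p(context):
    # Load the example dataset from disk into a DataFrame.
    return pd.read_csv("team.csv")
```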
Then, the average age and the number of team leads are computed using the following two solids:
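A sketch of these two solids (the solid names are illustrative) might look as follows:

```python
from dagster import solid


@solid
def average_age(context, df):
    # Mean of the age column.
    return float(df["age"].mean())


@solid
def count_team_leads(context, df):
    # Number of rows where is_team_lead is True.
    return int(df["is_team_lead"].sum())
```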
Finally, a solid that displays the obtained results is added, and all solids are put together to create a pipeline, which is added to a repository:
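Reusing the solids sketched above, putting everything together could look roughly like this (the pipeline and repository names are illustrative):

```python
from dagster import pipeline, repository, solid


@solid
def display_results(context, avg_age, n_team_leads):
    # Log the computed metrics so they show up in the Dagit run view.
    context.log.info(f"Average age: {avg_age}, team leads: {n_team_leads}")


@pipeline
def team_pipeline():
    df = load_p()
    display_results(average_age(df), count_team_leads(df))


@repository
def team_repository():
    return [team_pipeline]
```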
To run this example, execute the following command:
dagit -f name_of_the_pipeline_file.py
Finally, one can head over to the Dagit web UI, where a visual representation of the pipeline is shown. In the Playground tab, it is possible to run the pipeline and see the results. If everything went well, something similar to the next image should be visible:
This simple example pipeline ran on a local machine, but what if the pipelines need to leverage the power of the cloud to run? The next sections explain how to easily deploy Dagster to AWS using Infrastructure-as-Code.
Deployment to AWS
What is infrastructure-as-code?
The increasing popularity of public cloud platforms (e.g. AWS, Azure, GCP) for developing, using and maintaining IT infrastructure went hand in hand with an increased use of Infrastructure-as-Code (IaC). IaC is the practice of provisioning and managing IT infrastructure on cloud platforms through code rather than manual configuration.
Using IaC has multiple benefits, the main one being the ability to incorporate it into continuous integration/continuous delivery (CI/CD) pipelines. This way, the infrastructure can be deployed automatically after thorough tests have been carried out.
Furthermore, codifying the infrastructure increases reproducibility: using IaC makes it possible to create multiple cloud environments (staging, testing, ...) with the same configuration. In this project, Terraform was used as the IaC tool to deploy the infrastructure for the Dagster module in AWS.
Terraform
Terraform is an open-source infrastructure-as-code tool that helps deploy and manage hundreds of cloud services. Terraform allows resources to be declared concisely in resource blocks: instead of writing 'how it should be done', Terraform code expresses 'what should be done'. The following example illustrates the deployment of an S3 bucket in an AWS environment.
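A minimal sketch of such a resource block (the region, bucket name and tags are illustrative):

```hcl
provider "aws" {
  region = "eu-west-1"
}

# Declare what should exist: an S3 bucket with the given name.
resource "aws_s3_bucket" "example" {
  bucket = "my-dagster-example-bucket"

  tags = {
    Environment = "dev"
  }
}
```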
AWS deployment and testing
To create the Terraform Dagster module, an appropriate cloud architecture had to be chosen. This section discusses the Dagster module's cloud architecture and its different components in AWS.
- S3: An S3 bucket is created to store the data pipeline repositories and the configuration files Dagster needs to deploy.
- ECS: Amazon Elastic Container Service (ECS) hosts the containers that continuously run the Dagster daemon and Dagit. Next to that, an initialisation container runs once on startup. This container has the sole purpose of moving the files in the S3 bucket to the ECS shared volume, making them available to the Dagit and Dagster daemon containers. In doing so, both Dagster containers have access to the files (pipelines and configuration files) necessary for a successful deployment.
- Database: An Amazon Relational Database Service (RDS) instance is created. The Dagster containers running on ECS write logs about succeeded/failed runs, scheduled jobs, etc. to this database.
- Load Balancer: An Application Load Balancer (ALB) provides the connection from the Dagit web server to the outside world.
The average cost of this Dagster environment in AWS (minimal setup) is illustrated below:
This module was thoroughly tested with automated tests to check that no errors occur during the deployment of the defined AWS architecture. Furthermore, additional tests were written to make sure all services are healthy and working as intended. These functionality tests consist of the following checks:
- Check if the ECS Cluster exists.
- Check if the ECS Service is active and if a task is running.
- Check the status of the containers hosted on ECS.
- Check if the web server container is healthy by making an HTTP(S) call to it.
Terratest, a Go library for writing automated tests for infrastructure code, makes it possible to create a test deployment in AWS and perform the defined checks on the functionality and health of the different services. After those checks, the test infrastructure is destroyed.
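As an illustration, a Terratest test typically follows an apply, check, destroy pattern. The sketch below assumes a hypothetical alb_url module output and leaves out the actual ECS and HTTP(S) checks:

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestDagsterModule(t *testing.T) {
	// Point Terratest at the Terraform module under test; the path is illustrative.
	terraformOptions := &terraform.Options{
		TerraformDir: "../",
	}

	// Tear the test infrastructure down again at the end of the test.
	defer terraform.Destroy(t, terraformOptions)

	// Deploy the module into a test AWS account.
	terraform.InitAndApply(t, terraformOptions)

	// Read a module output (the output name is hypothetical) that could then be
	// used for the health checks described above, e.g. an HTTP(S) call to Dagit.
	albURL := terraform.Output(t, terraformOptions, "alb_url")
	if albURL == "" {
		t.Fatal("expected a non-empty load balancer URL output")
	}
}
```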
How to use the Terraform Dagster module
This section elaborates on how to use the Terraform Dagster module proposed in this article. The module comes with a dagster_init_files folder containing three crucial files. One file, syncing_pipeline.py, is a provided pipeline with a specific purpose, which will be discussed shortly. The two remaining files are the Dagster configuration files, workspace.yaml and dagster.yaml. The workspace.yaml configuration file tells Dagit where data pipeline repositories are stored. This way, the Dagit web server has access to the pipelines and can display them in the UI. The dagster.yaml file defines all the configuration Dagster needs for deployment, for example where to store the history of succeeded and failed runs of data pipelines. For now, only the syncing pipeline is present, so the workspace configuration file looks as follows:
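A minimal workspace.yaml that only loads the syncing pipeline could look like this (the exact path depends on where the files end up on the shared volume):

```yaml
load_from:
  - python_file:
      relative_path: syncing_pipeline.py
```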
The dagster.yaml file holds the additional, default configuration information necessary for a successful Dagster deployment. These configuration files can be used together with the Terraform module to create the AWS infrastructure defined earlier.
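As a sketch, pointing Dagster's run storage at the RDS database described earlier could look roughly as follows; the use of dagster_postgres and the environment variable names are assumptions, not the module's exact configuration:

```yaml
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      # Credentials and host are read from environment variables (names are illustrative).
      username:
        env: DAGSTER_PG_USERNAME
      password:
        env: DAGSTER_PG_PASSWORD
      hostname:
        env: DAGSTER_PG_HOST
      db_name:
        env: DAGSTER_PG_DB
      port: 5432
```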
Dagit will run continuously and be accessible via the internet, allowing users to see and manage all their data pipelines, schedules, sensors and so on. Since managing data pipelines implies creating new pipelines or adjusting existing ones, the module should support this. Newly created or updated pipelines should be added to the S3 storage bucket. Furthermore, the workspace.yaml configuration file in S3 should be updated so that the running Dagit web server knows where to look for the newly created pipeline. The updated configuration file will then look as follows:
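Assuming the new pipeline lives in a file called my_new_pipeline.py (a hypothetical name), the updated workspace.yaml could look like this:

```yaml
load_from:
  - python_file:
      relative_path: syncing_pipeline.py
  - python_file:
      relative_path: my_new_pipeline.py
```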
Finally, the user should use the Dagit web server to launch a run of the syncing pipeline. This pipeline has one sole purpose: syncing the files in the S3 storage bucket with the shared volume used by both Dagster containers. Hence, after running this pipeline, the updated files are present in the shared volume and accessible by both Dagster containers. As a result, the newly created data pipeline shows up in the Dagit UI. This enables users to manage multiple data pipelines, create new pipelines or adjust existing ones.
Conclusion
This article presented a short introduction to Dagster, using an example data pipeline. Next to that, the concepts of infrastructure-as-code and Terraform were introduced. Subsequently, the different components of the proposed AWS architecture needed for a successful Dagster deployment were illustrated and discussed. Finally, instructions on how to manage data pipelines using the AWS Dagster module were given.
We thank you for reading this article and hope it helped you understand how to deploy Dagster to the cloud. Similar to this module, we also created a Terraform module for deploying Dagster to Azure. More information on both modules can be found in the following places:
https://dataroots.io/
https://www.linkedin.com/company/dataroots
https://github.com/datarootsio
https://github.com/datarootsio/terraform-aws-ecs-dagster
https://github.com/datarootsio/terraform-azurerm-aci-dagster