Lyn Health’s Data Laboratory

Part 1: Deploying Dagster on ECS with GitHub Actions

Kevin Haynes
10 min read · Apr 6, 2022

Lyn Health is a healthcare company focused on helping the polychronic population (those who live with multiple chronic health conditions) in the US. The collection and application of data is key to this mission. Caring for our members would be impossible without high-quality data available for use across the organization. Clinicians need data that helps them understand a member’s medical history, prescriptions, and insurance benefits, as well as data about the cost and quality of providers and facilities, in order to refer members to the best care. Finance, operations, and sales teams need data describing the efficiency of our business. The data science team needs all of this and more in order to solve analytical problems, predict members’ future needs, and come up with automated solutions to operational opportunities.

On Lyn’s data science team, we hold ourselves to high standards of analytical rigor and the ethical use of data. Additionally, we want to be good citizens of our communities, both the physical ones we live and work in and the virtual ones. This article is the first in a series intended to help educate and inform those who are working to solve similar problems, in hopes that the next person may struggle a little bit less.

So What’s This Article About?

It is currently both an exciting and a tumultuous time to be a Data Engineer. You can’t really look in any direction without stumbling into someone talking about the Data Mesh or some similar concept with a different name (we call ours the Data Laboratory). Companies are understanding that using data effectively is critical to building successful products. The buzz and business criticality are making data engineering a far flashier role than it was in the past. There is no shortage of tools intended to help engineers better orchestrate and maintain “the mesh.” This article is not intended to sell you on any of the above points, however. If you’re here, it’s likely because you already know that a modern data platform is important.

This article will describe how we set up what we view as the most critical piece of the Data Lab: an orchestration service. For this use case, we chose to use an exciting open source tool called Dagster. This article will outline our first deployment of a Dagster instance and some of the lessons learned along the way.

What is Dagster?

Dagster’s adorable blue octopus logo

Dagster is “the data orchestration platform built for productivity.” Sounds pretty neat, doesn’t it? There are plenty of great articles, videos, and blog posts that help you understand what Dagster is and what it isn’t (hint: it isn’t Airflow), so I won’t get too far into the weeds here. At a high level, Dagster allows you to write Python to author, test, schedule, and monitor pipelines for producing and maintaining data assets.

Yes, there are lots of tools out there that you can use for that purpose, but they’re not necessarily built for that purpose. The reason our team decided to go all-in here instead of on one of those other tools is that Dagster is built explicitly for data pipelines. It is data-aware where other tools attempt to be data-agnostic. While this may mean that you spend some extra development time defining your data-related assets in Dagster, the benefit is that your jobs end up more robust as a result. The way I see it, Dagster (combined with data quality measurement) is a potential solution for anyone who has worked with data assets and thought “how could someone possibly load data that is this terrible?”

It is worth noting that Dagster is newer compared to other, more generalized orchestration tools (e.g. Airflow, Luigi, etc.). The first version debuted in summer 2019; since then, they have had 14 major releases and are working on their 15th. One side effect of this is that the catalog of videos, blog posts, and technical articles (like this one) is a little slim for now, but that is changing rapidly. Dagster is starting to hire developer advocates and is maintaining collections of community use cases like this one. Their official documentation also becomes stronger every day, and it’s effortless to provide feedback via a button on the top right of every page of the site — and they actually read it, respond, and make changes quickly.

As with any cutting-edge product (especially an open-source one), frequent releases come with the territory. Put simply, this means that things have changed a lot and will continue to change. Though each release is thoroughly tested, well-supported, and robust, some teams using Dagster may not have the time, resources, or skills required to test and deploy new versions frequently. Our experience has been that each release is, in fact, straightforward to deploy without any breaking changes, and deployments on an older version can continue to run without updates. Even so, this can be something to consider for those evaluating the product.

At Lyn, we decided we were excited enough about the rewards of a testable, data-aware platform to accept the risks that come with frequent releases — after all, we’re also an early-stage startup that will need to prove itself in the marketplace. Additionally, once we started working with Dagster, we found a passionate online community with truly incredible support from the team at Elementl. The engineers building the software interact directly with users every single day, helping us understand how to implement and troubleshoot the software and gathering feedback and ideas for future product increments. Many companies try to get input from their customers, but when you work with Dagster, it feels like their team is in the trenches with you. As a “small but mighty” data team, this level of unrivaled support has been critical to getting things done quickly.

TL;DR: Dagster may be one of the newer kids on the orchestration block, but it is absolutely ready for superstar production status.

Okay, that was enough editorializing. You came here for the technical stuff, right? Let’s get into it.

Because Dagster’s docs are great at helping you get a functioning instance on your local machine, we won’t cover that in this article (maybe we will in a future article if there’s interest). Instead, we’re going to cover how to deploy an existing instance to ECS using GitHub Actions.

AWS Infrastructure

As is usually the case with software such as Dagster, you have a multitude of options as to how and where you deploy it — and many of those options are in Dagster’s deployment guide docs. While the most detailed deployment guide is the one for Kubernetes (k8s) with the provided Dagster Helm chart, we decided to use Amazon ECS because it aligned well with our existing infrastructure for other services.

High-level diagram of the AWS infrastructure, showing how the different services interact

For our deployment, we use the following AWS services (Amazon Web Services… services?). All of our infrastructure is created using Terraform; a minimal sketch of the ECS pieces follows the list.

  • CloudWatch: holds the logs emitted from the containers
  • Elastic Container Registry (ECR): holds the Dagster container image definition used to create ECS tasks
  • Elastic Container Service (ECS): hosts the Dagster service and its task, which holds the dagit and daemon containers
  • Fargate: the ECS containers are launched on Fargate so that we don’t have to worry about managing instances
  • Relational Database Service (RDS): used as the back-end run, schedule, and event log storage for our Dagster services
  • Secrets Manager (ASM): securely stores secrets we want to use in our containers as environment variables
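
To give a flavor of the Terraform side, here is a heavily trimmed sketch of how the ECS cluster and Fargate service could be declared. This is not our actual configuration: the resource and variable names are assumptions for illustration, and the networking, IAM, RDS, and Secrets Manager resources are omitted entirely.

```hcl
# Minimal sketch of the ECS pieces in Terraform (not our full configuration).
# Networking, IAM, RDS, and Secrets Manager resources are omitted; the
# variable and resource names here are placeholders.

variable "private_subnet_ids" { type = list(string) }
variable "dagster_security_group_id" { type = string }
variable "dagster_task_definition_arn" { type = string } # registered by the deploy workflow

resource "aws_ecs_cluster" "dagster" {
  name = "dagster"
}

resource "aws_ecs_service" "dagster" {
  name            = "dagster"
  cluster         = aws_ecs_cluster.dagster.id
  task_definition = var.dagster_task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.dagster_security_group_id]
  }
}
```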

GitHub Actions

Once the necessary resources have been provisioned in AWS using Terraform, we use GitHub Actions to create Continuous Integration and Continuous Deployment (CI/CD) workflows for our deployment.

The test workflow runs on every commit. It checks for flake8 compliance and runs our tests with pytest.
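
Our exact workflow file isn’t reproduced here, but a minimal sketch of that kind of test workflow looks something like this (the Python version, file layout, and job names are assumptions):

```yaml
# .github/workflows/test.yml — a minimal sketch of a lint + test workflow.
# Python version, file paths, and job names are assumptions, not our exact config.
name: Test

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - name: Install dependencies
        run: pip install -r requirements.txt flake8 pytest
      - name: Lint with flake8
        run: flake8 .
      - name: Run tests
        run: pytest
```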

The deploy workflow runs when branches are merged to our main branch and handles updating the image in ECR and the task definition in ECS.
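
Below is a simplified sketch of what the deploy workflow looks like. It is a reconstruction rather than our exact file: the repository, cluster, service, and container names are assumptions, but the steps use the standard AWS-provided actions (configure-aws-credentials, amazon-ecr-login, amazon-ecs-render-task-definition, and amazon-ecs-deploy-task-definition).

```yaml
# .github/workflows/deploy.yml — a simplified sketch of the deploy workflow.
# Repo, cluster, service, and container names are placeholders for illustration.
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build, tag, and push image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          ECR_REPOSITORY: dagster
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"

      - name: Render task definition for dagit
        id: render-dagit
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: dagster-dagit
          image: ${{ steps.build-image.outputs.image }}

      - name: Render task definition for the daemon
        id: render-daemon
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ steps.render-dagit.outputs.task-definition }}
          container-name: dagster-daemon
          image: ${{ steps.build-image.outputs.image }}

      - name: Deploy task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render-daemon.outputs.task-definition }}
          cluster: dagster
          service: dagster
          wait-for-service-stability: true
```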

In order to get the deploy workflow functioning, you’ll need to have a few things in your project:

AWS Credentials

First, you’ll need to store secrets in your GitHub account that contain the AWS credentials you want the workflow to use (they’re read by the configure-aws-credentials step in the deploy workflow above). You’ll also need to ensure that these secrets are available to this repository.
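
Zoomed in, that step looks something like this; the secret names and region shown here are common conventions rather than requirements:

```yaml
# Reads AWS credentials from GitHub repository (or organization) secrets.
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1
```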

Dockerfile

Next, you’ll need a Dockerfile for Dagster. In our case, we were able to use the same Dockerfile for both the dagit and Dagster daemon containers. It sets the required DAGSTER_HOME environment variable and copies the Dagster-specific files and our requirements.txt into the container. In this example, my_dagster_folder would contain all of the jobs, ops, schedules, sensors, and resources for your Dagster instance, as well as its workspace.yaml and dagster.yaml files.
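
A sketch of that kind of shared Dockerfile is shown below. The base image, Python version, and directory layout are assumptions for illustration; entry points are intentionally left out because they’re set in the ECS task definition (covered next).

```dockerfile
# Sketch of a shared Dockerfile for the dagit and daemon containers.
# Base image, versions, and directory names are assumptions, not our exact file.
FROM python:3.9-slim

# Dagster looks for its instance config (dagster.yaml) in DAGSTER_HOME.
ENV DAGSTER_HOME=/opt/dagster/dagster_home
RUN mkdir -p $DAGSTER_HOME

WORKDIR /opt/dagster/app

# Install Python dependencies (dagster, dagit, etc. live in requirements.txt).
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the Dagster code (jobs, ops, schedules, sensors, resources) and config.
COPY my_dagster_folder/ ./my_dagster_folder/

# Place the instance config where Dagster expects it.
RUN cp my_dagster_folder/dagster.yaml $DAGSTER_HOME/dagster.yaml
```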

ECS Task Definition

Finally, you’ll need an ECS task definition written as JSON. This is probably the hardest, or at least most finicky, part of this process because the syntax is specific to AWS. You can either write the definition from scratch (using the example below as a starting point) or build a new task in the AWS Console and copy the resulting task definition. The task definition is where you define:

  • The log configuration (CloudWatch in this example)
  • The containers’ attributes (image, name, entry points, port mappings, environment variables, etc.)
  • The task’s specifications (CPU, memory, IAM roles, network mode, etc.)

Here’s an example that we’ll break down below:
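
(The example below is a simplified sketch rather than our production file: the account ID, ARNs, region, log group, port, environment variable, and secret names are placeholders for illustration.)

```json
{
  "family": "dagster",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "taskRoleArn": "arn:aws:iam::123456789012:role/dagster-task-role",
  "executionRoleArn": "arn:aws:iam::123456789012:role/dagster-task-execution-role",
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/dagster",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "dagit"
        }
      },
      "portMappings": [
        { "containerPort": 8000, "hostPort": 8000, "protocol": "tcp" }
      ],
      "entryPoint": ["dagit", "-h", "0.0.0.0", "-p", "8000", "-w", "my_dagster_folder/workspace.yaml"],
      "environment": [
        { "name": "DAGSTER_POSTGRES_SCHEMA", "value": "dagster" }
      ],
      "secrets": [
        {
          "name": "DAGSTER_POSTGRES_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:dagster-postgres-password"
        }
      ],
      "name": "dagster-dagit",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagster:latest"
    },
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/dagster",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "daemon"
        }
      },
      "entryPoint": ["dagster-daemon", "run"],
      "environment": [
        { "name": "DAGSTER_POSTGRES_SCHEMA", "value": "dagster" }
      ],
      "secrets": [
        {
          "name": "DAGSTER_POSTGRES_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:dagster-postgres-password"
        }
      ],
      "name": "dagster-daemon",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagster:latest"
    }
  ]
}
```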

For the task itself, besides the container definitions, we need to define how much CPU and memory to allocate, the task IAM role and the task execution IAM role, the task family, the network mode, and how we want ECS to host the containers. In the example above, these are the top-level keys that sit alongside containerDefinitions (cpu, memory, taskRoleArn, executionRoleArn, family, networkMode, and requiresCompatibilities).

The first thing you’ll see in each container is the CloudWatch logging configuration, which is nearly identical across all of the containers. You give the name of the log group, the region, and a prefix to add to the stream for the given container (so you can tell the logs for different containers apart).

Next, we map ports between the host and the container. dagit is served on port 3000 by default, but for the sake of an example, our dagit container is hosted on port 8000 (set via the -p option in its entry point), so we map the container’s port 8000 to the host’s port 8000 in portMappings. The daemon container doesn’t serve anything, so we don’t map any ports there.

Next, we set entry points. Because dagit and the daemon share the same Docker image, we set their entry points here instead of in the Dockerfile. For dagit, we run the dagit command with options for the host and port as well as the location of our workspace.yaml file. For the daemon, we just need dagster-daemon run.

I also wanted to show an example of how you would set environment variables, including secret/sensitive ones pulled from AWS Secrets Manager (ASM). Both containers have environment variables (defined in the "environment": [] block) for the non-sensitive things our applications require (e.g., the name of the schema in your back-end database). But we can also inject sensitive values like credentials and tokens as environment variables. These are defined in the "secrets": [] block and, rather than taking a literal value, they take the ARN of the secret in ASM. Make sure that the task execution IAM role has permission to read the necessary secrets.

Other than the above, each container definition just needs a unique name and the URI of the ECR image that should be pulled.

Digging into Image Tags

In our example task definition, each container specifies the latest tag, but the actual tag that ends up deployed for the Dagster containers is set to the GitHub SHA by our deploy workflow. Looking back at our example workflow, let’s zoom in a bit to understand how this works.

After checking out the code, configuring AWS credentials, and logging in to ECR, we update the image in ECR. You’ll see three environment variables for this step: the ECR registry (which is the output of the previous step), the name of the ECR repo to update, and the IMAGE_TAG, which is set to the GitHub SHA. This step then runs a docker build and docker push with that tag, and finally echoes the image as a step output, so that it can be used by other steps.
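
Pulled out of the deploy workflow sketch above, the build-and-push step looks something like this (the repository name is a placeholder):

```yaml
- name: Build, tag, and push image to Amazon ECR
  id: build-image
  env:
    ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}  # output of the ECR login step
    ECR_REPOSITORY: dagster
    IMAGE_TAG: ${{ github.sha }}
  run: |
    docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
    docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"
```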

Now we can use that output to set the correct image tag in our task’s container definitions. However, the render-task-definition action can only update one container’s image at a time, so we need to run it twice. The first time, we pass in the task definition JSON file from our project along with the image output from the ECR step to set the tag of the dagster-dagit container. The second time, rather than passing in the task definition file again, we pass in the rendered output of the previous step. That last detail is worth emphasizing: it caused hours of confusion for me before I figured out what I was doing wrong.
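
Here is what those two chained render steps look like in the sketch above (container and file names are assumptions):

```yaml
- name: Render task definition for dagit
  id: render-dagit
  uses: aws-actions/amazon-ecs-render-task-definition@v1
  with:
    task-definition: task-definition.json            # the JSON file from the project
    container-name: dagster-dagit
    image: ${{ steps.build-image.outputs.image }}    # output of the ECR build/push step

- name: Render task definition for the daemon
  id: render-daemon
  uses: aws-actions/amazon-ecs-render-task-definition@v1
  with:
    # Pass the output of the previous render step, NOT the file again,
    # so the dagit image update isn't thrown away.
    task-definition: ${{ steps.render-dagit.outputs.task-definition }}
    container-name: dagster-daemon
    image: ${{ steps.build-image.outputs.image }}
```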

Congratulations!

If you made it this far, you should have…

  • A functioning task in ECS hosting dagit and the daemon
  • A CI/CD pipeline to deploy your Dagster instance any time you change it

From here, you can start building all of the Dagster goodies that transport your data to the right place at the right time.

What’s Next?

We hope you found this article helpful. There are plenty of topics and technologies here that could warrant their own article. We plan to publish more articles like this about the application of data at Lyn Health, such as…

  • How we use Great Expectations to evaluate the quality of data throughout pipelines
  • How we use Docker Compose for local development of Dagster
  • How our data scientists apply data to help our members and Care Circle

What do you want to hear about from us? Do you want a deeper dive on some of our infrastructure in Terraform? Some code snippets to help you deploy your own Dagster instance on ECS? Let us know in the comments — all your feedback is appreciated!

About the Author

Kevin Haynes is the Principal Data Engineer at Lyn Health and formerly held various data-related roles at Nordstrom, Inc. Kevin’s main professional goal is to build the tools and culture that make awesome people want to come to work. Besides technology, Kevin enjoys writing, geeks out over a wide range of sports and recreation activities, and is always down to pet your dog.
