Deploying and Running dbt on AWS Fargate

Cost-Effectively Automating Deployment and Transformations in your Cloud Data Warehouse

Venkat Sekar
Hashmap, an NTT DATA Company
Jun 23, 2020


A common question that pops up across client consulting engagements is, “How can I have a CI/CD pipeline and execute dbt models?”

A number of approaches to achieve this have been discussed.

And of course, there is the option of dbt Cloud itself.

But what if you had different criteria?

  • Offer a multi-tenant dbt cluster — an environment where multiple independent dbt pipelines execute without affecting each other.
  • Horizontally scalable, with no need to worry about sizing. As more dbt pipelines are added, there is little concern about how they will affect existing pipeline executions.
  • Billed only for the duration of the pipeline execution.
  • Ability to execute multiple versions of the dbt data pipeline.
  • Tagging of dbt data pipeline executions, so billing and cost can be analyzed per pipeline.
  • Each dbt data pipeline can use a different service account.
  • Share the same cluster across multiple targets or multiple data sources.
  • Ability to trigger the pipeline using time-based schedulers or event triggers or even be invoked by a REST endpoint.
  • The choice of dbt CI/CD deployment tool (Azure DevOps, CircleCI, Jenkins, etc.) does not affect how the execution is implemented.
  • Logs are captured by cloud-native implementations.
  • Notifications and alerts can be activated based on logs.

Thus, an implementation closer to dbt Cloud (minus the UI), running within your own environment, would be a better fit for these criteria.

This approach offers several benefits:

  • The execution cluster can be maintained by an admin group, separate from the data pipeline development teams.
  • The dbt data pipelines can be developed and deployed by separate projects or development teams.
  • It suits teams that would eventually like to run in an environment like Kubernetes but have yet to learn and adopt it.

Most of the capabilities mentioned above can be achieved using existing cloud services. Interested? Follow along as I walk through the reference implementation we did using AWS Fargate. I will present the various choices that were taken into consideration, as well as other aspects of the code and functionality.

Solution Overview

For this reference implementation, the code base targets AWS, and the data warehouse of choice is Snowflake. The code is available in my GitLab repository.

Highlights

  • A Python slim-buster Docker image is hosted in ECR.
  • The dbt data pipeline projects are packaged and hosted in an S3 bucket.
  • The service account credentials for logging into the data warehouse are stored in Secrets Manager.
  • A dbt project-specific task definition is configured in Fargate.

Here are the execution steps:

  1. A trigger (scheduler, REST API, etc.) will make a request to instantiate the task definition in Fargate (a sketch of such a request follows this list).
  2. Fargate will allocate a task container and pull the configured Docker image from ECR.
  3. An entry point script, packaged in the Docker image, will copy the dbt data pipeline and other necessary artifacts (explained in later sections) into the running container. Each container instance has 10 GB of ephemeral space.
  4. A separate script, run_pipeline.sh, will read the secrets from Secrets Manager, export them as environment variables, and invoke the dbt commands.
  5. All logs printed by dbt will be captured in CloudWatch.
  6. Once the dbt pipeline has finished, the container will shut down.
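
To make step 1 concrete, here is a minimal sketch of such a trigger. The cluster, task definition, container, subnet, security group, and environment variable names are hypothetical placeholders; the actual names in the repository will differ.

    # Hypothetical sketch: trigger the dbt task definition on Fargate.
    # The private subnet is assumed to have NAT or VPC endpoints so the image
    # and S3 artifacts can still be pulled.
    aws ecs run-task \
      --cluster dbt-cluster \
      --launch-type FARGATE \
      --task-definition dbt-pipeline-task \
      --count 1 \
      --network-configuration 'awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234],assignPublicIp=DISABLED}' \
      --overrides '{
        "containerOverrides": [{
          "name": "dbt-runner",
          "environment": [
            {"name": "DBT_S3_PACKAGE", "value": "s3://my-dbt-artifacts/my_pipeline.tar.gz"},
            {"name": "DBT_COMMAND",    "value": "dbt run --target prod"}
          ]
        }]
      }'

The same RunTask call can be issued by an EventBridge schedule, a Step Functions state, or a Lambda behind an API, which covers the time-based, event, and REST triggers mentioned earlier.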

Design Considerations

Choice of the execution environment

For multi-tenant, horizontally scalable, almost serverless functionality, Kubernetes (AKS, EKS, or GKE) would ideally be the first thought. In my various client engagements, however, it was evident that not all client teams were ready to take on Kubernetes. The team would still need to learn and maintain worker nodes and concepts like storage, load balancers, etc.

AWS Fargate (similar to Google Cloud Run or Azure Container Services) provides a simple approach to hosting and executing containerized applications. Adoption and implementation are much easier, and it still allows the team to migrate the containers to Kubernetes in the future if needed.

The pricing of AWS Fargate was also preferred, as you are billed only for the duration of container execution. You don’t need to have the container running 24/7. You can instantiate the container using APIs, and once the container finishes its task, it will shut down.

Scalability is inbuilt into AWS Fargate, as you can spin up multiple instances of the container as needed.

Security is offered by IAM roles and policies which can dictate who can instantiate the containers. Also, the containers can execute inside your VPC and subnet of choice. The subnet can also be a private subnet, which would prevent external parties from interacting with your container. An additional aspect is that you cannot SSH into a running AWS Fargate container, which means the container is locked down.

All logs written to the console, by the container application, will by default get captured by AWS CloudWatch logs.

Why not Lambda?

Well, long story short, Lambda has a current time limit of 15 minutes. While the data pipelines I had developed typically ran in less than 5 minutes, that would not be the case in all conditions.

The SQL complexity, DAG depth, and data volume might result in a longer execution time. So, to be on the safe side, AWS Lambda was not the choice.

Fargate Tasks and Containers

Each Fargate task will host only one instance of the dbt Docker image, which means we do not have to worry about sizing. Each of these container instances runs a specific dbt data pipeline, and the independent tasks provide isolation between multiple dbt data pipelines.

dbt is a lightweight process: the model transformations are delegated to the data warehouse, such as Snowflake or BigQuery. For that reason, the memory and CPU requirements are minimal, so the choice for CPU/memory is 0.25 vCPU with 512 MB. This is the smallest Fargate task size, which keeps our cost very low.
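
For illustration, a minimal task definition at that size might be registered roughly as follows. The family, role ARNs, image URI, and log group below are placeholders rather than the values used in the actual repository.

    # Hypothetical sketch: register a 0.25 vCPU / 512 MB Fargate task definition
    # for the dbt container.
    cat > dbt-task-def.json <<'EOF'
    {
      "family": "dbt-pipeline-task",
      "requiresCompatibilities": ["FARGATE"],
      "networkMode": "awsvpc",
      "cpu": "256",
      "memory": "512",
      "executionRoleArn": "arn:aws:iam::123456789012:role/dbt-task-execution-role",
      "taskRoleArn": "arn:aws:iam::123456789012:role/dbt-task-role",
      "containerDefinitions": [
        {
          "name": "dbt-runner",
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dbt-runner:latest",
          "essential": true,
          "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
              "awslogs-group": "/ecs/dbt-runner",
              "awslogs-region": "us-east-1",
              "awslogs-stream-prefix": "dbt"
            }
          }
        }
      ]
    }
    EOF
    aws ecs register-task-definition --cli-input-json file://dbt-task-def.json

The task role is what grants the running container read access to the S3 artifacts and the Secrets Manager entry, while the execution role lets Fargate pull the image from ECR and write to CloudWatch Logs.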

Docker Image

Fargate currently does not cache Docker images, so future invocations of the same image do not start any faster. Fargate also currently downloads the image from ECR over the internet. The takeaway is that the Docker image should be small: a small image downloads faster and boots up faster.

While there are multiple existing dbt Docker images, we chose not to use them. The main reasons were the large image size (500 MB+), the download time, and the boot-up time. In the various tests I conducted, the time to download and boot up ranged up to 2+ minutes in some cases.

Part of the reason the image was large is the Python base, which accounts for 200 MB+. Add in dbt, which pulls in a lot of dependencies, and the image size grows to 500 MB+.

The way to overcome this is to split the Docker image into multiple parts. To start off, I observed that it was much faster to boot a small Docker image and copy tar-gzipped content from S3. This concept is somewhat similar to the Lambda Layers approach.

  • The dbt library is captured in a virtual environment, which is then packaged (tar-gzipped) and stored in S3.
  • The dbt data pipeline project is also packaged (tar-gzipped) and stored in S3; a later section explains why this is not bundled into the Docker image. A packaging sketch follows this list.
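
As a rough packaging sketch (the bucket name, project name, and unpinned dbt install here are assumptions, not the repository's actual values):

    # Hypothetical sketch: package the dbt library and a dbt project, then upload both to S3.
    # Build inside python:3.7-slim-buster (or an equivalent environment) so the
    # virtual environment matches the runtime container.
    set -euo pipefail
    ARTIFACT_BUCKET="s3://my-dbt-artifacts"     # placeholder bucket

    # 1. Capture the dbt library in a virtual environment and package it.
    python3 -m venv dbt_venv
    ./dbt_venv/bin/pip install --upgrade pip
    ./dbt_venv/bin/pip install dbt              # pin the version your pipelines are tested against
    tar -czf dbt_venv.tar.gz dbt_venv
    aws s3 cp dbt_venv.tar.gz "${ARTIFACT_BUCKET}/dbt_venv.tar.gz"

    # 2. Package the dbt data pipeline project and upload it, along with run_pipeline.sh.
    tar -czf my_pipeline.tar.gz my_dbt_project/
    aws s3 cp my_pipeline.tar.gz "${ARTIFACT_BUCKET}/my_pipeline.tar.gz"
    aws s3 cp run_pipeline.sh "${ARTIFACT_BUCKET}/run_pipeline.sh"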

So, the custom Docker image we chose contains:

  • Python 3.7 (slim-buster)
  • AWS CLI

The image size in ECR is now 75 MB, so downloading from ECR is fast.
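
A minimal image along those lines could be built and pushed roughly as follows; the repository name, account ID, and region are placeholders, and the actual Dockerfile in the repo may differ.

    # Hypothetical sketch: build the small Python slim-buster + AWS CLI image and push it to ECR.
    cat > Dockerfile <<'EOF'
    FROM python:3.7-slim-buster
    RUN pip install --no-cache-dir awscli
    COPY entrypoint.sh /entrypoint.sh
    ENTRYPOINT ["/bin/bash", "/entrypoint.sh"]
    EOF

    docker build -t dbt-runner:latest .
    aws ecr get-login-password --region us-east-1 \
      | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
    docker tag dbt-runner:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/dbt-runner:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/dbt-runner:latest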

Once the container gets instantiated, an entry point script will copy the dbt library and the dbt data pipeline project from S3. A virtual environment is activated with the dbt library and the dbt data pipeline is executed.

With this approach, the boot-up time is 20–30 seconds.

Why is the dbt data pipeline project not part of the docker image?

We wanted our solution to be multi-tenant. Baking the dbt data pipeline projects into the Docker image would result in multiple ECR repositories, and costs could potentially go up.

By keeping the dbt data pipeline project outside of the image, we can have just one ECR repository and fewer Docker images to maintain through vulnerability scans. Technically, since the Docker image does not contain any client code, we host this image on the public Docker Hub as well.

How does the container instance look?

What are the scripts that you keep mentioning?

From the time the container is instantiated to the time the dbt models execute, a set of pre-developed scripts run. They are as follows:

  • entrypoint.sh: This is present in the Docker image. Its main purpose is to copy the content from S3 into local ephemeral storage. It then invokes the run_pipeline.sh script.
  • run_pipeline.sh: This is not baked into the Docker image, to allow flexibility; it is hosted in the S3 bucket along with the packaged artifacts. Once invoked, it unzips the dbt package and invokes the dbt data pipeline. Simplified sketches of both scripts follow this list.
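
To give a feel for the flow, here are heavily trimmed, hypothetical sketches of both scripts. The S3 locations, secret name, profile keys, and environment variable names are assumptions; consult the repository for the real implementation.

    #!/bin/bash
    # entrypoint.sh (hypothetical sketch) -- baked into the Docker image.
    set -euo pipefail
    cd /tmp    # work inside the task's ephemeral storage

    # Copy the packaged dbt library, the dbt project, and run_pipeline.sh from S3.
    aws s3 cp "${DBT_VENV_PACKAGE}" dbt_venv.tar.gz
    aws s3 cp "${DBT_S3_PACKAGE}" pipeline.tar.gz
    aws s3 cp "${RUN_PIPELINE_SCRIPT}" run_pipeline.sh

    bash run_pipeline.sh

And a correspondingly trimmed run_pipeline.sh:

    #!/bin/bash
    # run_pipeline.sh (hypothetical sketch) -- hosted in S3 alongside the packaged artifacts.
    set -euo pipefail

    # Unpack the dbt virtual environment and the dbt project, then activate the venv.
    tar -xzf dbt_venv.tar.gz
    tar -xzf pipeline.tar.gz
    source dbt_venv/bin/activate

    # Read the data warehouse service account from Secrets Manager and export it,
    # assuming profiles.yml picks these values up via env_var().
    SECRET_JSON=$(aws secretsmanager get-secret-value \
      --secret-id "${DBT_SECRET_ID}" --query SecretString --output text)
    export SNOWFLAKE_USER=$(echo "${SECRET_JSON}" | python -c "import sys, json; print(json.load(sys.stdin)['user'])")
    export SNOWFLAKE_PASSWORD=$(echo "${SECRET_JSON}" | python -c "import sys, json; print(json.load(sys.stdin)['password'])")

    # Run the requested dbt command (e.g. "dbt run --target prod") against the project.
    cd my_dbt_project
    ${DBT_COMMAND:-dbt run} --profiles-dir .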

For now, I won’t go into the finer details, as this information is covered in the implementation scripts as comments.

Why not dbt rpc?

Using dbt rpc would require you to host dbt in a container that might need to be up and available 24/7. You would also end up increasing the size of the container, as it might need to process multiple dbt data pipelines in parallel.

With Fargate, dbt is instantiated by the RunTask request, the model to execute is passed in as a parameter, and the logs are captured in CloudWatch, so the “dbt rpc” approach would not be a good fit.

Observations

Cost

Across our multiple execution runs of sample projects, the billing incurred was less than $2 USD/day. Your mileage will vary when adopting this approach for larger projects in your own environment.

Execution time

As mentioned earlier, the execution time varies across scenarios. For example, depending on data volume, a typical initial data load could run longer, while incremental updates could be shorter.

Logging

All logging is captured in CloudWatch Logs. I did not provide an implementation to react to the various log errors/conditions; this is something to implement based on your individual use case and requirements.
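
As one possible starting point, a CloudWatch Logs metric filter plus an alarm wired to an SNS topic could flag failed runs. The log group name, filter pattern, and topic ARN below are placeholders.

    # Hypothetical sketch: alert when dbt writes an error line to the CloudWatch log group.
    aws logs put-metric-filter \
      --log-group-name /ecs/dbt-runner \
      --filter-name dbt-error-filter \
      --filter-pattern '"ERROR"' \
      --metric-transformations metricName=DbtErrorCount,metricNamespace=DbtPipelines,metricValue=1

    aws cloudwatch put-metric-alarm \
      --alarm-name dbt-pipeline-errors \
      --namespace DbtPipelines \
      --metric-name DbtErrorCount \
      --statistic Sum \
      --period 300 \
      --evaluation-periods 1 \
      --threshold 1 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --treat-missing-data notBreaching \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:dbt-alerts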

Final Thoughts

This pattern and reference implementation were built with some of the principles of the AWS Well-Architected Framework in mind. It might not address all the points, but it is a start.

While this was done in AWS, a similar implementation can be done in Google Cloud or Azure as well. Be mindful of cost and security aspects, though; this approach did not work as well in Azure.

Need Help with Your Cloud Initiatives?

If you are considering the cloud for migrating or modernizing data and analytics products and applications, or if you would like help, guidance, and a few best practices in delivering higher-value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud (and dbt) service offerings.

Some of My Other Recent Stories

I hope you’ll check out a few of my other recent stories.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap On Tap, on Spotify, Apple, Google, and other popular apps.

Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
