Serverless dbt workloads using AWS Batch

Jevgenij Melnik
Revel Systems Engineering Blog
4 min read · Jun 21, 2022

dbt plays an important role in our data platform. It is much loved by our data engineers: it helps us structure our data warehouse schema and orchestrates the SQL transformations in our ELT pipeline.

Since dbt uses SQL to orchestrate transformations, don't be too quick to assume it is a tool only for slow, boring processing on databases. Vendors offering scalable data warehousing and real-time database solutions use extended SQL as an interface, and behind the scenes they provide powerful features that can take your data pipelines to the next level.

We use it for transforming newly arriving data in mini-batches, processing historical data, building new data models, and ensuring historical reference data integrity in our data marts.

Data ingestion in intervals

Due to the bulky nature of our transformation workflows, it makes sense to run them on a schedule or on demand.

You can run your transformations as often as you need, but throughput and execution time eventually boil down to the amount of data, frequency, transfer latency, IOPS, processing power, and all the other bottlenecks and skew in your data pipelines.

If you are thinking of running scheduled batch processing anywhere along your data ingestion and transformation pipelines, AWS Batch might be a good option for you.

AWS Batch > ECS Fargate > Docker > dbt > Python > AWS Redshift

AWS Batch is a perfect service for hosting our batch transformation jobs. It allocates the configured CPU and memory resources only for the duration of the job, which makes it effectively serverless: you pay for what you use.
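To make the "pay for what you use" part concrete, here is a minimal sketch of registering a Fargate-backed Batch job definition with boto3. The job definition name, image URI, region, and role ARNs are placeholders for illustration, not our actual setup.

```python
# Hypothetical sketch: register a Fargate-backed AWS Batch job definition with boto3.
# All names, ARNs and the image URI below are placeholders.
import boto3

batch = boto3.client("batch", region_name="eu-west-1")  # placeholder region

batch.register_job_definition(
    jobDefinitionName="dbt-transformations",             # placeholder name
    type="container",
    platformCapabilities=["FARGATE"],
    containerProperties={
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/dbt-runner:latest",  # placeholder image
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},               # billed only while the job runs
            {"type": "MEMORY", "value": "2048"},
        ],
        "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "jobRoleArn": "arn:aws:iam::123456789012:role/dbt-batch-job-role",
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
    },
)
```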

It is also familiar ground, since it runs on top of ECS, which many developers and ops engineers already know. ECS Fargate is what makes the setup extremely easy, removing the need to manage ECS clusters, target groups, and EC2 instances.

Under the hood, it runs a Docker container of your choice (Python + dbt in our case). Containerisation is a great way to guarantee a consistent execution environment, whether you run it locally, from CI/CD, or in production.

Orchestrated AWS Batch workload using ECS Fargate
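As an illustration of the container side, here is a hypothetical entrypoint: a thin Python wrapper around the dbt CLI. The environment variable names (DBT_COMMAND, DBT_TARGET) are made up for this example rather than a dbt convention.

```python
# Hypothetical container entrypoint: a thin Python wrapper around the dbt CLI.
# The dbt command and target come from environment variables so the same image
# can be reused for different runs; the variable names are illustrative only.
import os
import subprocess
import sys


def main() -> None:
    dbt_command = os.environ.get("DBT_COMMAND", "run")   # e.g. "run", "test", "build"
    dbt_target = os.environ.get("DBT_TARGET", "prod")    # a target from profiles.yml

    result = subprocess.run(
        ["dbt", dbt_command, "--target", dbt_target],
        check=False,
    )
    # Propagate dbt's exit code so AWS Batch marks the job FAILED on errors.
    sys.exit(result.returncode)


if __name__ == "__main__":
    main()
```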

AWS Batch offers a simple UI for running batch jobs, with the ability to override the command you run or the environment variables, which makes it flexible and versatile at its core.

Batch Job configuration
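The same overrides the console exposes are available programmatically. Below is a rough sketch of submitting a one-off job with an overridden command and environment via boto3; the queue and definition names are placeholders matching the earlier sketch.

```python
# Sketch of submitting an ad-hoc Batch job with overridden command and environment.
# Queue and job definition names are placeholders.
import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="dbt-adhoc-run",
    jobQueue="dbt-fargate-queue",              # placeholder queue
    jobDefinition="dbt-transformations",       # placeholder job definition
    containerOverrides={
        "command": ["python", "entrypoint.py"],
        "environment": [
            {"name": "DBT_COMMAND", "value": "run"},
            {"name": "DBT_TARGET", "value": "prod"},
        ],
    },
)
```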

Batch jobs can then be orchestrated and scheduled using AWS Step Functions + EventBridge, Apache Airflow, or maybe even your CI/CD pipeline with some hooks; a sketch of such a Step Functions definition follows below.

AWS Batch integration with AWS Step Functions state machines
Step Function executions
Batch Job executions
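For illustration, here is a hedged sketch of an Amazon States Language definition, expressed as a Python dict, that submits the Batch job and waits for it to finish via the synchronous batch:submitJob.sync integration. All ARNs are placeholders.

```python
# Sketch of a Step Functions definition (Amazon States Language as a Python dict)
# that runs the Batch job and waits for completion. ARNs are placeholders.
import json

state_machine_definition = {
    "Comment": "Run dbt transformations on AWS Batch",
    "StartAt": "RunDbt",
    "States": {
        "RunDbt": {
            "Type": "Task",
            # .sync makes the state machine wait until the Batch job succeeds or fails.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "dbt-scheduled-run",
                "JobQueue": "arn:aws:batch:eu-west-1:123456789012:job-queue/dbt-fargate-queue",
                "JobDefinition": "arn:aws:batch:eu-west-1:123456789012:job-definition/dbt-transformations",
            },
            "End": True,
        }
    },
}

print(json.dumps(state_machine_definition, indent=2))
```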

With AWS Batch running on Fargate Spot instances, it costs us $0.01 for dbt to orchestrate our daily SQL transformations. This obviously excludes the Redshift costs, where the actual processing happens, and will vary depending on your warehouse provider.

We have also evaluated alternative AWS solutions in the AWS Analytics product suite that are meant to tackle similar workloads:

  • AWS Glue scheduled jobs came close, but they are tailored more towards ETL workloads with Apache Spark and are best suited for transformations on S3 data lakes.
  • AWS Lambda was far from a fit for batch processing; it is better suited to ETL patterns, micro-batching, and real-time processing of streaming data, as are EMR and Spark Streaming.

Since we work with an Agile mindset and use the Scrum framework to organise ourselves, we first had our dbt workflows running on GitLab runners, but we naturally soon grew out of that and needed something more stable and production-ready.

AWS Batch works great for us in the meantime and helps us save money (most of the spend goes into Redshift anyway). We also use AWS Step Functions to orchestrate our transformation workloads and EventBridge for scheduling.
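As an example of the scheduling piece, the sketch below sets up an EventBridge rule that triggers a Step Functions state machine on a daily cron. The rule name, ARNs, and cron expression are placeholders for whatever cadence you need.

```python
# Illustrative sketch: an EventBridge rule that starts the Step Functions state
# machine once a day. Rule name, ARNs and schedule are placeholders.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="dbt-daily-run",
    ScheduleExpression="cron(0 3 * * ? *)",   # 03:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="dbt-daily-run",
    Targets=[
        {
            "Id": "dbt-state-machine",
            "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:dbt-transformations",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-stepfunctions",
        }
    ],
)
```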

This is not the last stop for us as we mature and expand our BI, Analytics, ML, and Data Platform teams here at Revel. Peek in for available positions; we are hiring.
