A “Less Server” Data Infrastructure Solution for Ingestion and Transformation Pipelines — Part 1
Written by Michael Triska — Data Engineer at AMARO
An Almost Serverless Data Architecture for Orchestrating ETL Jobs Using AWS Step Functions, AWS ECS and Snowflake.
The main takeaway is that there is no reason to go all-in on a single type of computation; when it comes to designing a data ingestion and transformation pipeline architecture, however, removing complexity and technical debt should be a top priority. But let’s take a step back.
At AMARO, we now run small to medium-sized Python and SQL tasks against our data warehouse, Snowflake, using AWS ECS Fargate managed containers orchestrated by AWS Step Functions. This blog post outlines the discussion and the decision drivers behind a solution that integrates long-running data ingestion and transformation processes into our data workflow while keeping infrastructure to a minimum. A future blog post will lay out the technical dimensions of this infrastructure solution.
Hi Airflow. See You When I See You.
Before examining the high-level architecture, it is worth reviewing our previous data orchestration stack. Previously, we used Airflow as our main data orchestration tool, submitting SQL queries to Amazon Redshift and unloading data to S3.
Airflow is widely regarded as one of the key open-source data workflow management tools available; in our experience, however, it was frustrating because of the operational headaches of maintaining servers. Drawing on the concept of incorrect workflow abstraction, Laughlin (2018) lays out Airflow’s main weaknesses in “We’re All Using Airflow Wrong and How to Fix It”. Another significant recent discussion of the subject, “Why Not Airflow?”, was presented by White (2019), who notes: “Airflow’s applicability is limited by its legacy as a monolithic batch scheduler aimed at data engineers principally concerned with orchestrating third-party systems employed by others in their organizations. Today, many data engineers are working more directly with their analytical counterparts.”
Strategies to enhance a data infrastructure stack can involve huge, high-stakes, and irreversible decisions for the ongoing data platform. There is no “one size fits all” data infrastructure solution; in our case, however, the decision to migrate from Airflow to a custom data orchestration tool (and from Redshift to Snowflake) fit our organizational requirements and lays the foundation for all of our future data infrastructure. For us, this period was a blur of overwhelming excitement and adrenaline, and in the end, I am proud of what the data team has accomplished.
Unlike in traditional data orchestration architectures, we now run stateless, event-triggered compute containers on AWS ECS Fargate, controlled by AWS Step Functions. Image 1 shows the overall high-level architecture drawing of the implemented solution.
Most data ingestion or ETL applications execute a lightweight, specific piece of business functionality: calling an external API, applying a minor transformation to the data, and ingesting it into a database. Because they are so lightweight, these applications need only a minimal operating system, libraries, and components. We therefore designed a generic, slim Python container that contains exactly what most applications need, such as SQL connectors and the most important Python libraries for ETL jobs.
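To make the fetch–transform–load shape concrete, here is a minimal sketch of the kind of job such a container runs. The function names, columns, and sample record are illustrative, not our production code; the final load would go through a warehouse connector such as snowflake-connector-python.

```python
def transform(records):
    """Normalize raw API records into rows for the warehouse (hypothetical schema)."""
    rows = []
    for rec in records:
        rows.append({
            "order_id": int(rec["id"]),
            "total_brl": round(float(rec["total"]), 2),
            "status": rec.get("status", "unknown").lower(),
        })
    return rows

def build_insert(table, rows):
    """Render a parameterized INSERT plus its bind parameters for the connector."""
    cols = sorted(rows[0])
    placeholders = "(" + ", ".join(f"%({c})s" for c in cols) + ")"
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES {placeholders}"
    return sql, rows

# Example run with a fabricated raw record:
raw = [{"id": "101", "total": "99.90", "status": "PAID"}]
sql, params = build_insert("orders", transform(raw))
# In production: cursor.executemany(sql, params) via the Snowflake connector.
```

The transformation stays pure Python, so it can be unit-tested without any warehouse connection; only the last step touches Snowflake.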
The main flow of events starts with an AWS Step Functions state machine triggered through AWS CloudWatch. Let’s see how we orchestrate our workflows (see image 2) at AMARO:
- AWS CloudWatch triggers AWS Step Functions based on a schedule.
- Once an AWS Step Functions state machine is started, it controls and starts an AWS ECS Fargate launch type task. Note that for now, our AWS Step Functions do not declare a complex order of operations; this is done within the applications themselves. Here we are considering tools like Dagster or Prefect.io to standardize our Python codebase; for SQL transformations we use dbt.
- The Fargate task runs a pre-built Docker container stored in AWS ECR: the container clones an application hosted on GitHub and executes it. Applications either submit SQL statements to our data warehouse and/or run Python code to ingest or transform data in Snowflake. In special use cases, we use Snowflake’s Snowpipe to ingest data and save warehousing costs. From here, we could also easily coordinate other AWS services.
- If the application raises an error, an AWS Lambda function is invoked that notifies the team and the product owner via different channels.
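The flow above can be sketched as an Amazon States Language definition, here built as a Python dict. All ARNs, the cluster, and the task definition are placeholders, and the network configuration a Fargate task also needs is omitted for brevity; this is a rough shape, not our actual state machine.

```python
import json

state_machine = {
    "Comment": "Run one Fargate ETL task; notify on failure (placeholder ARNs)",
    "StartAt": "RunEtlTask",
    "States": {
        "RunEtlTask": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the ECS task to finish.
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:REGION:ACCOUNT:cluster/etl-cluster",
                "TaskDefinition": "arn:aws:ecs:REGION:ACCOUNT:task-definition/etl-task",
                # NetworkConfiguration (subnets, security groups) omitted here.
            },
            # Any error routes to the notification Lambda instead of failing silently.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:notify-team",
            "End": True,
        },
    },
}

definition = json.dumps(state_machine)  # what you would pass to create_state_machine
```

The `Catch` block is what wires step 4 of the list into the state machine: the Lambda only runs when the ECS task errors.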
The solution enables us to manage data ingestion and transformation pipelines with more ease and flexibility than before, reducing our responsibility for operating cloud infrastructure and letting us reallocate time and people to problems unique to the organization. These workflows have significant implications for planning, building, and deploying software in a way that maximizes value by minimizing undifferentiated heavy lifting.
Another big factor in the attractiveness of this architecture is pricing. We pay only US$2.77 per day for 11 separate ingestion and ETL processes running every day. That’s it.
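A back-of-the-envelope formula shows where a daily bill like this comes from: Fargate charges per vCPU-hour and per GB of memory-hour. The rates and task sizes below are illustrative assumptions from around the time of writing, not a quote; check the current Fargate pricing page for your region.

```python
# Assumed (illustrative) Fargate rates -- verify against the current pricing page.
VCPU_HOUR_USD = 0.04048   # per vCPU-hour
GB_HOUR_USD = 0.004445    # per GB of memory per hour

def task_cost(vcpu, memory_gb, runtime_hours):
    """Cost in USD of one Fargate task run, given size and runtime."""
    return runtime_hours * (vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD)

# e.g. a hypothetical 0.25 vCPU / 0.5 GB task running for 30 minutes:
cost = task_cost(0.25, 0.5, 0.5)
```

Because billing stops when each task stops, short-running jobs sized this way stay in the cents-per-run range, which is how 11 daily processes add up to only a few dollars.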
Don’t Let the Word Serverless Confuse You
Serverless has gotten a lot of buzz and wide adoption among developers and in marketing campaigns, and has become an architecture trend since it allows for massive decoupling of components. For data ingestion, ETL, or machine learning applications, however, a Lambda-based architecture is often simply not an option because of inevitably long-running and memory-intensive processes.
One valid question that needs to be addressed is whether an existing container-based application becomes serverless simply by using the AWS ECS Fargate launch type. Fargate greatly reduces the complexity of orchestrating ECS containers, but you still define application-level resources such as the cluster, service, task definition, and container definition, as well as CPU and memory requirements, which can be confusing at first if you have never worked with container infrastructure. However, Amazon does a great job of providing a console workflow that lets you save a CloudFormation template you can revisit later (e.g. for security configurations).
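Those application-level pieces look roughly like the following, sketched here as the argument you would pass to boto3’s `ecs.register_task_definition`. The family name, container name, and ECR image URI are placeholders.

```python
# Hypothetical Fargate task definition; names and image URI are placeholders.
task_definition = {
    "family": "etl-generic",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",   # required for the Fargate launch type
    "cpu": "256",              # 0.25 vCPU, expressed as a string
    "memory": "512",           # 512 MiB; must be a valid pairing with the cpu value
    "containerDefinitions": [{
        "name": "etl",
        "image": "ACCOUNT.dkr.ecr.REGION.amazonaws.com/etl-generic:latest",
        "essential": True,
    }],
}

# In production (requires AWS credentials):
# boto3.client("ecs").register_task_definition(**task_definition)
```

Even in this minimal form you can see the point made above: the servers are gone, but the application-level architecture is still yours to declare.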
By decoupling components like the job scheduler and application code, and pushing heavy executions into Snowflake, our data orchestration architecture with AWS Step Functions utilizes a computing model that abstracts away infrastructure management, letting developers focus primarily on business logic by (at most) writing code that consumes other services. Roughly speaking, it is an event-driven, utility-based, stateless code execution environment in which you write code and consume services.
This definition shapes our understanding of a truly serverless data ingestion and transformation infrastructure. Here, Fargate helps bridge the gap between serverless and containers.
Cloud services make it easy to build a modern application that requires little maintenance and is also cheap. Our new data orchestration abstraction is a huge relief from the massive operational burden we carried in the past, and it is powerful enough to let us build and deploy applications quickly, without having to consider server capacity, storage running out of space, operating systems, or accumulating logs.
Using this simple orchestration workflow with AWS Step Functions and the AWS ECS Fargate launch type, we chose an architecture that auto-scales and requires no manual intervention once it is set up, so that we can focus exclusively on the business value that matters and is fun — data.
Another significant realization is that there is no reason to go all-in on a single type of computation. With AWS Step Functions it is easy to choose the right abstraction layer of orchestration, which gives us access to a variety of other AWS services and lets us adapt to each application’s use case.
Are you also sick and tired of complicated abstractions stopping you from writing code? Do you want to discuss other ETL architecture designs, or why we didn’t use AWS Glue as our main ETL engine? Comment below!
Laughlin (2018). “We’re All Using Airflow Wrong and How to Fix It.” https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753. Last viewed 25.10.2019.
White (2019). “Why Not Airflow?” https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4. Last viewed 25.10.2019.