Orchestrating Data ETL with State Machines
Pro tips for using AWS Step Functions
Introduction
Welcome to this first post by PromoFarma’s Data Team!
In this post we will explain how we have used Step Functions on Amazon Web Services to revolutionize our ETL process, and in future posts we look forward to sharing other projects that we are working on within the fields of Data Engineering, Data Science, and Machine Learning.
To put this post in context, we had been using AWS Data Pipeline to run nightly jobs that loaded our data warehouse (Redshift) from our transactional database (RDS) and performed data transformations, but we outgrew that solution. In particular, it was not fast enough once we wanted to run our ETL more than once a day, so that our data warehouse, and therefore our data-driven decisions, would stay more up to date.
Objective
Efficiently and reliably load our data warehouse with fresh data, multiple times a day, in a scalable, parallelized fashion, while also handling task dependencies.
Examples
- Step Function w/ Multiple, Dependent Levels of Parallel Processes
A Step Function consisting of four “Parallel” states, one per level, where level 1 tasks will not start until all of the level 0 tasks have finished. Level 0 has a “Branch” for each table, each containing an RDS (transactional DB) to Redshift (data warehouse) ETL “Task”. The subsequent levels depend on those tables and perform Redshift to Redshift tasks; for example, inserting the table rows which have changed into a historic table, also in our data warehouse. So, this structure is useful to parallelize tasks while also handling their dependencies.
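As an illustrative sketch (the state names, branch counts, and Lambda ARN below are placeholders, not our actual configuration), two such chained levels look roughly like this in the state machine JSON:

```json
{
  "StartAt": "Level0",
  "States": {
    "Level0": {
      "Type": "Parallel",
      "Branches": [
        {"StartAt": "LoadTable0", "States": {"LoadTable0": {"Type": "Task",
          "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:etl", "End": true}}},
        {"StartAt": "LoadTable1", "States": {"LoadTable1": {"Type": "Task",
          "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:etl", "End": true}}}
      ],
      "Next": "Level1"
    },
    "Level1": {
      "Type": "Parallel",
      "Branches": [
        {"StartAt": "HistoricTable0", "States": {"HistoricTable0": {"Type": "Task",
          "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:etl", "End": true}}}
      ],
      "End": true
    }
  }
}
```

Because "Level0" declares "Next": "Level1", none of the level 1 branches run until every level 0 branch has finished.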
- Step Function w/ Catches, for Parallel Processes w/o Dependencies
A Step Function consisting of a single “Parallel” state with “Branches” for each table, where each branch has an RDS to Redshift ETL “Task” and a “Catch” state to handle any errors. Without the catches, if one of the tasks running in parallel failed, the entire Step Function would fail and the other tasks would be automatically stopped. So, this structure is useful to parallelize tasks without dependencies.
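A rough sketch of one such branch, with a “Catch” routing any error to a “Pass” state so the sibling branches keep running (all names and the ARN are placeholders):

```json
{
  "StartAt": "LoadAllTables",
  "States": {
    "LoadAllTables": {
      "Type": "Parallel",
      "Branches": [
        {"StartAt": "LoadTable0", "States": {
          "LoadTable0": {"Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:etl",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Table0Failed"}],
            "End": true},
          "Table0Failed": {"Type": "Pass", "End": true}}}
      ],
      "End": true
    }
  }
}
```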
- Step Function Loop
A Step Function where the last “Task”, Relaunch, is a Lambda Function which evaluates a condition and, if it is not yet true, starts a new execution of the same Step Function, simulating a loop. For example, we have used this structure to run the same sequence of tasks, one month at a time, up to the current month.
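A minimal sketch of what such a Relaunch Lambda Function could look like, assuming the execution input carries the month being processed and the Step Function’s ARN (both field names are assumptions for illustration, not our exact code):

```python
import json
from datetime import date


def next_month(year, month):
    """Return the (year, month) pair following the given one."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def should_relaunch(year, month, today=None):
    """Keep looping while the processed month is before the current month."""
    today = today or date.today()
    return (year, month) < (today.year, today.month)


def lambda_handler(event, context):
    year, month = event["year"], event["month"]
    if should_relaunch(year, month):
        import boto3  # imported lazily so the pure logic above is testable offline
        sfn = boto3.client("stepfunctions")
        new_year, new_month = next_month(year, month)
        # Start a fresh execution of the same Step Function with the next month
        sfn.start_execution(
            stateMachineArn=event["state_machine_arn"],
            input=json.dumps({**event, "year": new_year, "month": new_month}),
        )
    return event
```

Each execution processes one month and then hands off to a brand-new execution, so no single execution ever runs longer than one iteration of the loop.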
Noteworthy Benefits
- Pricing
You pay by the number of state transitions executed, and that cost is just $0.0250 (as of July 2018) per 1,000!
You of course also have to pay for the compute on which each state’s task is run, but especially if you use Lambda Functions or a server that you’re already paying for, that cost will be quite low as well.
- Flexibility
With Step Functions, the possibilities are endless: any state machine structure can be created with the Pass, Task/Activity, Choice, Wait, Succeed, Fail, Parallel, Catch, Retry, Timeout, etc. building blocks. Additionally, the tasks/activities are flexible in that they can be written in any language and executed on any compute hardware; even within the same Step Function, each can have a different implementation.
- Visual Console
Our Step Function executions are started programmatically and alarms notify us of any errors, but still, we find it beneficial that a visual console also exists.
In particular, when first creating, or later modifying, a Step Function’s structure you can refresh the visualization of the states and transitions as you write or change the JSON.
Also, when debugging an error, the console shows you the exception message from the compute layer. Additionally, you can use the console to check on which Step Functions are currently running.
Plays Well With
- AWS Lambda
Step Function tasks are designed by default to run on Lambda: just set the specific Lambda Function’s ARN as the “Resource” property of a “Task” state in the JSON.
The Lambda Function will receive the Step Function input (according to the state’s InputPath; see the following Gotchas section for more info) in its event parameter, and the Lambda Function’s return value will be its output (according to the state’s ResultPath) to the Step Function.
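For instance, a minimal ETL-style handler could look like this (the field names match the InputPath example in the Gotchas section below; the actual ETL work is elided):

```python
import json


def lambda_handler(event, context):
    # `event` is the portion of the Step Function state input selected by
    # the state's InputPath; here it is assumed to be a JSON-encoded string.
    params = json.loads(event)
    table = params["redshift_table_name"]
    # ... run the extract/load for `table` here (elided) ...
    # The return value becomes the state's output, per its ResultPath.
    return {"status": "done", "redshift_table_name": table}
```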
- AWS Redshift COPY FROM S3
The fastest and most reliable method of loading data into a Redshift table is the Redshift COPY command. The data must already be in S3 (see the following bullet point), and I would suggest creating an IAM role to give Redshift permission to read from S3 instead of always embedding your AWS keys.
COPY table_name
FROM 's3://'
DELIMITER AS '|'
TRUNCATECOLUMNS
ACCEPTANYDATE
ESCAPE
CREDENTIALS 'aws_iam_role=arn:aws:iam::'
- AWS Aurora SELECT INTO OUTFILE S3
The fastest and most reliable method of extracting data from RDS into S3 is the Aurora SELECT INTO OUTFILE S3 command. If your RDS database is not already Aurora (just the standard MySQL engine, for example), you can create an Aurora read replica and extract from that instead.
query
INTO OUTFILE S3 's3://'
CHARACTER SET utf8
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
- AWS CloudWatch Alarms
Within CloudWatch (under the namespace States), there are a variety of Step Function metrics available, in particular ExecutionsFailed, which is useful for creating alarms. Do note, though, that if you use catches, an execution whose task errors were all caught will be reported as successful, so ExecutionsFailed will not alert you to those errors.
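For example, an ExecutionsFailed alarm could be created with boto3 along these lines (the alarm name, threshold, period, and SNS topic are illustrative assumptions):

```python
def alarm_definition(state_machine_arn, sns_topic_arn):
    """Build the put_metric_alarm arguments for an ExecutionsFailed alarm.

    The alarm name, period, and threshold here are illustrative choices.
    """
    return {
        "AlarmName": "etl-step-function-failed",
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(state_machine_arn, sns_topic_arn):
    import boto3  # lazy import so alarm_definition stays testable offline
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**alarm_definition(state_machine_arn, sns_topic_arn))
```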
- Version Control
Since the structure of a Step Function is just JSON, you can easily store that configuration in your version control. Additionally, we chose to write our ETL task Lambda Function as a template. For RDS to Redshift tasks it takes as input the SELECT query to run on our transactional database to extract the data. For Redshift to Redshift tasks it takes as input the INSERT to run in our data warehouse to transform the data.
So, our single ETL Lambda Function is reusable, straightforward, and rarely requires code changes. The code which does change, the specific SELECTs and INSERTs, our Lambda Function reads from files in S3. This is advantageous since we can manage that code in our version control and simply use a deploy script to update that S3 bucket with the latest version, or roll back to an earlier version, of all the files.
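A sketch of that read-from-S3 pattern (the bucket layout and key naming below are assumptions for illustration, not our exact scheme):

```python
def sql_key(task_name):
    """Map a task name to its SQL file in the deploy bucket (hypothetical layout)."""
    return f"etl-sql/{task_name}.sql"


def load_sql(bucket, task_name):
    """Fetch the versioned SELECT/INSERT text for a task from S3."""
    import boto3  # lazy import: the key construction above is testable offline
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=sql_key(task_name))
    return obj["Body"].read().decode("utf-8")
```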
Gotchas
- AWS Lambda Functions have a maximum execution time of 5 minutes
By default, Lambda Functions will time out after 3 seconds, and you can only increase this maximum execution time up to 300 seconds. If a task exceeds that limit, in Step Functions you will see the exception:
Error: Lambda.Unknown
Cause: The cause could not be determined because Lambda did not return an error type.
To get around this limitation you could split that state’s task into multiple, shorter tasks, or instead use “Activities”. Activities allow you to run a state’s task on any sort of compute worker, not just Lambda. An Activity is easy to configure and the worker code is simple to write (use get_activity_task, send_task_success, and send_task_failure from the AWS API), but this does mean that you will need to keep a worker running continuously, polling Step Functions to see if there is a new task to execute.
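A minimal activity worker along those lines might look like this (the worker name and the work done in process_task are placeholders):

```python
import json


def process_task(task_input):
    """Placeholder for the real work; here it just echoes the parsed input."""
    return json.loads(task_input)


def run_worker(activity_arn):
    """Continuously long-poll Step Functions for activity tasks (sketch)."""
    import boto3  # lazy import so process_task is testable offline
    sfn = boto3.client("stepfunctions")
    while True:
        # get_activity_task long-polls, returning when a task is available
        task = sfn.get_activity_task(activityArn=activity_arn, workerName="etl-worker")
        token = task.get("taskToken")
        if not token:
            continue  # the poll timed out with no work; poll again
        try:
            result = process_task(task["input"])
            sfn.send_task_success(taskToken=token, output=json.dumps(result))
        except Exception as exc:
            sfn.send_task_failure(taskToken=token, error="TaskFailed", cause=str(exc))
```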
- InputPath
With "InputPath" you select the portion of the Step Function input which is passed to the task. By default it is the entire input, "$", but that was not particularly useful for us, since we wanted to re-use the same Lambda Function and just pass each task a different part of the input.
{
  "inputs": [
    "{\"redshift_table_name\": \"rds_table_0\", \"redshift_mode\": \"overwrite_existing\"}",
    "{\"redshift_table_name\": \"redshift_table_1\", \"redshift_mode\": \"script\"}"
  ]
}
So, for example, if the Step Function input is the JSON above, you can use "$.inputs[0]" for the first task and "$.inputs[1]" for the second task.
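Those two Task states could then be declared along these lines within the state machine’s "States" object (the ARN and state names are placeholders):

```json
"LoadTable0": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:etl",
  "InputPath": "$.inputs[0]",
  "Next": "LoadTable1"
},
"LoadTable1": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:etl",
  "InputPath": "$.inputs[1]",
  "End": true
}
```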
Conclusions
Switching to AWS Step Functions for our Data ETL reduced our overall job time by about 80%, so we would certainly recommend it!
In addition to the time efficiency, this switch has reduced our costs, and our ETL process is now altogether more manageable, reliable, and scalable.