Serverless ETL with AWS Step Functions

--

Diagram showing a simple state machine transitioning between three states.

Hi, I’m Jamie, a product engineer at John Lewis & Partners. I’m currently building AWS hosted microservices for Waitrose in the Identity team.

We recently used AWS Step Functions, in conjunction with AWS Athena and AWS Lambda, to extract, transform, and load 8 million records into configuration entries for new services in 57 hours, with zero impact on the normal operation of Waitrose.com.

Background

The Identity team were building a set of services to selectively enable or disable certain customer journeys for Waitrose.com or in-store based on attributes of the customer’s online account.

We built an event-driven system (using SNS/SQS/Lambda) to trickle load the customer’s initial state based on events created by other Waitrose.com microservices. This took care of any new configuration changes after we deployed to production, but we needed to backfill the initial configuration state for the existing 8 million or so registered customer accounts.

As a team, we’ve been wanting to try building something with Step Functions for a long time, and this seemed like the perfect time to experiment a little.

First Iteration

Our first iteration, built in our development environment, was very simple. The overall architecture looked like this:

AWS architecture diagram showing a flow from Step Function to SNS, to SQS, and finally to Lambda
First iteration of the ETL architecture

We had a single workflow that would execute an Athena query, get the results, and then transform them into appropriate messages to publish onto the SNS topic used by the trickle load services. The workflow looked a little like this:

A diagram showing the first iteration of the single workflow step function
First iteration of the Step Function workflow
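
To make the shape of that first workflow concrete, here's roughly the same logic written as plain Python with boto3 rather than as a state machine definition. It's a sketch only: the query, database, results bucket and topic ARN are placeholders, not the values we actually used.

```python
import json
import time

import boto3

athena = boto3.client("athena")
sns = boto3.client("sns")

# Illustrative placeholders - the real query, database and topic are not shown here.
QUERY = "SELECT customer_id, account_status FROM customer_accounts"
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:identity-trickle-load"


def run_etl():
    # 1. Execute the Athena query.
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "identity"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # 2. Wait for the query to finish.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    # 3. Page through the results and publish each row as a trickle-load message.
    first_page = True
    for page in athena.get_paginator("get_query_results").paginate(QueryExecutionId=query_id):
        rows = page["ResultSet"]["Rows"]
        if first_page:
            rows = rows[1:]  # Athena returns the column headers as the first row
            first_page = False
        for row in rows:
            customer_id, account_status = (col.get("VarCharValue") for col in row["Data"])
            sns.publish(
                TopicArn=TOPIC_ARN,
                Message=json.dumps({"customerId": customer_id, "accountStatus": account_status}),
            )
```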

We quickly learned that this would not be fast enough and that we would need to parallelise the workflows to process the data in a reasonable timeframe.

We also discovered that after around 2900 records, the workflow would hit the execution history limit of 25,000 events. Each time the workflow executes an action, it generates a history item for each transition that the action goes through. This means that a single action can generate multiple history items. As an example, publishing to SNS produces 5 history items each time you publish a batch of messages.
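
One way to see this accounting in practice is to total up the history events an execution has accumulated from outside the workflow. A minimal boto3 sketch:

```python
import boto3

sfn = boto3.client("stepfunctions")


def history_event_count(execution_arn: str) -> int:
    """Total the history events recorded for one execution.

    A standard workflow execution fails once it reaches 25,000 events, and a
    single action adds several events (an SNS publish adds around five), so the
    usable number of records per execution is far lower than 25,000.
    """
    paginator = sfn.get_paginator("get_execution_history")
    return sum(len(page["events"]) for page in paginator.paginate(executionArn=execution_arn))
```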

These issues led to our second iteration…

Second Iteration

This article describing common pitfalls of Step Functions was incredibly useful when we were working out how to overcome the execution history limitation, and after some changes our architecture looked like this:

AWS architecture diagram showing a flow from an orchestrator Step Function to parallel executor Step Functions, to SNS, to SQS, and finally to Lambda
Second iteration of the ETL architecture

We split the Step Function into two: an orchestrator and an executor. The orchestrator workflow would run the Athena queries and start new executor workflows to process the results:

A diagram showing the orchestrator workflow
The new orchestrator workflow
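
Sketched again as plain Python with boto3 rather than Amazon States Language, the orchestrator's job looks roughly like this (the executor ARN, database, bucketing column and output location are placeholders):

```python
import json

import boto3

athena = boto3.client("athena")
sfn = boto3.client("stepfunctions")

# Placeholder values - the real executor ARN, database and bucketing are not shown here.
EXECUTOR_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:etl-executor"
REGISTRATION_YEARS = range(2005, 2023)


def orchestrate():
    for year in REGISTRATION_YEARS:
        # Start one Athena query per bucket of customers.
        execution = athena.start_query_execution(
            QueryString=(
                "SELECT customer_id, account_status FROM customer_accounts "
                f"WHERE registration_year = {year}"
            ),
            QueryExecutionContext={"Database": "identity"},
            ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
        )
        query_id = execution["QueryExecutionId"]

        # Kick off a separate executor workflow to process that query's results,
        # starting from the beginning of its result set.
        sfn.start_execution(
            stateMachineArn=EXECUTOR_ARN,
            name=f"etl-{year}-{query_id}",
            input=json.dumps({"queryExecutionId": query_id}),
        )
```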

The executor workflow was mostly the same as before, except that now, after publishing each batch of messages to SNS, it would check how close it was to the execution history limit. If it was within a certain margin, it would start a new executor workflow to carry on from its current position. It looked like this:

A diagram showing the second iteration of the executor workflow
The new executor workflow
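
In the real system this logic lives inside the state machine, which keeps a rough running count of the history events it has generated and hands over when it gets within a safety margin of the limit. Here is the same hand-over pattern sketched in plain Python with boto3; the thresholds, ARNs and field names are illustrative:

```python
import json

import boto3

athena = boto3.client("athena")
sns = boto3.client("sns")
sfn = boto3.client("stepfunctions")

# Illustrative values - the real margin, ARNs and message shape are not shown here.
EVENTS_PER_BATCH = 5            # roughly what one SNS publish adds to the history
HISTORY_LIMIT = 25_000
SAFETY_MARGIN = 1_000
EXECUTOR_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:etl-executor"
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:identity-trickle-load"


def run_executor(event):
    query_id = event["queryExecutionId"]
    next_token = event.get("nextToken")
    events_used = event.get("eventsUsed", 0)

    while True:
        kwargs = {"QueryExecutionId": query_id, "MaxResults": 10}
        if next_token:
            kwargs["NextToken"] = next_token
        page = athena.get_query_results(**kwargs)

        rows = page["ResultSet"]["Rows"]
        if not next_token:
            rows = rows[1:]  # the very first page starts with the column headers

        entries = [
            {"Id": str(i), "Message": json.dumps({"customerId": row["Data"][0]["VarCharValue"]})}
            for i, row in enumerate(rows)
        ]
        if entries:
            sns.publish_batch(TopicArn=TOPIC_ARN, PublishBatchRequestEntries=entries)
        events_used += EVENTS_PER_BATCH

        next_token = page.get("NextToken")
        if not next_token:
            return  # finished this query's results

        # If we are getting close to the history limit, hand over to a fresh
        # execution that carries on from the current pagination token.
        if events_used > HISTORY_LIMIT - SAFETY_MARGIN:
            sfn.start_execution(
                stateMachineArn=EXECUTOR_ARN,
                input=json.dumps(
                    {"queryExecutionId": query_id, "nextToken": next_token, "eventsUsed": 0}
                ),
            )
            return
```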

Having the orchestrator execute the queries allowed us to effectively parallelise the process, so that each executor was processing the results of a single discrete query.

This proved to be a very reliable process and, with a little further refinement to pass around input parameters and cover error scenarios, was the iteration that we took to production:

A diagram showing the final executor workflow
The final executor workflow

Key Lessons Learned

During the development of the ETL process we learned some valuable lessons:

1. Low Code Workflow Studio

Last year AWS introduced Step Functions Workflow Studio, a visual low-code tool for designing and building workflows.

This powerful tool really sped up the development of our state machines, and it let us export the finished workflow in a format that slotted into our existing CloudFormation-based deployment process.

2. Avoid the history limit

Standard Step Functions workflows have an execution history limit of 25,000 events. This is a hard limit, and if it is reached the execution fails.

We avoided this by nesting workflows and starting new executions when close to reaching the history limit.

3. Optimise for speed

Early on we decided to parallelise and run multiple "executor" step functions simultaneously. This allowed us to process data in discrete batches and, no surprise, was much faster than a single thread of processing would have been.

Unfortunately our selection process for batching data was too naive: we grouped data by customer registration year. This led to some buckets being much smaller than others and consequently finishing much faster.

If we had evened out the buckets, all the executors would have finished at similar times, and the overall run would most likely have completed much more quickly.
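
One simple alternative would have been to spread customers across a fixed number of roughly equal buckets, for example by taking the customer ID modulo the bucket count. A sketch, assuming numeric customer IDs and illustrative table and column names:

```python
# Sketch of an alternative bucketing strategy: a fixed number of roughly
# equal-sized buckets instead of one bucket per registration year.
# Assumes a numeric customer_id; table and column names are illustrative.
NUM_BUCKETS = 16


def bucket_query(bucket: int) -> str:
    return (
        "SELECT customer_id, account_status FROM customer_accounts "
        f"WHERE mod(cast(customer_id AS bigint), {NUM_BUCKETS}) = {bucket}"
    )


queries = [bucket_query(b) for b in range(NUM_BUCKETS)]
```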

4. Correlation IDs for monitoring

At Waitrose, we have custom correlation ID fields to allow us to trace requests across multiple services in our logs. We used this very successfully during this process to follow each parallel execution by constructing a specially formatted correlation ID:

etl-<initial query id>-<step function id>-<unique customer id>

This allowed us to track the progress of:

- an overall bucket of data through the query ID
- an individual step function execution
- an individual customer record

We could also make use of stats queries in CloudWatch Logs Insights to calculate our percentage completion at various times.
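
As a sketch of how that can be driven programmatically, assuming a log group and field layout that won't match ours exactly:

```python
import time

import boto3

logs = boto3.client("logs")

# Assumed field name and correlation ID layout; adjust to match the services' logs.
QUERY = """
fields @timestamp, correlationId
| filter correlationId like /^etl-/
| parse correlationId "etl-*-*-*" as queryId, executionId, customerId
| stats count_distinct(customerId) as processed by queryId
"""


def completion_stats(log_group: str, window_seconds: int = 3600):
    now = int(time.time())
    started = logs.start_query(
        logGroupName=log_group,
        startTime=now - window_seconds,
        endTime=now,
        queryString=QUERY,
    )
    while True:
        result = logs.get_query_results(queryId=started["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return result["results"]
        time.sleep(1)
```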

5. Retry Everything

We learned a valuable lesson about the failure rates of the optimised integrations that Step Functions provides for working with other AWS services, such as publishing to SNS. When you're making millions of calls over a prolonged period of time, you're likely to see the odd failure here and there.

We experienced a couple of these failures, which halted the progress of some of the executors. Luckily we had easy ways of restarting the process from any state, but if we'd made more liberal use of the built-in error handling, we wouldn't have needed to restart them at all.
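
For illustration, attaching the built-in retry behaviour to the SNS publish task looks roughly like this in the state machine definition, written here as a Python dict that would be serialised into Amazon States Language; the error list, intervals and state names are illustrative rather than what we used:

```python
# Sketch of a Task state that publishes to SNS via the optimised integration,
# with a broad Retry policy attached. Serialise the surrounding state machine
# definition with json.dumps() when building it. Values are illustrative.
publish_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {
        "TopicArn": "arn:aws:sns:eu-west-1:123456789012:identity-trickle-load",
        "Message.$": "$.message",
    },
    "Retry": [
        {
            # States.ALL retries on any error; a tighter policy would list the
            # specific errors the SNS integration can raise.
            "ErrorEquals": ["States.ALL"],
            "IntervalSeconds": 2,
            "MaxAttempts": 5,
            "BackoffRate": 2.0,
        }
    ],
    "Next": "CheckHistoryBudget",
}
```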

Final Thoughts

We really enjoyed working with Step Functions for this kind of workflow, we learned a lot in doing so, and we definitely plan to use it again in the future.

At the John Lewis Partnership we value the creativity of our engineers to discover innovative solutions. We craft the future of two of Britain’s best loved brands (John Lewis & Waitrose).

We are currently recruiting across a range of software engineering specialisms. If you like what you have read and want to learn how to join us, take the first steps here.

--

Jamie McDonald
John Lewis Partnership Software Engineering

Senior Product Engineer at John Lewis & Partners, developing AWS hosted microservices for Waitrose.