Serverless Orchestration with AWS Step Functions: Lessons Learned

Fatlum Vranovci
Oct 22
The London Philharmonic Orchestra performing at the Southbank Centre

The emergence of cloud computing, and more recently of MLOps (the combination of Machine Learning and Operations), has shown that businesses are eager to take advantage of machine learning technology. Although businesses in certain sectors already use ML, many are still in the early stages of adoption: they have yet to advance their ML capabilities beyond a science experiment.

The Applied Data Science and Machine Learning (ADSML) team at Sainsbury’s Tech has a vision: “To give colleagues and customers access to automated data-driven support for all their complex decision making”. We want to help make these decision-making algorithms as accessible and as automated as possible.

Around a year ago, the ADSML team was new to the world of serverless architecture and to building data pipelines that productionise data science algorithms. We built our first couple of pipelines with similar approaches: the first was 10–12 different Lambda functions chained together, each triggered by the previous Lambda landing a file in S3. Each Lambda performed a small part of the pipeline, covering ETL (Extract-Transform-Load), algorithm execution and post-processing. Although these pipelines worked well, we soon discovered the limitations of this approach and of Lambda functions in general. These limitations were:

  • Messy S3 buckets: lots of files landing in different areas of S3
  • No single view of the execution flow, and no graphical or easy way to track executions in real time
  • With around 8,000 executions per day, it was very difficult to pinpoint where a failure had occurred
  • It was difficult to implement branching logic. Branching logic is when the pipeline can split and go down different routes depending on the outcome of a step. For example, there may be a step which determines whether we should retrain a model; if so, the pipeline proceeds down the retraining route, and if not, down the inference route (we sketch this later with a Choice state).

The solution was to use a tool which could orchestrate all our Lambda functions: one that allows engineers to build pipelines where points of failure are easy to identify, errors are dealt with and state isn’t lost. This is where a service named AWS Step Functions comes in.

Step Functions is an AWS service which gives users a reliable way to chain together all the components of a pipeline. It’s a fully managed service, which means you don’t have to worry about setting up the infrastructure to run it. All of this is handled by AWS, so there’s no more painful machine configuration and maintenance (see Airflow). It integrates with various AWS services, and executions can be coordinated and tracked visually. You can create pipelines which are easy to run and debug, and it also makes branching and parallel steps easy.

Step Functions uses state machines, which allow you to define your workflow as individual tasks called “States”. Each state has a “Type” which determines the kind of work it performs, including:

  • Task state (do some work)
  • Choice state (make a choice between branches of execution)
  • Parallel state (begin parallel branches of execution)
  • and more

The configuration of a state machine is written in Amazon States Language, a JSON-based, structured language used to define each of your states. The following example, from the AWS Step Functions developer guide, shows a state named HelloWorld that executes an AWS Lambda function.

"HelloWorld": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:0000000000000:function:HelloFunction",
"Next": "AfterHelloWorldState",
"Comment": "Run the HelloWorld Lambda function"
}

Within each state definition you will need to specify the following:

  • The name of the state
  • The ‘Type’ of state (as described above)
  • The type of resource you want to use. This can be invoking a Lambda function, creating a SageMaker training job, etc.
  • The name of the state that comes next. This is what allows you to chain states together (the final state sets "End": true instead of "Next")
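
Other fields depend on the state’s type. The branching logic described earlier, for instance, is expressed with a Choice state. Below is a minimal sketch (the state names and the $.retrain input variable are hypothetical, not from our pipeline) which routes execution down a retraining or an inference branch:

"should_retrain_choice": {
  "Type": "Choice",
  "Comment": "Route to retraining or inference based on a boolean in the input",
  "Choices": [
    {
      "Variable": "$.retrain",
      "BooleanEquals": true,
      "Next": "retrain_model"
    }
  ],
  "Default": "run_inference"
}

A Parallel state is defined in a similar way, with a "Branches" array of sub-workflows (each with its own "StartAt" and "States") that run concurrently.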

Then, as you create more states and chain them together, you can always refer to the visual representation of your workflow to see how it works. Below is an example of a simple workflow we developed.

{
  "StartAt": "generate_uuid",
  "States": {
    "generate_uuid": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:0000000000000:function:generate_uuid",
      "ResultPath": "$.run_metadata",
      "Next": "set_config"
    },
    "set_config": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:0000000000000:function:set_config",
      "ResultPath": "$.config",
      "Next": "prepare_feature_set_correlation"
    },
    "prepare_feature_set_correlation": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:0000000000000:function:prepare_feature_set_correlation",
      "ResultPath": "$.run_metadata",
      "Next": "run_register_task_correlation"
    },
    "run_register_task_correlation": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:0000000000000:function:run_register_task_correlation",
      "ResultPath": "$.run_metadata",
      "End": true
    }
  }
}

This config defines how the workflow runs across the multiple Lambda functions. The state machine starts from the ‘generate_uuid’ state, which runs a Lambda function, and ends after the ‘run_register_task_correlation’ state, which also runs a Lambda function.
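
One detail worth calling out is the "ResultPath" field, which controls where each Lambda’s return value is inserted into the JSON document flowing between states, rather than having the output replace the input entirely. As a hypothetical illustration, if the execution started with an empty input {} and generate_uuid returned {"uuid": "1234"}, then with "ResultPath": "$.run_metadata" the set_config state would receive:

{
  "run_metadata": {
    "uuid": "1234"
  }
}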

Plugging this config into Step Functions produces a lovely graphic of your pipeline, shown below.

Workflow visualisation provided by the Step Functions console.

With this graphical view of the pipeline, any state that succeeds lights up green, whereas failed states light up red. This makes it easy to go straight to the relevant error messages and find out what’s gone wrong. Cool, right?
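
Errors can also be handled by the state machine itself, rather than leaving every failure to the Lambda code. As a minimal sketch (the retry values and the notify_failure state are assumptions, not part of the workflow above), a Task state can retry transient failures and catch whatever remains:

"prepare_feature_set_correlation": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:0000000000000:function:prepare_feature_set_correlation",
  "Comment": "Retry transient failures, then fall back to a notification state",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 5,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "notify_failure"
    }
  ],
  "ResultPath": "$.run_metadata",
  "Next": "run_register_task_correlation"
}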

Although Step Functions has improved our ways of working and helped us manage our pipelines, it does come with some limitations:

  • State machines have to run from the beginning each time. This means that if the workflow fails at any step, you can’t restart it from that step. This becomes annoying when one of the first steps is a time-consuming ETL step.
  • Amazon States Language has a bit of a learning curve and can be a deterrent for engineers who are more used to something like Airflow.
  • Step Functions integration is currently limited to certain AWS services.
  • Like all other AWS services, Step Functions has service limits. Make sure to check these before deciding whether Step Functions is the best tool for your use case.

Data has become a focal part of our business, so it is now essential to productionise the exploitation of that data. This enables the decision makers of the business to do their jobs faster, and with more confidence. The ADSML team always looks to re-evaluate how we deliver our solutions, and the introduction of Step Functions is a great example of that. As we continue to develop data pipelines with an orchestra of services, Step Functions will continue to be the conductor.


Thanks to Christopher Sidebottom

Written by Fatlum Vranovci

Data Engineer working at Sainsbury’s Tech, as part of the Applied Data Science and Machine Learning team.
