Gousto Recipes — Building Pipelines with Step Functions

Gousto is on a mission to become the UK’s favourite way to eat dinner. Essentially, customers can select from an ever-changing range of recipes every week and receive fresh ingredients in a box on their doorstep with easy-to-follow recipes. The sheer number of UK families placing their trust in Gousto every week means we need to build a supply chain that is supported by a reliable tech stack, and AWS technologies are one of the most important ingredients of our tech products.

We have recently adopted AWS Step Functions, a serverless orchestrator that makes it easier to coordinate different AWS services and manages error handling, retry logic and states, removing a huge operational burden from engineers. I work in Pickles (all squads here have veggie names) and I will present some use cases where we have been using AWS Step Functions.

A Recipe for Disaster

The architecture (or the lack of it) was a set of processes executed without a central orchestrator. Each component ran in its own Lambda function at a scheduled time, with a bit of buffer between one execution and the next. In the best-case scenario, each component finished successfully before the next one started, and the data flowed through all components until the Publisher delivered it to downstream services and stakeholders.

Most of the time, this architecture worked just fine. However, we faced some challenges. It was not obvious how to implement retry logic when a component failed because of a temporary issue with the database. Monitoring and alerting were repetitive work, since we had to add them to each component individually. And debugging was definitely a nightmare, because we sometimes had to go through the logs component by component to find the root cause of a problem. Yep, this was a recipe for disaster!

Step Functions to the Rescue

AWS Services Orchestration

On the same page as the diagram above, we can also find the execution event history with its states, inputs and outputs. There are even links to the Lambda functions and CloudWatch logs, which are super helpful for debugging.

Another benefit of having a central orchestrator is that monitoring becomes an easy task. Step Functions offers metrics on the execution of the workflow and on its integrations with Lambda functions and other services. You can find plenty of detail on how to monitor your state machines in Monitoring Step Functions Using CloudWatch.
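To give a feel for how little glue this needs, here is a minimal sketch that builds the arguments for a CloudWatch alarm on failed executions of a state machine. The alarm name and the two ARNs are placeholders, not values from our stack; `AWS/States` and `ExecutionsFailed` are the namespace and metric that Step Functions publishes.

```python
def failed_execution_alarm(state_machine_arn, sns_topic_arn):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": "pipeline-executions-failed",  # hypothetical name
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no executions means no alarm
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(state_machine_arn, sns_topic_arn):
    """Create the alarm; needs boto3 and AWS credentials at call time."""
    import boto3  # imported here so the builder above has no AWS dependency

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **failed_execution_alarm(state_machine_arn, sns_topic_arn)
    )
```

A single alarm like this covers the whole workflow, instead of one alarm per component.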

Branching

Also, we have been using branching to ship new versions of Lambda functions behind feature flags, allowing us to test the execution of the state machine with different versions of a component without the need to deploy new code.
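As a rough sketch of the idea, the states below use a Choice state to route between two versions of a Lambda function. It assumes the feature flag arrives in the execution input at `$.flags.use_new_version`; the state names, that path and the function ARNs are all made up for illustration.

```python
import json


def versioned_states(flag_path="$.flags.use_new_version"):
    """Return Amazon States Language states that branch on a feature flag."""
    return {
        "ChooseVersion": {
            "Type": "Choice",
            "Choices": [
                {
                    # Check IsPresent first: referencing a missing path
                    # in a Choice rule fails the execution.
                    "And": [
                        {"Variable": flag_path, "IsPresent": True},
                        {"Variable": flag_path, "BooleanEquals": True},
                    ],
                    "Next": "ComponentV2",
                }
            ],
            "Default": "ComponentV1",
        },
        "ComponentV1": {
            "Type": "Task",
            # Placeholder ARN pointing at a hypothetical "v1" alias
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:component:v1",
            "Next": "Publish",
        },
        "ComponentV2": {
            "Type": "Task",
            # Placeholder ARN pointing at a hypothetical "v2" alias
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:component:v2",
            "Next": "Publish",
        },
    }


if __name__ == "__main__":
    print(json.dumps(versioned_states(), indent=2))
```

Because the flag is part of the input, switching versions is a matter of starting the execution with a different payload.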

Error Handling

Step Functions provides Retry and Catch statements that enable us to implement retry patterns and make our systems more fault-tolerant and resilient. With Retry statements, a failing state is retried after a configurable interval (IntervalSeconds, multiplied by BackoffRate on each further attempt) until it succeeds or reaches the maximum number of attempts (MaxAttempts).

Catch statements are an option when we want a fallback mechanism that finishes the execution of the workflow gracefully if one of the states fails or does not produce the expected output. I will not go into much detail on these features, since you can find plenty in Error Handling in Step Functions.
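To make the two statements concrete, here is a small sketch of a Task state with both a Retry and a Catch, plus a helper that works out the wait before each retry. The state names and the function ARN are hypothetical; the field names and the defaults in the helper (interval 1 second, 3 attempts, backoff rate 2.0) follow the Amazon States Language.

```python
task_with_fallback = {
    "LoadOrders": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:load-orders",
        "Retry": [
            {
                "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }
        ],
        "Catch": [
            {
                # Runs only once the retries above are exhausted
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",  # keep the input, attach the error
                "Next": "NotifyFailure",
            }
        ],
        "Next": "Publish",
    }
}


def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0):
    """Seconds waited before each retry: interval * backoff_rate ** n."""
    return [
        interval_seconds * backoff_rate ** attempt
        for attempt in range(max_attempts)
    ]
```

With the values above, `retry_delays(2, 3, 2.0)` gives waits of 2, 4 and 8 seconds before the Catch takes over.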

We have been using Retry statements to handle lengthy states in the Order Volume project, which predicts order volume for upcoming weeks. This project uses Amazon Forecast API endpoints to create and train models, and it is made up of states that can take up to 2 hours to finish. The code snippet below shows a simplified version of the creation of an Amazon Forecast resource.

As we can see, the Lambda function submits a request to create a predictor if that has not been done already, and then checks its status. If the operation is still pending, it raises a ResourcePending exception. Based on the state definition with retries, Step Functions executes the state again after 1 second, up to 100 times, whenever ResourcePending is raised. Once the resource is finally created, the state machine continues to the next state.
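The pattern described above can be sketched roughly as follows. This is an illustration under assumptions rather than our production code: `ensure_predictor`, the `predictor_params` event key, the next state name and the function ARN are invented for the example, and pagination of `list_predictors` is ignored. The Forecast client is passed in so the logic can be exercised without AWS.

```python
class ResourcePending(Exception):
    """Raised while the predictor is still being created, so the state is retried."""


def ensure_predictor(forecast, predictor_params):
    """Request the predictor once, then report whether it is ready."""
    name = predictor_params["PredictorName"]
    summaries = forecast.list_predictors()["Predictors"]  # first page only
    existing = next((p for p in summaries if p["PredictorName"] == name), None)

    if existing is None:
        # First execution of the state: submit the creation request.
        forecast.create_predictor(**predictor_params)
        raise ResourcePending(f"Predictor {name} requested")

    status = existing["Status"]
    if status == "ACTIVE":
        return {"PredictorArn": existing["PredictorArn"]}
    if status.endswith("FAILED"):
        # Not retried: the Retry below only matches ResourcePending.
        raise RuntimeError(f"Predictor {name} ended in status {status}")
    raise ResourcePending(f"Predictor {name} is {status}")


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    return ensure_predictor(boto3.client("forecast"), event["predictor_params"])


create_predictor_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:create-predictor",
    "Retry": [
        {
            # For Python Lambdas the error name is the exception class name.
            "ErrorEquals": ["ResourcePending"],
            "IntervalSeconds": 1,
            "MaxAttempts": 100,
            # BackoffRate is left at its default of 2.0; the value used
            # in the real project is not stated here.
        }
    ],
    "Next": "CreateForecast",
}
```

Since every retry re-runs the state with the same input, the function has to rediscover the resource on each call (here by name) rather than rely on anything remembered from the previous attempt.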

Parallel Processing

Another use case is loading data from a database in batches. As we can see in the diagram below, different instances of a Lambda function run in parallel and fetch data from the databases. Previously, the loop that fetched the data lived inside a single Lambda function and the process could take up to 8 minutes. Now, it takes about 2 minutes.
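One way to get this kind of fan-out is a Map state that runs the same Lambda function once per batch. The sketch below assumes that approach, which may differ from the exact state type in our workflow; the batch shape, state names, concurrency limit and function ARN are illustrative only.

```python
def make_batches(first_id, last_id, batch_size):
    """Split an inclusive id range into inclusive {start, end} batches."""
    return [
        {"start": start, "end": min(start + batch_size - 1, last_id)}
        for start in range(first_id, last_id + 1, batch_size)
    ]


fetch_in_parallel = {
    "Type": "Map",
    "ItemsPath": "$.batches",  # list produced by make_batches
    "MaxConcurrency": 4,       # cap on simultaneous database readers
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "FetchBatch",
        "States": {
            "FetchBatch": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:fetch-batch",
                "End": True,
            }
        },
    },
    "Next": "CombineResults",
}
```

For example, `make_batches(1, 10, 4)` yields three batches covering ids 1 to 4, 5 to 8 and 9 to 10, and each one becomes the input of a separate `FetchBatch` invocation.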

Conclusion

We have been spending a fair bit of time with Step Functions and enjoying the service. What about you? I would love to hear how other engineers and companies have been using Step Functions.

Gousto Engineering & Data
