Gousto Recipes — Building Pipelines with Step Functions

Gousto is on a mission to become the UK’s favourite way to eat dinner. Customers select from an ever-changing range of recipes every week and receive a box of fresh ingredients on their doorstep, along with easy-to-follow recipe cards. The sheer number of UK families placing their trust in Gousto every week means we need a supply chain supported by a reliable tech stack, and AWS technologies are one of the most important ingredients of our tech products.

We have recently adopted AWS Step Functions, a serverless orchestrator that makes it easier to coordinate different AWS services and that manages error handling, retry logic and state, removing a huge operational burden from engineers. I work in Pickles (all squads here have veggie names), and in this post I will present some use cases in which we have been using AWS Step Functions.

A Recipe for Disaster

In Pickles, we have a suite of dedicated products that drive intelligent decisions in our supply chain by understanding the what, when, and where of demand; they became collectively known as the Combination Forecast. Although these products depend heavily on one another, until not so long ago they were treated as individual services, each executed on its own schedule. The diagram below describes some of these products and how they interact with each other.

The architecture (or the lack of it) was a set of processes executed without a central orchestrator: each component ran in its own Lambda function on a schedule, with a bit of time buffer before the next one started. In the best-case scenario, each component completed successfully before the next one was due, and the data flowed through all components until the Publisher delivered it to downstream services and stakeholders.

Most of the time, this architecture worked just fine. However, we faced some challenges: there was no retry logic for when a component failed due to a temporary issue with the database; monitoring and alerting were repetitive to set up, as we had to add them to each component individually; and debugging issues was definitely a nightmare, because investigating a root cause sometimes meant going over the logs of every component one by one. Yep, this was a recipe for disaster!

Step Functions to the Rescue

Step Functions is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications. Essentially, a Step Functions workflow is a state machine containing states that can perform actions, make decisions based on their input, and pass output to other states. In this section, I will describe how we have been using Step Functions in our projects.

AWS Services Orchestration

Our suite of products was a set of Lambda functions running on their own schedules with no central orchestrator. Step Functions helps us sequence Lambda functions in a specific order and coordinate the data flow between them. It also provides a UI where we can see the workflow of the pipeline and how each component of the architecture interacts with the others. The picture below shows a simplified version of the Combination Forecast, now referred to as the Order Simulation Forecast pipeline.

On the same page as the diagram above, we can also find the execution event history with its states, inputs and outputs. Note that there are even links to the Lambda functions and CloudWatch logs, which are super helpful for debugging.

Another benefit of having a central orchestrator is that monitoring becomes an easy task. Step Functions offers metrics related to the execution of the workflow and its integration with Lambda functions and other services. You can find much more on this topic in Monitoring Step Functions Using CloudWatch.

Branching

Depending on the day of the week, we might want to run some states for only one of the factories. Using a Choice state, we can tell Step Functions to make decisions based on the state’s input. For example, on Fridays the pipeline should run Network Configuration only for Factory A, while on Mondays it should do the same for Factory B.
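As a sketch of the idea (the state and field names here are illustrative, not our real definitions), such routing in Amazon States Language might look like:

```json
"RouteNetworkConfiguration": {
  "Type": "Choice",
  "Choices": [
    {
      "And": [
        { "Variable": "$.day_of_week", "StringEquals": "Friday" },
        { "Variable": "$.factory", "StringEquals": "A" }
      ],
      "Next": "NetworkConfiguration"
    },
    {
      "And": [
        { "Variable": "$.day_of_week", "StringEquals": "Monday" },
        { "Variable": "$.factory", "StringEquals": "B" }
      ],
      "Next": "NetworkConfiguration"
    }
  ],
  "Default": "SkipNetworkConfiguration"
}
```

The Choice state inspects its JSON input and jumps to the first matching branch, falling back to the Default state when nothing matches.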

Also, we have been using branching to ship new versions of Lambda functions behind feature flags, allowing us to test the execution of the state machine with different versions of a component without the need to deploy new code.

Error Handling

Distributed systems are composed of multiple components located on different machines, and one thing we know for sure is that temporary, transient failures will occur when these components try to talk to each other. These failures can have different causes, such as lost network connections, temporarily unavailable resources and exceeded response times.

Step Functions provides Retry and Catch statements that enable us to implement retry patterns and make our systems more fault-tolerant and resilient. With Retry statements, we can retry a failed state every X seconds until the state machine reaches a maximum number of attempts.
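For illustration only (the ARN, state names and values below are made up), a Retry statement attached to a Task state looks like this:

```json
"LoadForecastData": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:load-forecast-data",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 10,
      "MaxAttempts": 5,
      "BackoffRate": 2.0
    }
  ],
  "Next": "TrainModel"
}
```

When the task raises a matching error, Step Functions waits IntervalSeconds, re-runs the state, and multiplies the wait by BackoffRate on each subsequent attempt until MaxAttempts is exhausted.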

Catch statements are an option when we want a fallback mechanism that finishes the execution of the workflow gracefully if one of the states fails or does not produce the expected output. I will not go into much detail on these features here, since you can find plenty in Error Handling in Step Functions.
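A hypothetical example of such a fallback (again, the names are illustrative): a Catch statement that routes any unhandled error to a notification state instead of failing the whole execution.

```json
"PublishForecast": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:publish-forecast",
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "NotifyFailure"
    }
  ],
  "End": true
}
```

ResultPath keeps the error details alongside the original input, so the fallback state can report what went wrong.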

We have been using Retry statements to handle lengthy states in the Order Volume project, which predicts order volume for upcoming weeks. This project uses AWS Forecast API endpoints to create and train models, and it is formed by states that can take up to 2 hours to finish. The code snippet below shows a simplified version of the creation of an AWS Forecast resource.
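As a minimal sketch in Python (the handler signature, event fields and exception name here are assumptions rather than our real code, and boto3’s actual create_predictor call requires extra configuration such as the forecast horizon and input data config, omitted for brevity):

```python
class ResourcePending(Exception):
    """Raised while the AWS Forecast resource is still being created,
    signalling Step Functions to retry this state later."""


def handler(event, context=None, forecast=None):
    """Create a predictor if not done previously, then check its status."""
    if forecast is None:  # default to the real client outside of tests
        import boto3
        forecast = boto3.client("forecast")

    arn = event.get("predictor_arn")
    if arn is None:
        # First attempt: submit the creation request.
        # (The real call also needs ForecastHorizon, InputDataConfig, etc.,
        # and checks whether the resource already exists.)
        response = forecast.create_predictor(PredictorName=event["predictor_name"])
        arn = response["PredictorArn"]

    status = forecast.describe_predictor(PredictorArn=arn)["Status"]
    if status != "ACTIVE":
        # Still pending: fail the state so the Retry statement fires again.
        raise ResourcePending(f"predictor {arn} is {status}")

    return {"predictor_arn": arn}
```

The matching state definition pairs this Lambda with a Retry statement on the ResourcePending error, using IntervalSeconds of 1 and MaxAttempts of 100.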

As we can see, the Lambda function submits a request to create a predictor if that was not done previously, and then checks its status. If the operation is still pending, it raises a ResourcePending exception. Based on the state definition’s Retry statement, Step Functions will execute the state again after 1 second, up to 100 times, whenever ResourcePending is raised. Once the resource is finally created, the state machine continues to the next state.

Parallel Processing

Some Lambda functions take longer than others to execute, and it is quite useful to run some states in parallel to speed up the execution time of the entire pipeline. In the Order Volume Forecast project, we have been taking advantage of the Parallel state to create AWS Forecast resources in parallel; as a reminder from the previous section, they can take up to two hours to be created!
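A hedged sketch of that pattern (branch and function names are illustrative): a Parallel state runs its branches concurrently and only moves on once every branch has finished.

```json
"CreateForecastResources": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "CreatePredictor",
      "States": {
        "CreatePredictor": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:create-predictor",
          "End": true
        }
      }
    },
    {
      "StartAt": "CreateForecast",
      "States": {
        "CreateForecast": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:create-forecast",
          "End": true
        }
      }
    }
  ],
  "Next": "Publish"
}
```

The outputs of all branches are collected into a single array and passed to the next state.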

Another use case is loading data from a database in batches. As we can see in the diagram below, there are different instances of a Lambda function running in parallel, each fetching data from the databases. Previously, the loop to fetch data ran inside a single Lambda function and the process could take up to 8 minutes; now, it takes about 2 minutes.
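One way to express this fan-out, sketched here with made-up names rather than our real definition, is a Map state that invokes a loader Lambda once per batch with capped concurrency:

```json
"LoadDataInBatches": {
  "Type": "Map",
  "ItemsPath": "$.batches",
  "MaxConcurrency": 4,
  "Iterator": {
    "StartAt": "FetchBatch",
    "States": {
      "FetchBatch": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:fetch-batch",
        "End": true
      }
    }
  },
  "Next": "Transform"
}
```

Each element of the array at ItemsPath becomes the input of one iteration, so the batching loop moves out of the Lambda and into the state machine.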

Conclusion

As our systems become more complex, the complexity of managing them also grows. Step Functions helps us to shape business-critical applications. We have been using Step Functions in different use cases at Gousto involving sequencing, branching, error handling and retry logic, which removes a significant burden from our engineers and gives us more time to focus on what matters: becoming the UK’s favourite way to have dinner.

We have been spending a fair bit of time with Step Functions and enjoying the service. What about you? I would love to hear how other engineers and companies have been using Step Functions.

Gousto Engineering & Data
