Creating Scalable Solutions with Dynamic Parallelism

Wondering how to create a scalable solution using Step Functions? You’re in the right place!

Step Functions is a serverless orchestrator that makes it easier to sequence different AWS services into business-critical applications. It provides quite a bit of convenient functionality: automatic retry handling, orchestration, monitoring and alerting. On Gousto Recipes — Building Pipelines with Step Functions, you can see how we have been using Step Functions at Gousto.

In this blog post, we will go into more details on how to take advantage of Dynamic Parallelism (some people might know that as Fan-out).

What problem were we trying to solve?

Site Specific Forecast pipeline is one of the key products of the Supply chain and predicts how much ingredients and products each factory should have in stock for the next 2 weeks.

As Gousto grows and opens more factories in different parts of the UK, the Planning team needs to plan what to buy for each factory way more in advance, therefore there is a business requirement to forecast the upcoming 4 weeks.

How was the architecture?

You can see below a simplified version of the pipeline:

  • Configurator: setup variables for an execution such as weeks to forecast, order simulator constraints and factory caps.
  • Data Extractor: load data for each week required by Order Simulator and Site Specific Forecast.
  • Order Simulator: generate simulated orders for each week.
  • Site Specific Forecast: build a forecast with how much ingredients and products each factory will need.
  • Publisher: send an email to the Planning team with the forecast and provide data to downstream services.

In this architecture, each Lambda function was responsible for loading or making some computation for all weeks. For example, if we wanted to forecast for weeks A and B, Data Extractor would load data for both weeks, Order Simulator and Site Specific Forecast would perform some operations for A and B. And, finally, Publisher would send an email to the Planning team with corresponding forecasts.

This architecture worked pretty well for almost 1 year, however it is not scalable for mainly two reasons:

  1. The maximum execution duration of a Lambda function is 15 minutes. The Order Simulator was already taking 10 minutes to run for 2 weeks, so there was not much buffer to add an extra week.
  2. To forecast for an extra week would mean changing code of all components of the pipeline.

Map State to the Rescue

The Map state can be used to run a set of steps for each element of an input array. In our case, Configurator would configure the weeks to forecast and inject that into the Menu Week Map state which would execute the same steps for each week. It is important to highlight that this solution is scalable for N weeks and we do not need to change any code apart from the Configurator if we want to forecast more weeks.

Essentially, we had to make two changes to the pipeline:

  1. Each Lambda function would perform tasks over a single week
  2. Update the state machine to wrap up Data Extractor, Order Simulator and Site Specific Forecast within a Map state.

You can find below a snippet code with the state machine.

Let’s have a look at Menu Week Map state — for the sake of simplicity we are using Pass states:

  • ItemsPath: an array of elements (in our use case, that is a list of numbers coming from Configurator). The value is a reference path identifying where in the effective input the array field is found.
  • Parameters: You can define a set of variables that are accessible to all states within Iterator.
  • Iterator: where you define states of a Map, we have two simple rules: states inside an Iterator can only transition to another state defined inside the same Iterator, and no state outside the Iterator field can transition to a state within it.
  • MaxConcurrency: number that provides an upper bound on how many invocations of the Iterator may run in parallel.

I just covered the basics and I highly recommend you to visit Map — AWS Step Functions.

How long does it take to run?

In the previous architecture, the pipeline was taking about 10 minutes to get executed because each Lambda function was performing operations over 2 weeks.

Now, with Dynamic Parallelism in place, the pipeline takes half of the time and predicts 4 weeks! Actually, it would still take the same amount if we had to forecast N weeks because each Lambda function would perform operations over one single week concurrently.

Conclusion

We can build scalable solutions using Step Functions which provides dynamic parallelism out-of-the-box through Map state. In this blog post, we showed how we scaled Site Specific Forecast pipeline to forecast N weeks out instead of just two weeks in half of the time! Last but not least, we now have a more maintainable system where we just need to touch one component in case we want to forecast an extra week.

Interested to join us and build some interesting scalable solutions? Check out our openings - Gousto Jobs!

Gousto Engineering & Data

Gousto Engineering & Data Blog