Provisioning Trial Environments With Step Functions

Gavin Cornwell
3 min read · May 9, 2020


We have a “Try Now” capability on our corporate website allowing prospective customers to try our Enterprise product for free for 14 days.

This has been successfully powered by a Serverless back-end for the past 3 years, but a few issues need addressing. One of them is the time it takes to provision a new trial.

When a trial is requested a message is added to an SQS queue. A scheduled CloudWatch Event invokes a Lambda function every minute to process messages on the queue (SQS triggers weren’t available when the original system was implemented). The Lambda function finds an available “warm” environment, allocates it to the trial request and then places a message on another SQS queue.

A second scheduled CloudWatch Event triggers another Lambda function every minute to process the second SQS queue. This function sets up the trial environment with the user’s details and informs our marketing system that the trial is ready.

Because both stages rely on polling, it can take almost 2 minutes to provision a trial environment, which is too slow. Continuous polling is also wasteful, as the queues are empty most of the time. Lastly, troubleshooting is harder than it should be when things go wrong, as several log files have to be examined (the system also pre-dates AWS X-Ray).
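To make the "almost 2 minutes" figure concrete, here is a rough back-of-the-envelope sketch. It assumes the two one-minute schedules are not synchronised with each other and ignores the time the Lambda functions themselves take to run; the function name is made up for illustration.

```python
POLL_INTERVAL_SECONDS = 60  # each scheduled CloudWatch Event fires once a minute
POLLED_STAGES = 2           # two queues, each drained by its own scheduled Lambda


def polling_wait_seconds(interval, stages):
    """Return the worst-case and average time a request spends waiting for pollers."""
    return {
        # Worst case: the message lands just after a poll at every stage.
        "worst_case": interval * stages,
        # Average: assumes arrival is uniformly spread within each polling interval.
        "average": interval / 2 * stages,
    }


waits = polling_wait_seconds(POLL_INTERVAL_SECONDS, POLLED_STAGES)
print(waits)  # {'worst_case': 120, 'average': 60.0}
```

Under these assumptions a request waits about a minute on average and up to two minutes in the worst case, before any real work is counted.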

Orchestrating this process felt like a perfect use case for Step Functions.

Figure 1 below shows a successful execution of the state machine we defined for provisioning a trial.

Figure 1.
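For readers who cannot see the figure, the overall shape of such a state machine can be sketched as a Python dictionary that serialises to Amazon States Language JSON. This is only an outline under assumptions: the state names, the function names and the placeholder ARNs are hypothetical, and the DynamoDB task's Parameters are left out for brevity. It is not our actual definition.

```python
import json

# Placeholder ARN prefix and hypothetical state names, for illustration only.
LAMBDA = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"


def task(resource, next_state=None, catch_to=None):
    """Build a Task state that optionally routes any error to catch_to."""
    state = {"Type": "Task", "Resource": resource}
    if catch_to:
        state["Catch"] = [{"ErrorEquals": ["States.ALL"], "Next": catch_to}]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True  # terminal state
    return state


definition = {
    "StartAt": "AllocateTrial",
    "States": {
        "AllocateTrial": task(LAMBDA + "allocate-trial", "SetupTrial", "HandleError"),
        "SetupTrial": task(LAMBDA + "setup-trial", "NotifyMarketing", "HandleError"),
        "NotifyMarketing": task(LAMBDA + "notify-marketing", "RecordOutcome", "HandleError"),
        "HandleError": {"Type": "Pass", "Next": "RecordOutcome"},
        # Parameters for the DynamoDB update are omitted in this outline.
        "RecordOutcome": task("arn:aws:states:::dynamodb:updateItem", "SendEvent"),
        "SendEvent": task(LAMBDA + "send-event"),
    },
}


def dangling_targets(defn):
    """Return transition targets that do not name a defined state."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets - set(states)


print(json.dumps(definition, indent=2))
```

The small dangling_targets check is a cheap way to catch a mistyped state name before deploying.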

The first step calls a Lambda function to find and allocate a warm trial environment. During busy periods there is a small chance the warm pool is empty. To handle this case we take advantage of the retry and back-off capabilities, as shown in figure 2 below.

Figure 2.

If a warm trial environment cannot be found after three attempts, the Catch block takes effect and transitions the state machine down the error handling path.
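A retry-then-catch configuration of this kind might look roughly like the sketch below, written as a Python dictionary that serialises to Amazon States Language. The Retry and Catch field names are the standard ones, but the error name, interval, back-off rate, state names and ARN are assumptions chosen for illustration, not the values from our definition.

```python
import json

allocate_trial_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:allocate-trial",  # placeholder ARN
    "Retry": [
        {
            "ErrorEquals": ["NoWarmEnvironmentError"],  # hypothetical error raised by the Lambda
            "IntervalSeconds": 5,   # assumed wait before the first retry
            "MaxAttempts": 3,       # number of retries after the initial attempt
            "BackoffRate": 2.0,     # each wait is this many times the previous one
        }
    ],
    "Catch": [
        # Once retries are exhausted, keep the error details and take the error path.
        {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleError"}
    ],
    "Next": "SetupTrial",  # hypothetical next state
}


def retry_delays(retrier):
    """Seconds waited before each retry: IntervalSeconds * BackoffRate ** n."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** n
        for n in range(retrier["MaxAttempts"])
    ]


print(retry_delays(allocate_trial_state["Retry"][0]))  # [5.0, 10.0, 20.0]
print(json.dumps(allocate_trial_state, indent=2))
```

Note that MaxAttempts counts retries, so a value of 3 allows up to four invocations in total; set it to 2 if three invocations overall is what you want.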

We use the same approach when calling the third-party marketing system API to build in some resilience.

The final stage of the state machine is to update the DynamoDB table and send an event to EventBridge (more on that in a future story) to communicate the outcome of the previous tasks.

We leverage the DynamoDB service integration to update the database, meaning one less Lambda function to implement and maintain. Unfortunately, a service integration for EventBridge does not exist yet, so we’ve had to implement a Lambda function to do that for now. We’re looking forward to the day when we can remove this function too, maybe at re:Invent 2020?
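As a rough illustration of the DynamoDB service integration, a Task state can call updateItem directly with no Lambda function in between. In the sketch below the resource ARN is the standard one for this integration, while the table name, attribute names, input paths and next state are hypothetical.

```python
import json

update_trial_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:updateItem",  # DynamoDB service integration
    "Parameters": {
        "TableName": "Trials",  # hypothetical table name
        # A key ending in ".$" takes its value from the state input at that path.
        "Key": {"trialId": {"S.$": "$.trialId"}},
        "UpdateExpression": "SET #status = :status",
        # "status" is a DynamoDB reserved word, so it is aliased here.
        "ExpressionAttributeNames": {"#status": "status"},
        "ExpressionAttributeValues": {":status": {"S.$": "$.outcome"}},
    },
    "ResultPath": None,   # serialises to null: discard the result, pass the input through
    "Next": "SendEvent",  # hypothetical next state
}

print(json.dumps(update_trial_state, indent=2))
```

Setting ResultPath to null keeps the original input flowing to the next state, which is convenient when a later step still needs the trial details.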

Probably my favourite reason for using Step Functions is the ability to easily troubleshoot issues when things do go wrong. The visualisation quickly shows where the process failed, and the event history shows exactly what happened and provides convenient access to the relevant log files, as shown in figure 3 below.

Figure 3.

You can find a redacted version of the whole state machine definition here.

Gavin Cornwell

Cloud Architect at Complete Technology Group with a passion for Serverless technologies.