Optimising AWS Step Functions

Lizzie Hard
Engineers @ The LEGO Group
8 min read · Jun 29, 2023
Photo by Kelly Sikkema on Unsplash

Working with Step Functions can be tricky: there are so many different ways to connect all the pieces, so how do you decide which is best for your scenario? Recently, I've been experimenting with the AWS CDK to create Step Functions, and I thought I'd share some of my learnings as well as a few code snippets.

The feature I’ve been working on is one where a user can register an activity and receive some points in return. When considering how to put this together it seemed sensible to orchestrate the different stages using a Step Function.

Build Fast

To start with, we broke down the problem into the basic steps needed to complete the journey. Ours consists of four parts:

  1. Check that the initial message received by the Step Function is as we expect it to be and contains all the information needed to execute the following tasks. In an event-driven environment, this message could come from an external source, which is why this step is useful (a small sketch of what such a validation Lambda might look like follows this list).
  2. Validate that the user is eligible to receive these points. In our case, this can be done via an API call to another service.
  3. Check this activity hasn’t already been registered by the user. We do a lookup against items in a DynamoDB table that has records of previously registered activities.
  4. Finally, if all these steps complete successfully, we send an event on to another service via EventBridge. Another service listens for these events and allocates the points, following a microservice architecture pattern.
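For context, here's a minimal sketch of what the payload-validation Lambda from Step 1 might look like. The field names (userId, activityId) and the error message are illustrative assumptions rather than our actual implementation.

// A hypothetical handler for Step 1; the field names are assumptions for illustration.
interface RegistrationEvent {
  userId?: string;
  activityId?: string;
}

export const handler = async (event: RegistrationEvent) => {
  // Fail fast if the incoming message is missing fields the later steps rely on.
  if (!event.userId || !event.activityId) {
    throw new Error('Invalid payload: userId and activityId are required');
  }

  // Pass the validated fields on to the next state.
  return { userId: event.userId, activityId: event.activityId };
};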

Taking all those points into consideration, the first draft could look something like this. It's a simple representation of these steps, each implemented as a Lambda function, in a diagram from the Step Functions Workflow Studio in AWS.

A step function with 4 sequential lambda invoke steps

While this pattern may work, it's certainly not optimal, and there are plenty of improvements we can make. One thing to consider: does it always have to be Lambda functions that do all the work?

I’ve included a code example of this Step Function below using the AWS CDK with Typescript.

const validatePayload = new tasks.LambdaInvoke(stack, 'Validate Payload', {
  lambdaFunction: <Validate Payload Lambda>,
});

const validateUser = new tasks.LambdaInvoke(stack, 'Validate User is Eligible', {
  lambdaFunction: <Validate User Lambda>,
});

const validateActivity = new tasks.LambdaInvoke(stack, 'Check Activity is Valid', {
  lambdaFunction: <Validate Activity Lambda>,
});

const sendEvent = new tasks.LambdaInvoke(stack, 'Send Event', {
  lambdaFunction: <Send Event Lambda>,
});

new sfn.StateMachine(stack, 'mysf', {
  stateMachineName: 'mysf',
  definition: validatePayload
    .next(validateUser)
    .next(validateActivity)
    .next(sendEvent),
});

Direct Integrations

Step Functions offer a huge number of direct integrations that let you call AWS service API actions from your workflow. Sometimes it's easier and faster to skip wrapping your code in a Lambda and use a direct integration instead.

Step function with 4 sequential steps; 1, 2 & 4 are Lambdas, whereas 3 is a direct integration with DynamoDB.

In our case, we can use the integration with DynamoDB to do a Get Item request. To make this a dynamic lookup you can use the Step Function input as the primary key in your request which is illustrated in this AWS CDK code snippet.

const checkActivityId = new tasks.DynamoGetItem(
  stack,
  'Check if activity has been registered in DynamoDB Table',
  {
    key: {
      pk: tasks.DynamoAttributeValue.fromString(
        // The previous Lambda's result provides the activityId attribute
        sfn.JsonPath.stringAt('$.activityId')
      ),
    },
    // The table prop expects a DynamoDB table reference (ITable), not a string
    table: <Your DynamoDB Table>,
  }
);
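One thing to be aware of: a LambdaInvoke task wraps the function's result in a Payload attribute by default, so a path like $.activityId assumes the previous Lambda's result has been unwrapped (or returned in that shape). A minimal sketch of one way to unwrap it, using outputPath on the earlier task (payloadResponseOnly is an alternative option):

const validatePayload = new tasks.LambdaInvoke(stack, 'Validate Payload', {
  lambdaFunction: <Validate Payload Lambda>,
  // Pass only the Lambda's response on to the next state so that
  // later steps can reference fields such as $.activityId directly.
  outputPath: '$.Payload',
});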

Parallel and Choice States

The steps we currently have are simple and make sense in this order, but in reality, they don’t have to happen one after the other.

We probably want to keep Step 1 "Validate Payload" in place so that any malformed inputs don't get passed to the rest of the steps. However, Steps 2 and 3 could be done in parallel: they do not depend on each other, but both have to pass for us to carry on to Step 4. In other words, a user must be eligible and must not have registered this activity previously for us to be able to issue points.

Step Function Step with a Parallel state and 2 branches

A Step Function Parallel state waits for all branches to terminate before passing the state on to the next task. This is useful in our example: even if one branch is evaluated first, it will not automatically trigger the next step; it waits for the other branch to complete. Another useful characteristic is that if one step fails, the entire Parallel state is considered failed. This is the behaviour we want, as we need the outcome of both checks before moving on to the next step, and if either check fails we do not want to carry on.

Creating a parallel branch using the AWS CDK is relatively simple and you can add as many branches as you need.

const parallel = new sfn.Parallel(
  stack,
  'Validate Activity Id and User in parallel'
);

parallel.branch(checkActivityId);
parallel.branch(validateUser);
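If you also want runtime errors thrown inside either branch to route to a failure state, rather than simply failing the execution, one option is to add a catch to the Parallel state. This is a sketch that assumes a failedStatus Fail state like the one defined in the full example further down:

// Route any error thrown inside either branch to the Fail state,
// keeping the error details under $.error for debugging.
parallel.addCatch(failedStatus, {
  resultPath: '$.error',
});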

Evaluating output

This all sounds great so far but once both branches have been completed how do you evaluate the output of a parallel state?

The output will always be an array with one element per branch, each element being that branch's output.

Output: [ <output Branch1>, <output Branch2>]

At each step, you can specify what you want your output to be, so it could look something like the simplified version below. Matching the branch order above, the first object in the array shows whether the activity was found in the DynamoDB table (and therefore whether it has already been registered), and the second shows whether the user is eligible.

Output: [ { previouslyRegisteredActivity: false }, { isUserEligible: true } ]

We can use a Choice state to evaluate these. If both branches pass our checks, meaning the user is eligible and the activity has not already been registered, then we proceed to the Send Event Lambda. If either branch errors or a check fails, we stop the execution and don't send an event to the next service.

Choice states support a lot of comparison operators which are a powerful tool for evaluating varied or complex outputs.
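As a rough illustration of a few of those operators, the snippet below checks whether a field is present, compares a number, and combines conditions with or. The JSON paths and values here are hypothetical:

// A few example Choice conditions; the paths and values are hypothetical.
const exampleCondition = sfn.Condition.or(
  // True when the DynamoDB branch returned no Item at all.
  sfn.Condition.isNotPresent('$[0].Item'),
  // Numeric comparison against a field from the Lambda branch output.
  sfn.Condition.numberGreaterThanEquals('$[1].Payload.pointsBalance', 100),
  // Boolean check negated with not().
  sfn.Condition.not(
    sfn.Condition.booleanEquals('$[1].Payload.isUserEligible', false)
  )
);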

I've included an example of creating this workflow with the AWS CDK below. This shows one way of making your Step Function resource, but it's also possible to export the raw Amazon States Language (ASL) definition from the Workflow Studio and use that instead.

import {
  aws_stepfunctions as sfn,
  aws_stepfunctions_tasks as tasks,
  aws_dynamodb as db,
} from 'aws-cdk-lib';

// Lambda Invoke Tasks
const validatePayload = new tasks.LambdaInvoke(stack, 'Validate Payload', {
  lambdaFunction: <LambdaFunction>,
});

const validateUser = new tasks.LambdaInvoke(stack, 'Validate User', {
  lambdaFunction: <LambdaFunction>,
});

// Direct Integration with DynamoDB
const checkActivityId = new tasks.DynamoGetItem(
  stack,
  'Check if activity has been registered in DynamoDB Table',
  {
    key: {
      pk: tasks.DynamoAttributeValue.fromString(
        sfn.JsonPath.stringAt('$.activityId')
      ),
    },
    table: <Your DynamoDB Table>,
  }
);

// Create Parallel step with 2 branches
const parallel = new sfn.Parallel(
  stack,
  'Validate Activity Id and User in parallel'
);

parallel.branch(checkActivityId);
parallel.branch(validateUser);

const failedStatus = new sfn.Fail(stack, 'Failed Validation', {
  cause: 'User or activity Id validation failed',
  error: 'Job returned FAILED',
});

const successState = new sfn.Pass(stack, 'SuccessState');

// Create Choice state & evaluate the output of both branches
const choiceState = new sfn.Choice(stack, 'User Validated & Activity Checked?')
  .when(
    sfn.Condition.and(
      sfn.Condition.booleanEquals('$[0].previouslyRegisteredActivity', false),
      sfn.Condition.booleanEquals('$[1].Payload.isUserEligible', true)
    ),
    successState
  )
  .otherwise(failedStatus);

// Link all the steps together in a definition
const definition = validatePayload
  .next(parallel)
  .next(choiceState);

// Create State Machine
new sfn.StateMachine(stack, 'Your SF Name', {
  stateMachineName: 'your-sf-name',
  definition,
});

Further Improvements

There are a few more optimisations we could consider for this Step Function. For example, in Step 4 we could use a direct integration with EventBridge instead of a Lambda (a sketch of what that could look like follows the list below); however, there are a few reasons we might decide not to in this case.

  1. It’s not as easy to add timestamps in direct integrations.
  2. You have less control over the data, and it's harder to form a more complicated payload shape.
  3. If you are passing on Personally Identifiable Information (PII) then it’s better to have more control over the data and to be able to obfuscate any logging.
  4. Sometimes you want to do multiple things in one step and a Lambda could be useful for that. Although you may want to be wary of this and keep separation of concerns as much as possible.
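That said, if you did want to swap the Send Event Lambda for a direct integration, a sketch could look something like the one below. The event bus name, source, and detail shape are assumptions for illustration only.

import {
  aws_events as events,
  aws_stepfunctions as sfn,
  aws_stepfunctions_tasks as tasks,
} from 'aws-cdk-lib';

// A sketch of replacing the Send Event Lambda with an EventBridge direct integration.
// The bus name, source, and detail fields here are illustrative assumptions.
const sendEventDirect = new tasks.EventBridgePutEvents(stack, 'Send Points Event', {
  entries: [
    {
      eventBus: events.EventBus.fromEventBusName(stack, 'PointsBus', 'points-bus'),
      source: 'activity.registration',
      detailType: 'ActivityRegistered',
      detail: sfn.TaskInput.fromObject({
        // Pull values from the state input at runtime.
        userId: sfn.JsonPath.stringAt('$.userId'),
        activityId: sfn.JsonPath.stringAt('$.activityId'),
      }),
    },
  ],
});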

Express Step Functions

Step Functions come in two types, Express and Standard, which have several differences between them. In this case, it makes sense to use the Express type: the whole execution is expected to take less than five minutes, and the pricing structure means our costs would be reduced by switching to Express.

As ever, it isn't always that straightforward and there are trade-offs. These include the fact that logging isn't enabled by default for Express Step Functions; you have to enable it yourself when instantiating the resource in your code. Creating your own log groups can also cause you to hit the CloudWatch Logs resource policy size restriction, so you will need to prefix your log group names with "/aws/vendedlogs/" to get around this.

Express Workflows also do not support Run a Job (.sync) or Wait for Callback (.waitForTaskToken) service integration patterns that you may require for certain jobs.

Creating your logging group on an Express Step Function and side-stepping the resource policy size limit is relatively simple:

import { aws_stepfunctions as sfn, aws_logs as logs } from 'aws-cdk-lib';

const logGroup = new logs.LogGroup(this, 'my-sf-logs', {
  // The /aws/vendedlogs/ prefix avoids the CloudWatch Logs resource policy size limit
  logGroupName: `/aws/vendedlogs/<my-sf>`,
});

new sfn.StateMachine(this, 'mysf', {
  stateMachineName: 'my-sf',
  definition,
  logs: {
    destination: logGroup,
    level: sfn.LogLevel.ALL,
  },
  stateMachineType: sfn.StateMachineType.EXPRESS,
});

Conclusion

  • Using direct integrations might be better for some scenarios.
  • Utilise the other States such as parallel, map, and choice to make your Step Functions more efficient.
  • Express Step Functions might be good to consider as they will generally reduce costs.
  • Finally, using the Workflow Studio can be a good way to visualise your workflows and get started building your Step Functions.
