Step Functions Express Workflows

Synchronous processes to ingest large volumes of data in AWS

Ross Rhodes
AVM Consulting Blog
6 min read · Jan 28, 2020


Since their first launch in November 2016, Step Functions have proved to be a valuable resource for the orchestration of both simple and complex processes in AWS. They offer an abundance of great features — supporting synchronous tasks, asynchronous callbacks, and activities on EC2 and ECS instances.

Permitting over 2,000 execution starts per second, Step Functions have sufficed for the majority of use cases. However, you may have wondered: how could we manage workflows for greater volumes of data? And regardless of our approach, how could we minimise costs if we’re processing so many events?

AWS addresses these questions with one of its latest feature additions to Step Functions: express workflows. Intended for large-volume data ingestion, express workflows can start over 100,000 executions per second. Limited to five minutes per execution, compared to the standard counterpart’s upper bound of one year, they are designed for short-lived, synchronous processes — therefore not supporting callbacks or activities.

Configuration

Working with the AWS Cloud Development Kit (v1.22 at the time of writing), let’s provision an express workflow almost exactly the same way we would define a standard one. The only additional requirement is the stateMachineType property, set to StateMachineType.EXPRESS, in the StateMachine construct.
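To make this concrete, below is a minimal TypeScript sketch against CDK v1. The stack name, construct IDs, and the placeholder Pass states are assumptions made for illustration; swap in the task states your ingestion actually needs.

```typescript
import * as cdk from '@aws-cdk/core';
import * as sfn from '@aws-cdk/aws-stepfunctions';

export class ExpressWorkflowStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Placeholder definition: two Pass states chained together.
    // Replace these with real Task states for your workload.
    const definition = new sfn.Pass(this, 'Ingest')
      .next(new sfn.Pass(this, 'Transform'));

    // Defined exactly like a standard workflow, with the one addition
    // of the EXPRESS state machine type.
    new sfn.StateMachine(this, 'ExpressStateMachine', {
      definition,
      stateMachineType: sfn.StateMachineType.EXPRESS,
      timeout: cdk.Duration.minutes(5),
    });
  }
}
```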

Unlike standard workflows, we do not have execution diagrams at our disposal to easily monitor and debug processes on the AWS console. Instead, we rely solely on CloudWatch logs to trace execution behaviour.

Whilst logging is enabled by default for express workflows created on the console, the same does not apply to the AWS CLI and SDKs. Therefore, we need to configure logging manually if we use either CloudFormation or the CDK for deployment.

Unfortunately, CDK’s StateMachine construct does not yet support logging configuration for Step Functions. This leaves us with two options: configure logging using CDK’s CfnStateMachine construct, derived directly from CloudFormation, or apply logging after deployment using the CLI.

The former presents its own challenges: we would no longer be able to define the state machine using CDK’s constructs. Instead, we would need to provide the full definition as a JSON string, which becomes tedious and difficult to maintain as the state machine grows in complexity.
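For illustration, here is a rough sketch of what that CfnStateMachine route could look like inside the stack. The construct IDs, the trivial single-state definition, and the log group are hypothetical, and the property names simply mirror the underlying CloudFormation resource.

```typescript
import * as iam from '@aws-cdk/aws-iam';
import * as logs from '@aws-cdk/aws-logs';
import * as sfn from '@aws-cdk/aws-stepfunctions';

// Within the stack constructor: a role Step Functions can assume,
// a log group, and the state machine defined as a raw Amazon States
// Language JSON string.
const role = new iam.Role(this, 'ExpressWorkflowRole', {
  assumedBy: new iam.ServicePrincipal('states.amazonaws.com'),
});
const logGroup = new logs.LogGroup(this, 'ExpressWorkflowLogs');

new sfn.CfnStateMachine(this, 'ExpressStateMachineCfn', {
  stateMachineType: 'EXPRESS',
  roleArn: role.roleArn,
  // The full definition lives in a single string: the maintenance
  // burden described above.
  definitionString: JSON.stringify({
    StartAt: 'Ingest',
    States: { Ingest: { Type: 'Pass', End: true } },
  }),
  loggingConfiguration: {
    level: 'ALL',
    includeExecutionData: true,
    destinations: [
      { cloudWatchLogsLogGroup: { logGroupArn: logGroup.logGroupArn } },
    ],
  },
});
```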

Resorting to the CLI eliminates the pain of this transition, so long as we provision a log group and the required Identity and Access Management (IAM) permissions as part of the stack.
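As a sketch of those prerequisites, assuming stateMachine holds the StateMachine construct from the earlier example, we might add a log group and attach the CloudWatch Logs delivery permissions that AWS documents for Step Functions logging.

```typescript
import * as iam from '@aws-cdk/aws-iam';
import * as logs from '@aws-cdk/aws-logs';

// Log group the express workflow will write its execution logs to.
const logGroup = new logs.LogGroup(this, 'ExpressWorkflowLogGroup', {
  retention: logs.RetentionDays.ONE_WEEK,
});

// CloudWatch Logs delivery permissions required by Step Functions.
// These actions are only supported against the '*' resource.
// stateMachine is the StateMachine defined in the earlier sketch.
stateMachine.addToRolePolicy(new iam.PolicyStatement({
  actions: [
    'logs:CreateLogDelivery',
    'logs:GetLogDelivery',
    'logs:UpdateLogDelivery',
    'logs:DeleteLogDelivery',
    'logs:ListLogDeliveries',
    'logs:PutResourcePolicy',
    'logs:DescribeResourcePolicies',
    'logs:DescribeLogGroups',
  ],
  resources: ['*'],
}));
```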

Now we can run the update-state-machine CLI command to enable logging, providing the log group and state machine ARNs. Using the logging configuration flag, we can set the logging level to ALL, ERROR, FATAL, or OFF. Bear in mind this is a new flag, so you may need to update your AWS CLI before the command will work.
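As an illustration, the command might look something like the following; the account, region, and resource names in the ARNs are placeholders.

```bash
aws stepfunctions update-state-machine \
  --state-machine-arn "arn:aws:states:eu-west-1:123456789012:stateMachine:ExpressStateMachine" \
  --logging-configuration '{
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [{
      "cloudWatchLogsLogGroup": {
        "logGroupArn": "arn:aws:logs:eu-west-1:123456789012:log-group:ExpressWorkflowLogGroup:*"
      }
    }]
  }'
```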

Usage

With the express workflow deployed and logging configured, we’re ready to execute the state machine. Without any state diagrams, there’s little to show on this front beyond a collection of logs. We start in the Step Functions console, clicking “Start execution” in the top-right corner.

This prompts us to enter an execution name and an input payload for the workflow, both of which are optional. Clicking “Start execution” again in the bottom-right corner, we return to the console, and that’s us! Rather anticlimactic, right? The lack of state diagrams removes all the fun from this.
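If you’d rather skip the console, the same execution can be started from the CLI; the ARN, execution name, and input below are placeholders.

```bash
aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:eu-west-1:123456789012:stateMachine:ExpressStateMachine" \
  --name "example-execution" \
  --input '{"source": "example"}'
```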

Under the “Logging” tab for the state machine, we can follow a link to the CloudWatch log group for this workflow. Like other log groups, this one is divided into a collection of streams. We have an initial validation stream auto-generated by AWS — “log_stream_created_by_aws” — followed by a sequence of streams taking the form states/<name>/<datetime>/<id>.

Each of these latter streams maps to one execution of the express workflow, where the date-time is rounded down to the last multiple of five minutes — meaning an execution that runs at, say, 9:14am will have a date-time ending 09-10. It’s not immediately clear how to map these streams to the executions they log — the ID at the end appears to be randomly generated.

Opening up one of the log streams, we’re presented with a collection of logs. Each of them carries a timestamp and a message, the latter of which contains a unique ID.

Scrutinising these logs further, we see additional details on the current state of our workflow. These include details that vary from log to log, the execution ARN mapping the log stream to a particular execution, and a previous event ID. This ID allows us to trace the execution back to the previous state, chaining together to guide us through the full workflow back-to-front.

Less than ideal, you may argue, if you want to analyse the flow from start to finish. Furthermore, streams present the first logs at the top of the page, so you need to scroll through any number of logs to start your analysis at the bottom of the page. A next_event_id attribute or equivalent to chain from start to finish would be a welcome feature to resolve this inconvenience.

That aside, the logs provide sufficient detail on input and output data, resources such as Lambda involved in the workflow, and any failures encountered. This allows us to confirm the execution worked as expected, or debug any errors that may have been encountered.

Costs

To quote Jeff Barr directly from his blog post introducing express workflows:

While the pricing models are not directly comparable, Express Workflows will be far more cost-effective at scale.

“At scale” is somewhat vague — presumably well above the 2,000-per-second limit imposed for starting executions on standard workflows. It would be great to compare an express workflow to its standard counterpart with high-volume loads to grasp an understanding of the savings. If you’ve done this, I would be keen to hear your results! Pricing for each of these workflow types is explained in detail in AWS’ pricing documentation.

Conclusion

A promising new feature from AWS with plenty of potential for further improvement. Express workflows will prove particularly useful for quick, high volume processes — keeping costs down where we would have previously resorted to the standard workflow.

It’s a shame to see execution diagrams are not provided for this flow type. Like standard workflows, it’s possible to define complex processes making choices based on the data given. Without a diagram to easily visualise what’s going on step by step, we must tediously trace logs backward to understand the full behaviour of the execution.

Focusing on the logs, it would be great to easily map log streams to the executions they record, and to have logging enabled by default for both the CLI and the SDKs. The final icing on the cake would be a next_event_id attribute in the logs so we can trace executions in both directions, easily alternating back and forth if desired.

Update (10/02/20): AWS offer a definition tab to visualise express Step Functions state machine definitions. Whilst this isn’t quite as good as the execution diagrams offered by standard Step Functions, it’s a step in the right direction!

