Using AWS Step Functions for microservice orchestration

Ruslan Gainutdinov
Jun 18, 2019


This is a follow-up blog article to my talk at the AWS User Group meetup on June 13, 2019.

Take a look at the slides for the talk.

Introduction

For distributed architectures and microservice applications, it is crucial to define logic that handles failures at the application level.

For example, if we have an application to order a travel package, we need to order everything or nothing:

  • Order hotel
  • Order flight
  • Order car

If an error happens, or if the flight or hotel is not available, we don't want to keep the rental car order; we need to cancel it at the application level (this is also called compensation logic).

Sagas

This is not a particularly recent problem: in 1987, a paper called Sagas was published (Garcia-Molina H., Salem K., Sagas, 1987, Princeton University, Princeton). It describes how, if you have a distributed architecture and can't use ACID business transactions, you must implement the logic at the application level and do either choreography or orchestration.

Choreography

One way of handling this is to implement the compensation logic at the application level.

For example, if we have already ordered a flight and can't order a hotel, we invoke a specific cancel-flight API endpoint to cancel the order altogether.

This can become complicated when the workflow for a product-level process gets more sophisticated.

Over time, several ways were found to do this in a more or less manageable fashion.

Queues or middleware (ESB)

We can use queues to connect different processing services and react to errors by placing failures in a different queue. We can also scale processing power by adding more consumers.

This concept is implemented in the Enterprise Service Bus (ESB) approach, which is widely used.

But using queues has some downsides, namely:

  • The sequence is defined at the queue (middleware) level and can be difficult to change
  • The workflow is out of the application's scope, and small, agile teams lose control and depend on centralized middleware/ESB to improve their applications
  • Tracking a single order (for example, a travel package) and everything executed for it is difficult, because you need to bring together queue-level logging, middleware logging, and microservice logging from different components

Client-side choreography

Another way is to implement a specific invocation sequence at the application level (web, mobile), for example by using state machines or redux-saga. This approach is pretty straightforward and actually a good fit for small teams and simple workflows.

Client-side choreography has some limitations, for example:

  • Exposes internal services, which might otherwise be hidden inside your infrastructure
  • Exposes the full order payloads that you carry between microservice invocations
  • In the case of a mobile application (and web, until the cache expires and the person reloads), the workflow is fixed in a specific version of the application and is not easy to update. You are at the mercy of the user updating the application from the App Store.
  • If the complexity of the workflow increases, the state machine might become difficult to read and change.

Backend for frontend

One popular approach is the backend-for-frontend pattern: product-level processes and complex logic are implemented in an intermediary layer, which calls other general-purpose microservices.

By implementing workflow logic in your language, you keep it close to your team and make it relatively easy to change.

This is a great pattern to use, but it has its downsides:

  • The workflow of service calls is not easily understood by all stakeholders
  • Retries and error handling can be difficult
  • A bad practice for serverless, because you effectively pay double when you call other services and wait for their responses

Orchestration

Another proposed way is to have a central “daemon”, or orchestrator, in your infrastructure, which handles the order workflow and makes requests to other general-purpose microservices.

It is paramount for this orchestrator to be reliable, scalable, and flexible enough for you to implement any workflow you want.

How is it done?

You invoke the orchestrator and ask it to process the travel package order, giving it the user's preferences for flight, hotel, and car. After that, the orchestrator handles the order on its own, invoking microservices in the order you have defined in the workflow.

By having a single entity handle workflows, you get transparency on what has been invoked, how much time each task takes to process, and, if there are any errors, what the error was. It can also handle retries and conditional logic in a streamlined way.

The good things about orchestration:

  • A single place to change and manage workflows
  • Transparency on the workflow execution
  • Easy to improve processes based on the stakeholder’s feedback
  • Data available to understand what microservices need to be scaled
  • Simplifies microservices: they do one thing, do it well, and don't need to know about other microservices
  • Optimization and cost estimation are easily done using the workflow history and data history

Cons to using orchestration:

  • Complexity increases, as there is another service (or SaaS) to manage
  • More difficult to test (you need to test the microservices, the workflow, and the integration of workflow and microservices). The same applies to the backend-for-frontend approach, but in that case it is a different framework, language, or service.
  • Another framework/library/service to learn

Transparency

One of the most valuable things about orchestration is the transparency of what is going on in a workflow.

By using visual workflow languages (such as BPMN or the workflow graphs generated by AWS Step Functions), you can be on the same page with all stakeholders, from CxOs to developers, and share the same vision of how your product works.

BPMN workflow model
AWS Step Functions workflow model

Transparency also means having access to analytics on how a workflow execution went, with the timing of all invocations, conditions, and parallel executions.

AWS Step Functions execution event history for the workflow instance

Transparency also makes your processes easier to optimize: you can use data and analytics to understand which part of the workflow is slow or unreliable, and add resources to it, improve it, or replace it.

Business errors, i.e., high-level events that impact your product, are also easier to separate from network and technical problems, so you can react to them differently.

AWS Step Functions

The Amazon Web Services cloud provides an implementation of a workflow engine running in the cloud. Called AWS Step Functions, it provides a way to define a workflow with steps that invoke the different AWS services available.

In AWS Step Functions, workflows are called state machines, and you deploy them to AWS using the UI, the CLI, or CloudFormation.

State machines

State machines are declared using JSON and contain a list of steps, which are invoked according to the logic defined for each step.

For this particular workflow, we sequentially call three AWS Lambda functions to order the hotel, the flight, and the rental car.

We can edit them online, and AWS will automatically generate the workflow graph.
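
As a rough sketch, the Amazon States Language definition for such a sequential workflow could look like the following (the state names and Lambda ARNs are hypothetical placeholders, not from a real deployment). The definition is plain JSON; it is shown here as a TypeScript constant so that it can be stringified and deployed from code.

```typescript
// A minimal Amazon States Language definition for the travel-package workflow.
// The Lambda ARNs below are hypothetical placeholders.
const travelPackageDefinition = {
  Comment: "Order hotel, flight and rental car sequentially",
  StartAt: "OrderHotel",
  States: {
    OrderHotel: {
      Type: "Task",
      Resource: "arn:aws:lambda:eu-west-1:123456789012:function:order-hotel",
      Next: "OrderFlight"
    },
    OrderFlight: {
      Type: "Task",
      Resource: "arn:aws:lambda:eu-west-1:123456789012:function:order-flight",
      Next: "OrderCar"
    },
    OrderCar: {
      Type: "Task",
      Resource: "arn:aws:lambda:eu-west-1:123456789012:function:order-car",
      End: true
    }
  }
};

// JSON.stringify(travelPackageDefinition) is what gets passed to Step Functions
// when the state machine is created (via the CLI, CloudFormation, or the SDK).
export const definitionJson = JSON.stringify(travelPackageDefinition, null, 2);
```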

Notable features

  • Automatic scaling
  • Pay per transition, with a forever-free tier of 4,000 state transitions per month
  • Process, invocation, and payload history
  • Edit online, via the CLI, or with CloudFormation; the Serverless Framework and Terraform also support Step Functions natively
  • IAM role to secure all invocations during workflow execution
  • Retries are integrated and activated by adding configuration to step definitions (see the sketch after this list)
  • You can react to different errors by specifying the error type and which task to call next if that error occurs
  • Conditional branching, using task output to decide which step to go to next
  • Run multiple tasks in parallel
  • Pass workflow execution details to any task so that it knows which workflow it is part of
  • Wait-for-callback tasks to integrate with external systems, polling tasks, or custom processing
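
To illustrate a few of these features, here is a minimal sketch of how Retry, Catch, and a Choice state look in a state machine definition. It is only a fragment, and the state names, the business error name, and the ARN are hypothetical placeholders.

```typescript
// Sketch of a Task state with retries and error handling, plus a Choice state.
// All names and the ARN are hypothetical placeholders.
export const orderFlightStates = {
  OrderFlight: {
    Type: "Task",
    Resource: "arn:aws:lambda:eu-west-1:123456789012:function:order-flight",
    // Retry transient failures with exponential backoff
    Retry: [
      {
        ErrorEquals: ["States.Timeout", "Lambda.ServiceException"],
        IntervalSeconds: 2,
        MaxAttempts: 3,
        BackoffRate: 2.0
      }
    ],
    // React to a business error by jumping to a compensation step
    Catch: [
      {
        ErrorEquals: ["FlightNotAvailable"],
        Next: "CancelHotel"
      }
    ],
    Next: "CheckFlightResult"
  },
  // Conditional branching based on the task output
  CheckFlightResult: {
    Type: "Choice",
    Choices: [
      {
        Variable: "$.flightConfirmed",
        BooleanEquals: true,
        Next: "OrderCar"
      }
    ],
    Default: "CancelHotel"
  }
};
```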

Supported AWS services

  • AWS Lambda — call function
  • AWS Batch — SubmitJob API
  • Amazon DynamoDB (update, put, get, delete)
  • Amazon ECS/Fargate (async/sync)
  • Amazon SNS — publish a message
  • Amazon SQS — send a message
  • Amazon SageMaker
  • Wait for callback (integrate external/3rd party/on-premises)

Asynchronous invocation

One of the key concepts is that when you start a workflow, it executes asynchronously, without you waiting for its completion. The completion event must be delivered either by one of the steps inside the workflow or by using CloudWatch Events.
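
For instance, starting an execution from TypeScript with the AWS SDK returns as soon as the execution is accepted and does not wait for the workflow to finish. This is only a sketch; the state machine ARN and the input payload are hypothetical.

```typescript
import { StepFunctions } from "aws-sdk";

const sfn = new StepFunctions();

export const startOrderWorkflow = async () => {
  // startExecution returns immediately with the execution ARN;
  // it does not wait for the workflow to complete.
  const result = await sfn
    .startExecution({
      // Hypothetical state machine ARN
      stateMachineArn: "arn:aws:states:eu-west-1:123456789012:stateMachine:travel-package",
      input: JSON.stringify({ hotel: "Hilton", flight: "AY123", car: "compact" })
    })
    .promise();

  // The execution ARN can be used later to look up status and history
  console.log("Started execution", result.executionArn);
};
```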

Example

To dig deeper into microservice orchestration, take a look at the example application, which uses the AWS Rekognition service to find the smile and face of a person in photos.

You can try it yourself by opening https://find-face.ruslan.org/ on your phone or in the browser.

The app is built using the Bulma UI framework with a static client-side React application; all services are implemented as serverless AWS Lambda functions using TypeScript.

The full source is available on GitHub at https://github.com/huksley/aws-detect-faces-workflow

Architecture

Example app architecture

In short, when you select or take a new photo, it gets uploaded to S3; after that, an S3 listener Lambda fires and launches a new workflow. The workflow starts the AWS Rekognition Lambda first to detect the smile and the face rectangle. If a face is found, it starts the Generate Thumbnail Lambda, which crops the image to the face and saves a new image to S3. After every step, the application uses Google Firebase Cloud Messaging to notify the app about the progress of the workflow.
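
As a simplified sketch (this is not the actual code from the repository; the environment variable name and input shape are assumptions), the S3 listener Lambda might look roughly like this in TypeScript:

```typescript
import { S3Event } from "aws-lambda";
import { StepFunctions } from "aws-sdk";

const sfn = new StepFunctions();

// Fired by S3 when a new photo is uploaded; starts the face-detection workflow.
export const handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    await sfn
      .startExecution({
        stateMachineArn: process.env.WORKFLOW_ARN!, // assumed environment variable
        input: JSON.stringify({ bucket, key })
      })
      .promise();
  }
};
```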

Workflow for uploading a profile photo

When you execute the workflow, it gets recorded in the execution history in AWS Step Functions, along with a visualization of the path the execution took.

Green is the path of tasks executed

You can also take a look at the full execution history, along with the time each step took and the input and output payloads for each step.

Workflow execution history

Workflow focused improvements

There are a lot of improvements that this example could have, for example:

  • Use AWS Rekognition celebrity detection and, if the user's profile is new, notify admins to check for a fake profile.
  • Implement a custom face recognition algorithm and test it in parallel, or do A/B testing for a small subset of users.
  • Implement sending a notification to the followers of the user.

By splitting the workflow and the microservices, you not only simplify your product architecture but also split the work needed for features into simpler chunks. For example, adding AWS Rekognition celebrity detection simply means improving the AWS Lambda to also include celebrity detection. After this is done, it is effortless to adjust the workflow and its conditions to send a notification to admins.

Orchestration is a product/team feature

By using orchestration, you gain flexibility on a higher level, which helps you:

  • Keep microservices architecture simple
  • Make product logic visible and flexible
  • Scale
  • Analyze and improve

About author

Ruslan Gainutdinov is a software architect and engineer with extensive experience in serverless, business process management, and leading development in cross-functional teams.

Github profile: https://github.com/huksley/
LinkedIn profile: https://www.linkedin.com/in/ruslanfg/
Follow me on Twitter where I post frequently about microservice orchestration: https://twitter.com/huksley_
