Since its introduction at the AWS re:Invent conference in late 2016, Step Functions has changed the way organisations execute a wide variety of tasks. I will claim up front that this is a simple product, and to give some objective evidence for that claim we will look briefly at the problems Step Functions exists to solve.
To understand Step Functions, we need to look closely at the limitations of AWS Lambda, stateless functions living in the cloud. Lambda was a groundbreaking service that changed the way we think when designing solutions: the 2014 announcement of stateless, event-driven functions was a genuinely mind-expanding idea, and I suspect most of you would agree.
As Lambda adoption grew, we came to realise that very few apps have only one function, one entry point, one module or one component. It became typical to see multiple functions in the cloud working together to solve hard real-life problems. In fact, the majority of Lambda solutions nowadays look something like this:
You’ll notice that the arrows on the diagram above differ in colour. That’s because there is more than one way for Lambda functions to interact with each other. This is pretty much a real-world application: a database is used for storing data, and orchestration is handled by queueing and messaging technology such as SQS.
Over the last couple of years, solution architects around the globe have struggled to meet requirements like the ones listed below:
- “I want to sequence functions”
- “I want to run functions in parallel”
- “I want to select functions based on data”
- “I want to retry functions a number of times before failing”
- “I want try/catch/finally to handle all error scenarios and avoid failure”
- “I have code that runs for hours; how can I incorporate this into my solution?”
There are a few ways to achieve this. One is method dispatch, where all the Lambda functions live inside one big binary that coordinates the execution of the individual functions. Another is to chain Lambda functions together using API calls so they execute sequentially. However, error handling in these scenarios is tricky, so we often introduce a database to manage logging and stash the state of each function. This is probably the most popular way of building such apps today, but all of the code that writes to the database has to be developed and maintained by you. Another cloud-native approach is to use a queue to orchestrate Lambda functions, but this requires maintaining the SQS queues and can quickly become an overhead in any project (Link: Error handling in AWS Lambda triggered by SQS events).
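To make the method-dispatch approach concrete, here is a minimal sketch in Python; all step names and handlers are hypothetical, and a real deployment would carry far more bookkeeping:

```python
# Sketch of the "method dispatch" pattern: a single deployed handler routes
# each incoming event to one of the individual step functions bundled with
# it. All names here are hypothetical.
def validate(payload):
    return {**payload, "valid": True}

def transform(payload):
    return {**payload, "transformed": True}

STEPS = {
    "validate": validate,
    "transform": transform,
}

def handler(event, context=None):
    # The one "big binary" coordinates execution by dispatching on the
    # step name carried in the event.
    return STEPS[event["step"]](event["payload"])
```

The drawback is visible even at this tiny scale: sequencing, retries and state tracking all live in your own code rather than in the platform.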
All of these scenarios have led AWS to develop Step Functions, which addresses these issues and more. It’s a fully managed service that makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Step Functions is a reliable way to connect and step through a series of AWS services so that you can build and run multi-step applications in a matter of minutes. You can create, run and debug cloud state machines to execute parallel, sequential and branching steps of your application, with automatic catch and retry conditions.
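To give a feel for what a state machine looks like, here is a minimal sketch of a sequential workflow in the Amazon States Language, built as a Python dict; the state names and ARNs are hypothetical placeholders:

```python
import json

# A minimal two-step workflow in the Amazon States Language (ASL).
# The function ARNs below are hypothetical placeholders.
definition = {
    "Comment": "A simple sequential order pipeline (illustrative only).",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
            "Next": "ChargeCustomer"
        },
        "ChargeCustomer": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCustomer",
            "End": True
        }
    }
}

# The definition is sent to Step Functions as a JSON string, for example
# via the CreateStateMachine API call.
print(json.dumps(definition, indent=2))
```

The output of each Task state becomes the input of the next, so the sequencing that previously needed a database or a queue is expressed in a few lines of declaration.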
In a nutshell, Step Functions can help you develop low-maintenance solutions that:
- Scale out
- Don’t lose state
- Deal with errors/timeouts
- Are easy to build and operate
- Are auditable
Some early adopters of the service, such as Foodpanda, who participated in beta testing of Step Functions before the official release, say the service has been crucial in streamlining their workflows: it makes it easier to change and iterate on the application workflow of their delivery service, and ultimately to optimise operations and significantly improve delivery times. Most importantly, they emphasise that Step Functions’ ability to scale dynamically was instrumental in managing spikes in customer orders and meeting demand.
With AWS Step Functions, you can replace a manual updating process with an automated series of steps, including built-in retry conditions and error handling. Another early adopter, The Take, states that it was easy to build a multi-step product updating system that ensures their database and website always have the latest price and availability information.
Every service has downtime. Sometimes, when consuming RESTful services, we run into transient errors that resolve themselves when we call the service again. I’m sure many of you have experienced this or similar scenarios. To handle this, Step Functions offers component retries, which let you set the number of times a function should be retried before it moves to a failed state. Quite a handy option to have in applications with lots of dependencies.
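A retry policy is declared directly on the task state. The sketch below shows a hypothetical Task state (the function name and ARN are placeholders) that retries transient failures with exponential backoff:

```python
# Hypothetical Task state with a Retry policy: transient failures are
# retried up to three times with exponential backoff before the state
# is finally marked as failed.
call_pricing_api = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CallPricingApi",
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,   # wait before the first retry
            "MaxAttempts": 3,       # retries before giving up
            "BackoffRate": 2.0      # double the interval on each attempt
        }
    ],
    "End": True
}
```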
Another important aspect of Step Functions is support for a try/catch/finally approach, allowing applications to handle various errors. Step Functions accomplishes this with defined “Plan B” actions that are triggered when a function throws an exception during execution. AWS Step Functions lets you set timeouts and specific retry clauses, and there is also a catch clause to handle any other unexpected errors. This is a quite powerful feature, and it takes just a few lines in the state machine definition.
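The sketch below combines a timeout, a retry clause and two Catch clauses on one hypothetical Task state: a named error routes to a “Plan B” state, while the `States.ALL` catch-all handles anything unexpected (all state and error names are placeholders):

```python
# Hypothetical Task state with timeout, retry, and Catch clauses.
# "PlanB" and "NotifyFailure" would be other states in the same machine.
process_order = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",
    "TimeoutSeconds": 300,                     # fail the state after 5 minutes
    "Retry": [
        {"ErrorEquals": ["States.Timeout"], "MaxAttempts": 2}
    ],
    "Catch": [
        # A known, expected error gets its own "Plan B" handler.
        {"ErrorEquals": ["OrderValidationError"], "Next": "PlanB"},
        # The catch-all clause handles any other unexpected error.
        {"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}
    ],
    "Next": "Fulfilment"
}
```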
Another benefit of using Step Functions is parallel execution of processes, where, for example, the requirement is to send data to three predictive models at the same time and go with the one that has the highest accuracy.
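That scenario maps onto a Parallel state. In this hypothetical sketch (model names and ARNs are placeholders), three branches score the same input at once and a follow-up state picks the winner:

```python
# Hypothetical Parallel state: fan the same input out to three model
# scorers concurrently. The Parallel state outputs a list with one result
# per branch, which the next state can reduce to the most accurate model.
score_models = {
    "Type": "Parallel",
    "Branches": [
        {"StartAt": "ModelA", "States": {"ModelA": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ModelA",
            "End": True}}},
        {"StartAt": "ModelB", "States": {"ModelB": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ModelB",
            "End": True}}},
        {"StartAt": "ModelC", "States": {"ModelC": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ModelC",
            "End": True}}},
    ],
    "Next": "PickBestModel"   # hypothetical state that selects by accuracy
}
```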
When we have tasks that take a long time to complete, we can use a feature called activity tasks. This concept was introduced to handle situations where a piece of code needs to run on an Amazon EC2 instance or even outside the AWS ecosystem (e.g. on on-premises hardware). It works by registering an activity under an ARN and referencing it from the state machine; a worker process polls for the task, does the work, and feeds its output back to the state machine on completion.
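In the state machine definition, an activity task looks like any other Task state except that its Resource is an activity ARN rather than a Lambda ARN. A hedged sketch, with a hypothetical activity name and illustrative timeouts:

```python
# Hypothetical activity task state. The Resource is an activity ARN, not a
# Lambda: a worker (on EC2 or on-premises) polls Step Functions with
# GetActivityTask, does the long-running work, and reports the result back
# with SendTaskSuccess or SendTaskFailure.
long_running_job = {
    "Type": "Task",
    "Resource": "arn:aws:states:us-east-1:123456789012:activity:NightlyEtl",
    "TimeoutSeconds": 14400,    # fail the state if no result within 4 hours
    "HeartbeatSeconds": 600,    # worker must heartbeat every 10 minutes
    "End": True
}
```

The heartbeat is what lets the state machine distinguish a slow worker from a dead one, which matters for jobs measured in hours.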
Considering the points outlined above, it’s clearly very easy to use Step Functions as a platform to link your services together and orchestrate their execution, not only because the service was built with flexibility in mind, but also because it takes away the burden of maintaining a complex platform like this yourself.
Now, alongside all the good features Step Functions offers, there are a few things I noticed that could be improved:
- Resuming a state machine from an arbitrary state — this seems like something AWS should have included at launch, as there are many reasons why we might want to resume a process from a failed state rather than re-run it from the beginning (e.g. long-running ETL jobs, activity tasks running on Amazon EC2 instances, etc.).
- You can only keep the last 90 days of execution history logs. This could become an issue in business-critical processes where history must be retained for audit purposes and for monitoring production environments.
- The limit of 25,000 entries in an execution history is relatively high; however, in big organisations where ETL jobs and processes are very complex, it could become a problem. The only way around it is to continue the work as a new execution.
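The continue-as-new-execution workaround can be sketched as a Choice state that watches an iteration counter; when the counter nears the limit, a Lambda starts a fresh execution of the same state machine. All state and function names here are hypothetical:

```python
# Sketch of the workaround for the execution-history limit: loop over work
# chunks, and before the history grows too large, hand off to a Lambda that
# calls StartExecution on the same state machine with the remaining work.
loop_definition = {
    "StartAt": "ProcessChunk",
    "States": {
        "ProcessChunk": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessChunk",
            "Next": "CheckIteration"
        },
        "CheckIteration": {
            "Type": "Choice",
            "Choices": [
                # "$.iteration" is assumed to be incremented by ProcessChunk.
                {"Variable": "$.iteration",
                 "NumericLessThan": 10000,
                 "Next": "ProcessChunk"}
            ],
            "Default": "StartNewExecution"
        },
        "StartNewExecution": {
            "Type": "Task",
            # Hypothetical Lambda that starts a fresh execution of this
            # same state machine, passing along the remaining work.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:StartNewExecution",
            "End": True
        }
    }
}
```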
I would definitely encourage you to go through all the limitations listed on the official AWS documentation page to avoid surprises when you start implementing your workflow. Also, make use of the FAQs to help you decide whether AWS Step Functions meets your requirements:
Link: AWS Documentation & FAQs
AWS case study — Customer feedback sentiment analysis
To get more familiar with AWS Step Functions, a colleague and I built a working example of the kind of conditional, potentially complex data pipeline that can be achieved serverlessly with AWS Step Functions and its integration with other AWS products such as AWS Lambda. Click the link to read more:
Find out more about Servian’s data and analytics capabilities here.