Migrating to AWS Step Functions

Royson D'Silva
5 min read · Feb 11, 2022

I’ve been working as a freelancer with Manu Rana at Gather6 for the past few months. In this brief time I had the opportunity to work on the entire stack. Before we dive into the migration, let me give a brief overview of the stack used at Gather6 and the problem we hoped to solve by migrating to AWS Step Functions.

The Stack

The frontend is a React web app, while the backend is a GraphQL API provided by Hasura. Hasura makes it easy to generate the GraphQL API directly from the Postgres database without having to write code. It also lets us listen to changes in the database and handle them via webhooks using Event Triggers. We had Event Triggers for booking seats after payment confirmation from Stripe, and for handling analytics & notifications with CleverTap. In our case all the Hasura event triggers invoked a single webhook API, which triggered a Lambda function with a long switch case that in turn dispatched to the various handlers. Something like the example below:

index.ts
folder structure
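
The embedded snippets are not reproduced here, so the following is only a minimal sketch of what such a single-entry handler could look like. The names handle1Event1 and handle2Event1 come from the text; the trigger names ("event1", "event2"), handle1Event2, the simplified payload type and the calls log are assumptions made for illustration.

```typescript
// Simplified shape of a Hasura event trigger payload (only the fields used here).
interface HasuraEvent {
  trigger: { name: string };
  event: { data: { old: unknown; new: unknown } };
}

// Records which handlers ran; exists only to make the sketch observable.
const calls: string[] = [];

// Hypothetical handlers. In a real app these would book seats, send notifications, etc.
async function handle1Event1(_payload: HasuraEvent): Promise<void> {
  calls.push("handle1Event1");
}
async function handle2Event1(_payload: HasuraEvent): Promise<void> {
  calls.push("handle2Event1");
}
async function handle1Event2(_payload: HasuraEvent): Promise<void> {
  calls.push("handle1Event2");
}

// Single webhook entry point: one switch over the trigger name.
async function webhook(payload: HasuraEvent): Promise<{ statusCode: number }> {
  try {
    switch (payload.trigger.name) {
      case "event1": // assumed trigger name
        await handle1Event1(payload);
        await handle2Event1(payload);
        break;
      case "event2": // assumed trigger name
        await handle1Event2(payload);
        break;
      default:
        return { statusCode: 400 };
    }
    return { statusCode: 200 };
  } catch {
    // Any handler error surfaces to Hasura as a 500.
    return { statusCode: 500 };
  }
}
```

Every new trigger adds another case, and every case chains its handlers inside the same try block, which is what makes the error handling coarse.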

The Problem

If any handler throws an error, the webhook returns a status code of 500. Hasura retries the webhook on receiving an error, according to the trigger's retry configuration. The problem with this was that at times a particular handler would get called more than once. Example: in the above code, if handle2Event1 throws an error, handle1Event1 is called again on every retry. There is also no way of calling only handle2Event1 after the error has been corrected.
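
To make the duplicate calls concrete, here is a small self-contained simulation. It is not Hasura's actual retry logic: deliverWithRetries, the failuresLeft counter and the runs tally are hypothetical stand-ins for "the webhook is redelivered until it returns 200".

```typescript
// Tally of how many times each handler has run for one logical event.
const runs = { handle1Event1: 0, handle2Event1: 0 };

// Hypothetical: handle2Event1 fails this many times before it starts succeeding.
let failuresLeft = 2;

function handle1Event1(): void {
  runs.handle1Event1 += 1;
}

function handle2Event1(): void {
  runs.handle2Event1 += 1;
  if (failuresLeft > 0) {
    failuresLeft -= 1;
    throw new Error("handle2Event1 failed");
  }
}

// One webhook delivery: both handlers run in order, any error becomes a 500.
function deliver(): number {
  try {
    handle1Event1();
    handle2Event1();
    return 200;
  } catch {
    return 500;
  }
}

// Simulated redelivery: deliver once, then retry up to maxRetries times on failure.
function deliverWithRetries(maxRetries: number): number {
  let status = deliver();
  for (let attempt = 0; attempt < maxRetries && status !== 200; attempt++) {
    status = deliver();
  }
  return status;
}
```

Calling deliverWithRetries(3) eventually returns 200, but by then runs.handle1Event1 is 3: the handler that never failed has run three times for a single event.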

The second problem with the above implementation is managing the dependency flow. Certain functions could be called in parallel, while others depended on each other and needed to be executed in series. Managing this workflow in a single handler would become difficult as the application scales.

The Solution

To solve the repeated retries from Hasura, the solution was to receive the data on the webhook, immediately push it to a queue, and respond with status code 200 to let Hasura know that we have received the data on our end.

The next problem was being able to re-run an individual handler after it throws an error. AWS Step Functions to the rescue.

AWS Step Functions allows us to create state machines. Each state can invoke a Lambda, and we can see the success/failure status of each function on the AWS console. State machines also provide states that can be executed either in series or in parallel, and let us define error states and a retry mechanism for each state. The console also shows a graphical representation of the states, which helps us visualise the workflow as the application scales.

The Migration

The first step was to push the data received from Hasura to a queue. The obvious choice would have been AWS SQS. The hiccup was that SQS is not able to invoke Step Functions directly and would need a Lambda to ingest the data and then invoke the state machine. Step Functions can only be invoked directly from a specific set of AWS services.

AWS EventBridge was the next best alternative for us, so we decided to go with it. The webhook, on receiving data, now pushes it to an event bus. The event bus in turn invokes the state machine.
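
A rough, dependency-free sketch of the new webhook follows. The entry fields (EventBusName, Source, DetailType, Detail) match the EventBridge PutEvents API, but everything else is an assumption: the Publish callback stands in for a real client call (for example PutEventsCommand from @aws-sdk/client-eventbridge), and the bus name "hasura-event-bus" and source "hasura.events" are made up for illustration.

```typescript
// Fields of one entry as accepted by the EventBridge PutEvents API.
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string; // JSON-encoded event body
}

// Hypothetical abstraction over the EventBridge client, injected so the sketch runs anywhere.
type Publish = (entries: PutEventsEntry[]) => Promise<void>;

// Simplified Hasura event trigger payload (only the fields used here).
interface HasuraEvent {
  id: string;
  trigger: { name: string };
  event: { data: { old: unknown; new: unknown } };
}

// Map a Hasura payload to a single event bus entry.
function toEntry(payload: HasuraEvent, busName: string): PutEventsEntry {
  return {
    EventBusName: busName,
    Source: "hasura.events", // assumed source name; a bus rule would match on it
    DetailType: payload.trigger.name, // lets a rule route each trigger to its state machine
    Detail: JSON.stringify(payload),
  };
}

// Webhook: no business logic, just hand the payload to the bus and acknowledge.
async function webhook(
  body: string,
  publish: Publish,
  busName = "hasura-event-bus", // assumed bus name
): Promise<{ statusCode: number }> {
  try {
    const payload = JSON.parse(body) as HasuraEvent;
    await publish([toEntry(payload, busName)]);
    return { statusCode: 200 };
  } catch {
    // No handler has run at this point, so a redelivery here does not repeat any handler work.
    return { statusCode: 500 };
  }
}
```

Because the webhook only forwards the payload, a failure inside a handler can no longer cause Hasura to redeliver the event.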

Since each state was to invoke a different Lambda, the next step was to create all the different Lambda handlers. One approach would be to create a separate module for each handler and use Lambda Layers to share the dependencies. The other was to keep the folder structure the same and, in the SAM template.yml file, give every function the same CodeUri with a relative path to its Handler. Example:

template.yml

The advantage of the latter approach is that it required very little code change, and since SAM uses a hash based on the CodeUri, it produces a single Lambda build.

Using the SAM State Machine definition and Amazon States Language (ASL), we define the step function. Example:

stateMachine.asl.json
graph representation of state machine
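
The original definition is not reproduced here. As a stand-in, the sketch below writes a small ASL definition as a TypeScript object and serialises it to the JSON that a stateMachine.asl.json file would hold. The workflow itself is hypothetical: the state names, the Handle3Event1 branch, the retry numbers and the ${...Arn} placeholders (which assume the SAM template supplies matching DefinitionSubstitutions) are illustrative only.

```typescript
// Hypothetical workflow: one task in series, then two tasks in parallel.
const definition = {
  Comment: "Illustrative workflow for event1 (not the original definition)",
  StartAt: "Handle1Event1",
  States: {
    Handle1Event1: {
      Type: "Task",
      Resource: "${Handle1Event1Arn}", // placeholder resolved by SAM DefinitionSubstitutions
      // Retry only this state, so earlier states are not repeated.
      Retry: [
        { ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, MaxAttempts: 2, BackoffRate: 2 },
      ],
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "Failed" }],
      Next: "AfterFirstHandler",
    },
    AfterFirstHandler: {
      Type: "Parallel", // independent handlers run side by side
      Branches: [
        {
          StartAt: "Handle2Event1",
          States: {
            Handle2Event1: { Type: "Task", Resource: "${Handle2Event1Arn}", End: true },
          },
        },
        {
          StartAt: "Handle3Event1",
          States: {
            Handle3Event1: { Type: "Task", Resource: "${Handle3Event1Arn}", End: true },
          },
        },
      ],
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "Failed" }],
      End: true,
    },
    Failed: {
      Type: "Fail",
      Error: "HandlerError",
      Cause: "A handler failed after its retries were exhausted",
    },
  },
};

// JSON text that could be saved as stateMachine.asl.json.
const aslJson: string = JSON.stringify(definition, null, 2);
```

Series versus parallel execution, retries and the error state all live in the definition, so the handlers themselves stay free of orchestration logic.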

The AWS doc "Testing Step Functions State Machines Locally" describes a neat way to verify state machines in a local environment without having to deploy to AWS. We haven’t tested this out personally, but it should mostly work.

The final phase of the migration was writing integration tests (using Jest) against our existing stack and then running the same tests against the new stack to verify that the behaviour was unchanged.

Pros:

  • Each Lambda handler follows the Single-Responsibility Principle. This results in unit-testable handlers.
  • Tracking errors is easier, and failed states can be quickly viewed and re-executed from the AWS console.
  • After deploying a fix, we can invoke the Lambda with the input that caused the error.

Cons:

  • We were not able to run the EventBridge & Step Functions integration locally. Hence any change to the code needs to be verified with unit tests or by deploying it to AWS. LocalStack promises to solve this problem, but we are yet to try it.

And that’s all I learnt during this process :)

Royson D'Silva

Working as a freelancer + building things in spare time (both mostly in software). List of my personal work — droyson.xyz