How to Tame Serverless Batch Processing with AWS Step Functions

Eyal
Melio’s R&D blog

Do you work in a Serverless architecture? Do you have batch processing workflows? Are you using AWS as your infrastructure? If your answer to all these questions is “Yes!”, then you may have run into the same hurdles that we have here at Melio.

Melio allows small business owners to pay their vendors and contractors, and vice-versa. As a payments platform, we must follow certain laws and regulations. One of these requires us to screen our users against sanctions lists, which we do in batches.

In this post you’ll read about our initial design using “traditional” orchestration solutions — such as Lambda and SQS — and about why and how we transitioned to AWS Step Functions. This was our first attempt at taming batch processing with Step Functions, and I’d like to share what we’ve learned from it.

Initial design

Our screening process involves matching our users’ data to public lists of sanctions and PEPs (Politically Exposed Persons). Since these lists are regularly updated, we need to repeatedly screen our users.

For legal reasons, we use a third party service to perform the matching logic for us. They provide us with an asynchronous API that accepts a batch of 10K users. However, this API is limited to one invocation at a time, and can take around an hour to complete a single request.

So in order to implement the screening process, we broke it down into three steps:

  1. Batcher: Query all our users and split them into batches
  2. Caller: One by one, invoke the third party service with the batch, and poll for the response
  3. Processor: Process the response, and update our database with the results
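The batching in step 1 can be sketched as a small helper. This is a minimal illustration, not our production code: the function name and the use of plain user IDs are assumptions, and only the 10K batch size comes from the third party's API limit.

```python
from typing import Iterator, List, Sequence

# The third party API accepts a batch of 10K users per request.
BATCH_SIZE = 10_000


def split_into_batches(
    user_ids: Sequence[str], batch_size: int = BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` user IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(user_ids), batch_size):
        yield list(user_ids[start:start + batch_size])


if __name__ == "__main__":
    # Hypothetical input: 25,000 user IDs produce three batches.
    users = [f"user-{i}" for i in range(25_000)]
    for index, batch in enumerate(split_into_batches(users)):
        print(index, len(batch))
```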

Here is a diagram of this design, based on SQS and Lambda as orchestration tools:

A bit complicated, wouldn’t you say? That’s because this design needs to overcome several challenges and pain points associated with the Serverless architecture. If you’ve followed our blog in the past, you may have read “How we built Melio’s payments platform on AWS Serverless” by Omer Baki that discusses some of these challenges.

4 main pain points

Our first challenge was how to throttle our requests to the third party service. Remember, we must send one request at a time, and wait for it to finish before sending the next one.

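Lambda provides a general solution for throttling, via its reserved concurrency. However, this solution is aimed at capping high concurrency numbers, and doesn’t work well with very low ones (1 in our case): with an SQS trigger, Lambda keeps polling the queue in parallel, so the extra invocations are throttled and their messages are retried instead of waiting neatly in line.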

The second challenge we faced was how to poll the requests to the third party service.

As you probably know, a Lambda function can run for 15 minutes at most, which is far less than the hour or so that a request to the third party service may take. Furthermore, using a Lambda function to wait for something to happen is not very cost-efficient.

The third challenge was how to make things visible, either when developing the workflow or during incident management.

Other than the diagram, there is nothing that visualizes the workflow, and documentation is often left outdated over time. During an incident, logs are usually the go-to, but they do not provide a clear method of pinpointing the step where the workflow failed. This is especially true once we include steps that run in parallel to the rest of the workflow, such as the processing of the result from the third party.

Lastly, there’s the challenge of how to cancel the workflow while it’s running, or restart it after a failure. There isn’t an easy way to do that with our initial solution.

Step Functions to the rescue

Step Functions provide us with solutions to the above pain points, by introducing something called “state machines”. A state machine is a series of steps that carries a context (list of data variables) throughout the execution.

Each step of a state machine is called a “state”, and the data is passed around the states as a simple JSON object. A state can invoke a Lambda, or it can be any one of a set of control flow operations — if-else, loops, wait, failure and success.

In addition, a state may invoke another state machine (even its own) — triggering a new state machine execution — and it may choose to either wait for it to finish or not.
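To make the "wait for it to finish or not" choice concrete, here is a hedged sketch of a Task state that starts another state machine, written as a Python dict that serializes to the JSON definition. The helper name, the state names and the ARN are hypothetical. The two resource identifiers are the Step Functions service integrations for starting an execution: the plain one returns as soon as the child has started, while the `.sync:2` variant pauses the parent until the child finishes.

```python
import json
from typing import Any, Dict


def nested_execution_state(
    state_machine_arn: str, wait_for_completion: bool, next_state: str
) -> Dict[str, Any]:
    """Build a Task state that triggers another state machine execution."""
    resource = "arn:aws:states:::states:startExecution"
    if wait_for_completion:
        # Pause this state until the child execution has finished.
        resource += ".sync:2"
    return {
        "Type": "Task",
        "Resource": resource,
        "Parameters": {
            "StateMachineArn": state_machine_arn,
            # Pass this state's whole JSON input on to the child execution.
            "Input.$": "$",
        },
        "Next": next_state,
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    arn = "arn:aws:states:us-east-1:123456789012:stateMachine:caller"
    print(json.dumps(nested_execution_state(arn, True, "Done"), indent=2))
```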

Painkiller

In our new design with Step Functions, we built each of the screening process steps — Batcher, Caller, Processor — with a distinct state machine.

This is a diagram of the first step — Batcher:

The first state splits our users into batches (via Lambda), and then uses a Map state (i.e., a for-loop) to iterate through them and invoke the next state machine one at a time. This solves the pain of throttling.
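A definition along these lines might look roughly like the sketch below, written as a Python dict that serializes to the state machine's JSON. It is an illustration under assumptions rather than our actual definition: the state names and ARNs are hypothetical, and each item in `$.batches` is assumed to be a small reference to a batch (for example a storage key), since the data passed between states is size-limited. The part that matters is `MaxConcurrency: 1` combined with a nested execution that waits for completion, which makes the Map state process one batch at a time.

```python
import json

# Hypothetical ARNs, for illustration only.
SPLIT_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:split-users"
CALLER_STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:caller"
)

batcher_definition = {
    "StartAt": "SplitUsersIntoBatches",
    "States": {
        "SplitUsersIntoBatches": {
            "Type": "Task",
            "Resource": SPLIT_LAMBDA_ARN,
            # Store the Lambda's output (a list of batches) in the context.
            "ResultPath": "$.batches",
            "Next": "ForEachBatch",
        },
        "ForEachBatch": {
            "Type": "Map",
            "ItemsPath": "$.batches",
            # Process a single batch at a time.
            "MaxConcurrency": 1,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "RunCaller",
                "States": {
                    "RunCaller": {
                        "Type": "Task",
                        # Wait for the Caller execution to finish
                        # before moving on to the next batch.
                        "Resource": "arn:aws:states:::states:startExecution.sync:2",
                        "Parameters": {
                            "StateMachineArn": CALLER_STATE_MACHINE_ARN,
                            "Input": {"batch.$": "$"},
                        },
                        "End": True,
                    }
                },
            },
            "End": True,
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(batcher_definition, indent=2))
```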

Here is a diagram of the second step — Caller:

Here, the first state calls the third party service, and the next series of states periodically poll for its status. Once the request is completed, another state fetches the result and finally invokes the next state machine.

This demonstrates how a state machine can be used to poll requests. Submitting a request, checking its status and determining whether to continue polling are all done by states that invoke a Lambda. The Lambda contains the logical code that evaluates these conditions.

Over and above that, the state machine provides the states Choice (if-else) and Wait that can look into the response of the Lambda states — which is added to the data context — and control the flow accordingly.
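The polling loop described above can be sketched as follows, again as a Python dict that serializes to the state machine's JSON. Everything specific here is an assumption made for illustration: the state names, the ARNs, the five-minute wait, and the `state` field with its `COMPLETED` and `FAILED` values that the hypothetical status Lambda returns. The shape is the point: a Wait state, a Lambda that checks the status, and a Choice state that either loops back or moves on.

```python
import json

# Hypothetical ARNs, for illustration only.
SUBMIT_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:submit-batch"
STATUS_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:check-status"
FETCH_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:fetch-result"

caller_definition = {
    "StartAt": "SubmitBatch",
    "States": {
        "SubmitBatch": {
            "Type": "Task",
            "Resource": SUBMIT_LAMBDA_ARN,
            "ResultPath": "$.request",
            "Next": "WaitBeforePolling",
        },
        # No Lambda is running while the state machine waits.
        "WaitBeforePolling": {
            "Type": "Wait",
            "Seconds": 300,
            "Next": "CheckStatus",
        },
        "CheckStatus": {
            "Type": "Task",
            "Resource": STATUS_LAMBDA_ARN,
            # The Lambda's response is added to the data context.
            "ResultPath": "$.status",
            "Next": "IsRequestDone",
        },
        "IsRequestDone": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.status.state",
                    "StringEquals": "COMPLETED",
                    "Next": "FetchResult",
                },
                {
                    "Variable": "$.status.state",
                    "StringEquals": "FAILED",
                    "Next": "RequestFailed",
                },
            ],
            # Still in progress: go back and wait again.
            "Default": "WaitBeforePolling",
        },
        "FetchResult": {
            "Type": "Task",
            "Resource": FETCH_LAMBDA_ARN,
            "ResultPath": "$.result",
            "End": True,
        },
        "RequestFailed": {
            "Type": "Fail",
            "Error": "ScreeningRequestFailed",
            "Cause": "The third party service reported a failed request",
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(caller_definition, indent=2))
```

In this sketch the execution ends after fetching the result; invoking the next state machine would be one more Task state after `FetchResult`.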

Let’s take a look at a diagram of the last step — Processor:

The first state splits the result into multiple parts. The next Parallel state iterates over the parts, processing each part in parallel and updating our DB.

All these state machine diagrams are actually screenshots taken directly from the AWS console. This creates a great sense of visibility, since the diagram is updated live as the state machine executes.

Furthermore, the AWS console provides the ability to cancel a state machine mid-execution, restart it with the same input, or start a new execution with ease. SQS and Lambda are missing these straightforward control operations.
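The same control operations are also reachable from code. Below is a minimal sketch, assuming `sfn` is a boto3 Step Functions client (`boto3.client("stepfunctions")`); the function names and the `ManuallyCancelled` error label are hypothetical. Re-running here means starting a fresh execution with the input of a previous one.

```python
from typing import Any


def cancel_execution(sfn: Any, execution_arn: str, reason: str) -> None:
    """Stop a running execution, recording why it was cancelled."""
    sfn.stop_execution(
        executionArn=execution_arn,
        error="ManuallyCancelled",  # hypothetical error label
        cause=reason,
    )


def rerun_with_same_input(sfn: Any, execution_arn: str) -> str:
    """Start a new execution using the input of a previous one."""
    previous = sfn.describe_execution(executionArn=execution_arn)
    started = sfn.start_execution(
        stateMachineArn=previous["stateMachineArn"],
        input=previous["input"],  # JSON string, passed through unchanged
    )
    return started["executionArn"]
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub in unit tests.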

Wrapping up with a recommendation

The new design using Step Functions is what we ended up implementing, and we are very satisfied with it. The periodic screening process has been running for a few months now, processing around 300K users a month in total, and providing us with easy maintenance.

It’s important to note that Step Functions certainly have a few cons: in short, the lack of IDE support, and the low-code configuration itself, which is hard to read and debug. Yet these still pale in comparison to the challenges presented by complex Lambda + SQS workflows.

In conclusion, I would recommend that you try AWS Step Functions in your Serverless architecture for any workflow that is a little more complicated than what a simple Lambda and SQS can achieve, especially when the workflow belongs entirely to the domain of a single (micro)service.
