Summarizing 2 years of using Serverless Architecture

Omer Baki
Melio’s R&D blog
8 min read · Feb 5, 2023


This post is a summary of a talk I gave a few months ago at a serverless conference in Berlin.

In this post I would like to share how using serverless architecture affected our work at Melio over the past two years:

  1. Handling fast business growth and scale
  2. Extending our R&D teams

What do I mean by Serverless?

My definition of serverless is a general approach of preferring managed services (in our case, on AWS). We choose to use Lambda functions, managed message brokers like EventBridge, and managed databases like Aurora and DynamoDB.

I find this approach quite organic to our development process because it doesn’t carry the overhead of maintaining these services (upgrades, custom integrations, etc.) and it adds autonomy for developers.

What is a service?

In general, I think serverless forces you to re-think the definition of what a service is. Since a service is constructed from native cloud components, it can almost always be broken down into a smaller scope. For example, a flow that reads a file from S3, parses it, transforms it and saves it to the DB can be built from a few Lambdas connected by queues, SNS topics or other brokers. Running the same logic on a server, on the other hand, would typically be a single piece of code that performs all of these tasks in one session.
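To make this concrete, here is a minimal sketch of such a flow defined with the AWS CDK in TypeScript. The construct names, handler paths and the choice of CDK itself are assumptions for illustration, not our actual infrastructure code:

```typescript
import {
  Stack, StackProps, Duration,
  aws_s3 as s3, aws_sqs as sqs, aws_lambda as lambda,
  aws_s3_notifications as s3n, aws_lambda_event_sources as sources,
} from "aws-cdk-lib";
import { Construct } from "constructs";

export class FileImportStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Incoming files land here and trigger the parser Lambda
    const bucket = new s3.Bucket(this, "IncomingFilesBucket");

    // Queue that decouples parsing from persisting
    const rowsQueue = new sqs.Queue(this, "ParsedRowsQueue", {
      visibilityTimeout: Duration.minutes(2),
    });

    // Lambda #1: reads the file from S3, parses and transforms rows, pushes them to SQS
    const parser = new lambda.Function(this, "ParseFileFn", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "parseFile.handler",
      code: lambda.Code.fromAsset("dist/parse-file"),
      environment: { ROWS_QUEUE_URL: rowsQueue.queueUrl },
    });
    bucket.addEventNotification(s3.EventType.OBJECT_CREATED, new s3n.LambdaDestination(parser));
    bucket.grantRead(parser);
    rowsQueue.grantSendMessages(parser);

    // Lambda #2: consumes parsed rows from the queue and saves them to the DB
    const writer = new lambda.Function(this, "SaveRowsFn", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "saveRows.handler",
      code: lambda.Code.fromAsset("dist/save-rows"),
    });
    writer.addEventSource(new sources.SqsEventSource(rowsQueue, { batchSize: 10 }));
  }
}
```

Each Lambda stays small and single-purpose, and the queue between them is the natural seam for scaling or replacing either side independently.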

In this sense, I find it funny to see the endless discussion about microservices versus monoliths. It seems the decision of which way to go is determined by the family you were born into :-)

I was born into a family that supports microservices, which may explain some of the decisions described in this article. We try to follow domain-driven design patterns and to think about our bounded contexts throughout these changes.

Handling fast business growth and scale

There’s no magic that suddenly fixes your system’s weak points, and you also have to resist the temptation to rebuild everything from scratch. Instead, you need the ability to make surgical changes and to make them gradually, continuously and fast.

The lifecycle of development is that you start with a naive solution and eventually reach its limitations. We want to move fast, avoid over-engineering our solutions and make sure we’re building the right thing.

Aligned with this philosophy, we try to follow Amazon’s Serverless Design Principles. I will give a couple of examples in this article.

Naive Solution for payment processing

We started with a simple batch flow that reads payments from the DB, validates the payment data, transforms it and sends it to a queue. From there we aggregate these entities into file formats that are accepted by the bank.
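A rough sketch of what that single batch Lambda might look like; the helper functions (`fetchPendingPayments`, `validate`, `transform`) and the queue URL variable are hypothetical names used only to illustrate the shape of the flow:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
// Assumed helpers, not real Melio code
import { fetchPendingPayments, validate, transform } from "./payments";

const sqs = new SQSClient({});

export const handler = async (): Promise<void> => {
  // One invocation handles the whole batch: fetch, validate, transform, enqueue
  const payments = await fetchPendingPayments();
  for (const payment of payments) {
    validate(payment);
    const entity = transform(payment);
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.AGGREGATION_QUEUE_URL,
      MessageBody: JSON.stringify(entity),
    }));
  }
};
```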

This solution worked like a charm when the company started and lasted for a year or more, handling relatively low scale and slow growth.

In the past two years, Melio was able to establish a few growth streams that brought in a large number of users and a large volume of payments. This simple solution naturally started to reach its limits.

Speedy, simple, singular

Pain — Hitting Lambda timeout

A naive batch process might, at some point, hit Lambda’s 15-minute timeout. Regular servers don’t have this limit; however, the 15-minute limitation emphasizes that parallel processing is actually the correct approach.

Solution

We break down the batch process into two steps that are connected by a queue:

  1. Fetch all payments and add them to the queue
  2. Handle a single payment per Lambda invocation: validate and transform the payment and send it to the next step of the process.
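A minimal sketch of this fan-out, again with hypothetical helper functions and environment variable names:

```typescript
import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";
import type { SQSEvent } from "aws-lambda";
// Assumed helpers, not real Melio code
import { fetchPendingPayments, validate, transform, sendToNextStep } from "./payments";

const sqs = new SQSClient({});

// Step 1: fetch all pending payments and enqueue one message per payment
export const fanOutHandler = async (): Promise<void> => {
  const payments = await fetchPendingPayments();
  // SendMessageBatch accepts at most 10 entries per call, so we chunk
  for (let i = 0; i < payments.length; i += 10) {
    const chunk = payments.slice(i, i + 10);
    await sqs.send(new SendMessageBatchCommand({
      QueueUrl: process.env.PAYMENTS_QUEUE_URL,
      Entries: chunk.map((p) => ({ Id: p.id, MessageBody: JSON.stringify(p) })),
    }));
  }
};

// Step 2: validate, transform and forward a single payment per message
export const processPaymentHandler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const payment = JSON.parse(record.body);
    validate(payment);
    await sendToNextStep(transform(payment));
  }
};
```

The second handler is attached to the queue as an SQS event source, so Lambda scales the per-payment processing out automatically and the 15-minute limit stops being a factor.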

Use events to trigger transactions

Pain — DB writes are expensive

With scale, we needed to update thousands of rows in the DB. We would read the payment file uploaded to S3, parse it to get the list of paymentIds and update the DB. The operation needed more resources, took more time and became harder to recover from on DB failures.

Solution

Same as before, we split the action into two steps connected by a queue:

  1. Read all payments that need to be updated from the file in S3 and add them to the queue.
  2. Update the DB in smaller, well-defined batches to keep each DB update cheap.
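A sketch of the second step, assuming an SQS event source configured with a small batch size and a Postgres-style `payments` table; the table name, column names and `db` helper are illustrative assumptions:

```typescript
import type { SQSEvent } from "aws-lambda";
// Assumed query helper wrapping the Aurora connection
import { db } from "./db";

export const handler = async (event: SQSEvent): Promise<void> => {
  // The SQS event source delivers a bounded batch (e.g. batchSize: 10),
  // so each invocation issues one small UPDATE instead of one huge statement.
  const paymentIds = event.Records.map((r) => JSON.parse(r.body).paymentId);
  await db.query(
    "UPDATE payments SET status = 'UPLOADED' WHERE id = ANY($1)",
    [paymentIds],
  );
};
```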

Design for failures and duplicates

Pain — Process payment at most once

When processing payments, idempotency is critical. Sometimes we have to write to different data stores, which can’t be handled with a single transaction, and we want to make sure we process each payment at most once.

Using the naive solution we encountered the following issues:

  1. Uploading files was done one by one
  2. SNS/SQS deduplication only covers a 5-minute window, so duplicates are not filtered out beyond that

We also want to be prepared for an AWS regional downtime: we want to save the state and be able to continue from another region if necessary.

Solution

Instead of creating the files in one iteration, we use DynamoDB to break down the process into independent stages.

  1. Create entries in DynamoDB
  2. Create a batch of entries according to the defined file size
  3. Read the batch into a file and upload the file

This scaled out the process and added a much-needed level of assurance when transferring hundreds of millions of dollars every day.
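A minimal sketch of the first and third stages, using DynamoDB conditional writes for idempotency; the table name, key schema and attribute names are assumptions:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Stage 1: record each payment exactly once; the condition rejects duplicates
export async function recordPayment(paymentId: string, batchId: string): Promise<void> {
  await ddb.send(new PutCommand({
    TableName: "payment-file-entries",        // assumed table name
    Item: { batchId, paymentId, stage: "PENDING" },
    ConditionExpression: "attribute_not_exists(paymentId)",
  }));
}

// Stage 3: read a whole batch back when building the file for the bank
export async function readBatch(batchId: string) {
  const res = await ddb.send(new QueryCommand({
    TableName: "payment-file-entries",
    KeyConditionExpression: "batchId = :b",
    ExpressionAttributeValues: { ":b": batchId },
  }));
  return res.Items ?? [];
}
```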

With DynamoDB global tables it was quite easy to replicate this state and set up what we needed in order to continue the processing from the same point in another region.
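For example, with the AWS CDK a table can be replicated to a secondary region with a single property; the table name, keys and region below are assumptions:

```typescript
import { Stack, aws_dynamodb as dynamodb } from "aws-cdk-lib";
import { Construct } from "constructs";

export class PaymentStateStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    new dynamodb.Table(this, "PaymentFileEntries", {
      partitionKey: { name: "batchId", type: dynamodb.AttributeType.STRING },
      sortKey: { name: "paymentId", type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      // Replicates the table to another region as a global table
      replicationRegions: ["us-west-2"],
    });
  }
}
```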

As you can see, the solutions are simple, and that is in fact one of the main advantages of this process. Unlike big and complex changes, what reason would there be to postpone a simple, fast improvement? There shouldn’t be any, and so the system continuously improves through gradual changes that are fast, focused and make the most impact.

Slowly but surely the system evolves into something like this:

In order to maintain this evolving system, we try to follow “Conway’s Law”, which observes that a system’s architecture ends up reflecting the structure of the organization that builds it, so the two should be kept aligned. For that we need to be able to extend our teams.

Extending our R&D teams

We want to keep our teams autonomous and accountable. That’s hard to do when teams share code. Adding monitoring and observability is, of course, another challenge.

In order to split teams, we create new bounded contexts.

For example, we wanted to split our Money Movement team into 2 teams:

  • ACH (Bank Transfers)
  • Paper Checks

Naturally, these teams share the existing infrastructure they have worked on thus far.

When they develop new features, we want to limit changes to shared code as much as possible, if not avoid them entirely.

Since serverless is constructed in a granular way, it’s much easier to find integration points and extend them.

For example: The Checks team needed to add a new Check Printing integration.

In order to keep the friction to a minimum, they can simply extend the processing phase and create a rollout flag that will route payments to the new CheckPrintingService they own.

This service is completely detached from the shared code and has no effect on the Bank Transfers team. It’s owned by the Checks team and monitored by them.

Using serverless architecture produced a solution that connects the processing and check-printing phases through SNS events. The team chose this point to integrate with, simply differentiating events of the “old” check-processing flow from those of the new one.
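A hedged sketch of how such routing could look: the processing phase publishes the payment event with a message attribute derived from the rollout flag, and the new CheckPrintingService’s queue subscribes with an SNS filter policy that only matches the “new” value. The topic, attribute and flag names are assumptions:

```typescript
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({});

// Publish a check payment event, tagging it with the rollout decision
export async function publishCheckEvent(
  payment: { id: string },
  useNewPrinting: boolean,
): Promise<void> {
  await sns.send(new PublishCommand({
    TopicArn: process.env.CHECK_EVENTS_TOPIC_ARN,
    Message: JSON.stringify(payment),
    MessageAttributes: {
      printingFlow: {
        DataType: "String",
        StringValue: useNewPrinting ? "new" : "legacy",
      },
    },
  }));
}
```

On the infrastructure side, the Checks team’s SQS subscription would carry an SNS filter policy on `printingFlow`, so legacy events never reach the new service and the shared code stays untouched.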

Summary

Serverless allows you to divide and conquer problems

Serverless pushes us to compose small distributed components and to make small changes.

There’s always an excuse to postpone big changes: “We’ll do it in Q2 since Q1 is packed with product features.” But there should never be an excuse to postpone critical small changes, just like fixing P1 bugs. Dividing and conquering helps you break big problems down into smaller ones by making surgical changes.

You can always decide not to do 100% of the small critical changes. But there’s never an excuse to do 0%.

Serverless limitations force refactoring

There’s always a discussion around serverless limitations: the 15-minute timeout, memory limits, the small scope of each component, etc. I find these limitations an advantage rather than a disadvantage, since they force you to make changes rather than postpone them. If you reach these limitations, it means there’s a problem with your architecture or with the implementation.

Divide an area into sub-domains

Teams need to be autonomous so they can make progress. Untangling existing code is hard, but a serverless design is already broken down into native cloud components, which makes these changes much easier to carry out.

Next Challenges

  • Debugging our services locally is a challenge. Because of the broken-down approach, our ability to run an entire service locally is limited, which forces us to rely heavily on our contracts. We developed a way to locally debug services that are deployed to our test environments (this will be described in another article).
  • End-to-end testing is also a challenge. Since the flow is broken down into components, it’s harder to simulate the entire flow in our CI. The best option is to have dedicated AWS environments to run these flows on.

Thanks a lot for reading!

Visit our career website
