Low-budget Banking service using Amazon Cloud: architecture, performance & cost

Vadzim Tashlikovich
15 min read · Mar 18, 2023


The success of any banking application depends on how efficiently its backend can handle the user traffic. With the rise of cloud technology, businesses can leverage the benefits of scalability, flexibility, and cost-effectiveness. In this article, I will explore the performance of a banking application backend hosted in the AWS cloud.

My tests revealed that a system with a low-budget cloud environment setup was able to handle 200 heavy create-transaction requests per second (rps), and around 890 rps at peak, while operating on a 30M-record data set. Moreover, I will delve into the infrastructure cost of hosting such a system in the cloud. My findings suggest that the cost is surprisingly low, at just $673 per month (as of 15 Mar 2023). Join me as we dive deeper into the performance and cost aspects of running a banking application backend in the AWS cloud.

Article bonus: a benchmark of the test solution implemented in Java and Node.js, with a public codebase.

In the article below you will find answers to the following questions:

  • What components does a good banking architecture consist of?
  • What operations should a transaction go through to be processed?
  • How much does a banking platform cost in the cloud?
  • Is Java still a good choice for online services with complex processing logic?
  • What income can such a low-budget system generate?

Table of contents:

What is a low-budget setup?
Measurement criteria
How to do it wrong, a straightforward/monolithic setup
Banking application challenges
How to do it right
Benefits of such architecture
Tech stack for the performance tests
Results for Node.js
Results for Java
The cost of the production system
Resolution
What income could such a system generate?

What is a low-budget setup?

For my experiment I used only low-budget components from the AWS price list. The goal was to get the cheapest components that do the job. Example prices (as of 15 Mar 2023, US East (Ohio) region):

  • PostgreSQL as a database; setup: db.r5.large (the second cheapest instance, 2 vCPUs and 16 GB memory) costs $171 per month;
  • RabbitMQ (Amazon MQ): mq.m3.micro, single-node setup;
  • EC2 with 2 vCPUs and 2 GB memory costs only $14 per month;
  • ALB load balancer for routing user requests, standard minimal price;
  • Route 53 for DNS management, standard minimal price;
  • CloudFront for serving static files, standard minimal price;
  • NAT gateway for a dedicated fixed IP, standard minimal price;
  • SES for sending notifications, standard minimal price.

This is the minimum set of technology components you need for a banking system. Why do we need all these components, and what do they actually do? I will cover that throughout the article.

Measurement criteria

To measure the performance of the system, I focused on the most critical component: the transaction creation process. This means that the requests per second (rps) figures I obtained apply only to create/write operations. In reality, a banking application involves more read/GET operations, with a write/read ratio of about 20/80.

Although the application I tested is a real banking application that involves a range of other functions besides transaction creation and updates, the infrastructure I used has the potential to support account management, KYC, and other features.

My main focus was to determine how many transactions a banking application can create per second in a cloud setup. To achieve this, I used Artillery (https://www.artillery.io/), a load-testing tool. The test scenario was simple: calling a transaction-create endpoint. I ran a warm-up session of 10 seconds followed by a high-load session of 60 seconds, and for the final setups I used longer sessions of 5 minutes.

It’s worth noting that the test scenario involved only 500 accounts. This approach allowed me to simulate high activity among a small group of users and a situation where there are many transactions per account in a short amount of time, such as a partner system communicating with ours via a REST API.

You can read here about PostgreSQL record insertion speed on the cheapest AWS DB.

How to do it wrong, a straightforward/monolithic setup

Basically, one may think that a simple monolithic setup would be a good solution to serve all user requests of a banking application. By the monolithic approach here I mean a single service scaled horizontally, as in the picture below:

Classic straightforward architecture approach

In this diagram, a single API service does everything:

  • user and operator authentication;
  • user onboarding, accounts & payments management;
  • all operator operations regarding reporting and back office management;
  • user notifications;
  • integration with a partner bank.

This setup can indeed have its place in the real world. But only for demonstration purposes, as a Proof Of Concept.

Though the cost of this infrastructure is very attractive, it has a lot of critical flaws: lack of security, low processing speed, conceptual mistakes while processing several transactions in parallel for 1 banking account, instability, lack of compliance.

Banking application challenges

Before “doing it right”, let’s have a look at the basic requirements a banking application should meet:

  • high availability (HA): always be able to tell the customer what is happening with their banking account and payments, 24/7, even under high load;
  • be secure and compliant with several standards and banking regulatory requirements;
  • be able to interact with several partner banks;
  • be able to correctly process account balances in parallel.

In terms of transaction creation, let’s have a look at the basic set of steps needed to successfully process a transaction:

What should be done to create and process a transaction

So to create a transaction, the following actions should be performed:

  • check account balance;
  • check user/account limits;
  • calculate a fee;
  • calculate exchange rate (optional);
  • check blacklists;
  • check beneficiaries;
  • check for fraud;
  • reserve balance;
  • store the transaction;
  • interact with the partner system.

Some of these actions need to happen within a single database session, which takes a lot of time. Some of the operations are asynchronous by nature: we cannot block the customer for a long time, making her wait for an answer.

And evidently this cannot be done in a monolithic app:

  • What happens to the customer if the partner system does not respond now?
  • What if there are several incoming/outgoing transactions for 1 banking account that require a balance correction, when the account is blocked by another operation?
  • How can we scale?

How to do it right

The answer that solves the majority of these issues is moving to promises. The system does not perform all actions at once but quickly runs only the critical checks and promises to accomplish the rest later. This is where asynchronicity comes in.

First, the application performs the most critical operations: checking the available balance, daily/monthly limits, calculating a fee, etc. Then the system leaves the user with a promise that the transaction will not be forgotten and is guaranteed to be processed at some point in the future. So we need queues.

Second, concurrency over a single banking account should be eliminated. That means that for a single account we need to process transactions one by one. If we made a single transaction-processing service work this way with all accounts, throughput would be capped by the duration of each update/create operation: if a transaction takes 50 ms to complete, we get only 20 transactions per second.

To scale the system, we should route each request to the appropriate account-bound dedicated service by hashing the request. This approach results in a set of worker services, each processing the transactions of its accounts one by one, which multiplies the system’s throughput. More account groups mean more services and greater throughput. A minimal sketch of such routing is shown below.

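Here is a minimal sketch of that routing idea in TypeScript (the queue names, the queue count, and the hashing scheme are my assumptions for illustration, not the exact implementation from the codebase):

```typescript
import { createHash } from 'crypto';

// Hypothetical number of per-account-group processing queues/services.
const QUEUE_COUNT = 6;

// Map an account ID to a queue so that all transactions of the same account
// always land in the same queue and are therefore processed strictly in order.
function queueForAccount(accountId: string): string {
  const digest = createHash('sha256').update(accountId).digest();
  const index = digest.readUInt32BE(0) % QUEUE_COUNT;
  return `tx-queue-${index}`;
}

// The same account always maps to the same queue.
console.log(queueForAccount('ACC-000042'));
```

Adding more queues (with a matching number of consumer services) spreads accounts across more workers, while the per-account ordering guarantee is preserved.
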
Finally the design looks like this:

“The right” approach

Warning: this is the minimum low-budget setup that can guarantee scaling, correct transaction processing, correct communication with banking partners, and HA. But it is still missing:

  • separate authentication service;
  • separate notification service;
  • a fraud detection service;
  • onboarding service;
  • dead-letter queues and appropriate processors to become more fault-tolerant.

The list can be extended depending on the budget and the desire to reach higher marks in the security, availability, resilience, and observability areas.

Benefits of such architecture

Let’s walk through a sample transaction lifecycle:

  1. The browser sends transaction details to the API service;
  2. The stateless API service calculates the fee, checks the available balance, checks user/account limits, and returns an error to the user if something is wrong;
  3. If everything is fine, the API service creates a new transaction message in the appropriate queue after hashing (see the publishing sketch after this list);
  4. When a Processing service gets to the message in the queue, it performs all the remaining operations and either silently processes the transaction and passes it to the partner bank, OR notifies the user via a notification channel if something went wrong;
  5. When the partner bank confirms completion of the transaction, it notifies our system’s connector;
  6. The partner connector puts a transaction status-change message into the queue of the appropriate processing service, again after hashing;
  7. The processing service finalizes the transaction.

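Here is a minimal sketch of step 3, publishing the transaction message to RabbitMQ with amqplib (the connection URL, message shape, and queue naming are assumptions; the hashing helper mirrors the earlier sketch):

```typescript
import amqp from 'amqplib';
import { createHash } from 'crypto';

const QUEUE_COUNT = 6; // must match the number of processing services

// Same account-to-queue hashing as in the earlier routing sketch.
function queueForAccount(accountId: string): string {
  const digest = createHash('sha256').update(accountId).digest();
  return `tx-queue-${digest.readUInt32BE(0) % QUEUE_COUNT}`;
}

// Called by the API service after the fast checks (balance, limits, fee) pass.
async function enqueueTransaction(tx: { id: string; accountId: string; amount: number }) {
  const conn = await amqp.connect(process.env.AMQP_URL ?? 'amqp://localhost');
  const channel = await conn.createConfirmChannel();

  const queue = queueForAccount(tx.accountId);
  await channel.assertQueue(queue, { durable: true });

  // Persistent message on a confirm channel: the broker acknowledges that our
  // "promise" to the user has been durably stored before the API responds.
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(tx)), { persistent: true });
  await channel.waitForConfirms();

  await conn.close();
}
```

A real service would, of course, keep one long-lived connection and channel instead of opening them per request; that is omitted here for brevity.
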
Benefits here:

  • the user does not wait; we promise to process her transaction;
  • the system load is well distributed;
  • the system has enough time to process a transaction and to communicate correctly with partner systems;
  • we can wait for a partner’s response for hours or days, not milliseconds or seconds;
  • the correct order of transactions is preserved, and account balances are always updated correctly;
  • the system is open to CQRS improvements, because the API service only reads and does not write; it is only the Processing service that writes.

Tech stack for the performance tests

I tried two technical stacks in my experiment to see how fast such a banking system can be: the backend was implemented in both Node.js and Java. Cloud components with the same characteristics (CPU, memory) were used in both cases, of course. AWS RDS PostgreSQL (db.r5.large, 2 vCPUs, 16 GB memory) was used as the database.

If you are interested, you can find the sources of the experiment in the repository.

Node.js stack — straightforward/monolithic

First, the monolithic architecture, where one service performs all the operations at once:

Here the REST API service was run in a cluster using PM2, so in total one Node.js process per core on the 2-CPU instance. Component versions used:

  • Node.js version: 16.15;
  • NestJs framework, version: 8.x
  • Sequelize framework, version: 6.18.x
  • Database: PostgreSQL, version: 13.2.

In the experiment, the following actions were implemented for the transaction creation flow:

  • generating transaction ID (UUID);
  • creating a signature (for idempotency check);
  • simple fee calculation based on transaction amount;
  • signature lookup;
  • checking if customer and account are enabled (not blocked);
  • check user monthly limit;
  • check beneficiary party in the blacklist;
  • check if balance allows to perform the operation;
  • initiating a DB transaction with a specific record lock timeout;
  • in a single DB transaction: getting the account with SELECT FOR UPDATE; inserting a new transaction record; updating account balances (available and reserved); updating the user’s monthly limit (a sketch of this DB transaction follows the list).
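
Here is a minimal sketch of that final DB transaction using Sequelize with raw SQL (the table and column names are illustrative assumptions, not the exact schema from the repository):

```typescript
import { Sequelize, QueryTypes } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL ?? 'postgres://localhost/bank');

// One short DB transaction: lock the account row, insert the transaction,
// then adjust balances and the monthly limit. While the row is locked,
// no other writer can touch this account, which keeps the balance consistent.
async function persistTransaction(txId: string, accountId: string, userId: string, amount: number, fee: number) {
  await sequelize.transaction(async (t) => {
    // SELECT ... FOR UPDATE blocks concurrent writers on the same account row.
    await sequelize.query(
      'SELECT id FROM accounts WHERE id = :accountId FOR UPDATE',
      { replacements: { accountId }, type: QueryTypes.SELECT, transaction: t },
    );

    await sequelize.query(
      `INSERT INTO transactions (id, account_id, amount, fee, status, created)
       VALUES (:txId, :accountId, :amount, :fee, 'NEW', now())`,
      { replacements: { txId, accountId, amount, fee }, transaction: t },
    );

    await sequelize.query(
      `UPDATE accounts
          SET available_balance = available_balance - :total,
              reserved_balance  = reserved_balance  + :total
        WHERE id = :accountId`,
      { replacements: { accountId, total: amount + fee }, transaction: t },
    );

    await sequelize.query(
      'UPDATE user_limits SET monthly_spent = monthly_spent + :amount WHERE user_id = :userId',
      { replacements: { amount, userId }, transaction: t },
    );
  });
}
```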

Node.js stack — the right one

The setup to test the “right” approach looks like this:

RabbitMQ version 3.8 was used here.

In this setup the REST service performed only these operations:

  • signature creation (for idempotency check);
  • calculating a fee;
  • signature lookup;
  • checking if customer and account are enabled (not blocked);
  • check user monthly limit;
  • check if balance allows to perform the operation;
  • generating transaction ID (UUID);
  • hashing for selecting a message queue;
  • sending a message to the appropriate queue.

And the Processing service (qwriter) was doing this:

  • generating transaction ID (UUID) (optional);
  • creating a signature (for idempotency check);
  • simple fee calculation based on transaction amount;
  • signature lookup;
  • checking if customer and account are enabled (not blocked);
  • check user monthly limit;
  • check beneficiary party in the blacklist;
  • check if balance allows to perform the operation;
  • getting account before balance correction;
  • inserting a new transaction record;
  • updating account balances (available and reserved);
  • updating user monthly limit.

So the REST service in this setup performed far fewer operations; the heavier work was shifted to the queue consumers.
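
A minimal sketch of such a consumer with amqplib is shown below (the queue name and handler are assumptions). The key points are prefetch(1), so the service takes messages strictly one at a time, and acknowledging only after the transaction has been persisted:

```typescript
import amqp from 'amqplib';

// Placeholder for the actual processing logic described in the list above
// (checks, balance correction, inserting the transaction record, etc.).
async function processTransaction(tx: unknown): Promise<void> { /* ... */ }

// Each qwriter instance consumes exactly one queue, so the transactions of the
// accounts hashed to that queue are processed strictly one by one.
async function startQwriter(queueName: string) {
  const conn = await amqp.connect(process.env.AMQP_URL ?? 'amqp://localhost');
  const channel = await conn.createChannel();

  await channel.assertQueue(queueName, { durable: true });
  await channel.prefetch(1); // never work on more than one message at a time

  await channel.consume(queueName, async (msg) => {
    if (!msg) return;
    try {
      const tx = JSON.parse(msg.content.toString());
      await processTransaction(tx);
      channel.ack(msg); // acknowledge only after the DB work has succeeded
    } catch (err) {
      channel.nack(msg, false, true); // requeue on failure (a dead-letter queue would be better)
    }
  });
}

startQwriter('tx-queue-0');
```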

Java stack — straightforward/monolithic

As for Java, these are the tech specifications:

  • Java JDK 11, default heap size: 512 MB;
  • Spring Boot 2.7.* framework was used.

Java stack — the right one

The “right” setup was checked as well: a single Java service for the REST API and 8 Node.js services on the Processing (qwriter) side.

Alas, a setup with 8 dedicated Java consumers, one per queue, was not checked. Instead, I focused on how fast Node.js can be as a consumer service.

It’s worth mentioning that for transaction storage in PostgreSQL, table partitioning was applied on a “created” column of the “timestamp” type. This decreases the load on the main table by using monthly partitions (the granularity can be changed to suit your needs). A sketch of the idea is shown below.
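
Here is a minimal sketch of such partitioning, expressed as raw DDL executed through Sequelize (the table and column names are illustrative, not the exact schema from the repository):

```typescript
import { Sequelize } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL ?? 'postgres://localhost/bank');

// Parent table partitioned by the "created" timestamp; each month gets its own
// child table, so hot inserts and index maintenance touch only a small partition.
async function createPartitionedTransactionsTable() {
  await sequelize.query(`
    CREATE TABLE IF NOT EXISTS transactions (
      id         uuid      NOT NULL,
      account_id uuid      NOT NULL,
      amount     numeric   NOT NULL,
      fee        numeric   NOT NULL,
      status     text      NOT NULL,
      created    timestamp NOT NULL,
      PRIMARY KEY (id, created)
    ) PARTITION BY RANGE (created);
  `);

  // One partition per month; in production these are usually created ahead of
  // time by a scheduled job or migration.
  await sequelize.query(`
    CREATE TABLE IF NOT EXISTS transactions_2023_03 PARTITION OF transactions
      FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
  `);
}
```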

Results for Node.js

Around 100 test sessions were run to make sure the results were stable and did not deviate. But first, it was necessary to understand the maximum number of POST requests that could be transferred and served by the server if it did nothing. This setup was used:

The number is: 650 dummy rps for a single Node.js process.

The first result: numbers for the straightforward/monolithic Node.js setup when working with 1/15/30 million transaction records in the database:

RPS for different volumes of data for a classic approach

Explanation and details:

  • 75 rps means it took around 13 ms to execute one transaction on the 1M-record data set;
  • 35 rps means it took around 29 ms to execute one transaction on the 30M-record data set;
  • with every additional 15M records, SQL SELECT queries get slower, leaving less time for a single service to process one POST request;
  • beyond these numbers (80 rps for 1M records, 50 for 15M, 40 for 30M) the system hit deadlocks, which led to total collapse.

The numbers for the “right” setup:

RPS for different volumes of data for “the right” approach

Explanation and details:

  • exploratory analysis showed that with this infrastructure setup, 6 is the ideal number of Node.js processing services; a higher number didn’t give any significant performance boost;
  • 6 processing services allow creating and processing transactions at a rate of 168 rps, even on the 30M-record data set;
  • since we are using an asynchronous approach, the REST API service can accept far more than 170 requests per second; I got bursts of up to 400 rps.

The significant difference from the monolithic approach is that the system is now much more performant and much more stable. Even beyond 400 rps the test stand keeps working, but response latencies start growing up to 1 second. Of course, this does not count as good UX, but the system stays responsive.

Here is how the P99 latency (ms) grows with increasing load using the optimal setup:

P99 latency per rps

An important note here: bursts above 200 rps do not come for free. The higher the incoming request rate, the longer it takes to process the backlog, because the processing units can still handle only 200 create-transactions per second. So if the system experiences 400 rps for 1 minute, the last user transactions will be processed only a minute later. That is not acceptable: users should receive a success message within 3 seconds at most. This limits the allowed spikes to roughly 400 rps for no more than 3 seconds, as the sketch below illustrates.

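A quick back-of-envelope check of that burst limit (the 200 rps processing capacity comes from the measurements above; the rest is simple arithmetic):

```typescript
// Fixed processing capacity measured for the "right" Node.js setup.
const PROCESSING_RPS = 200;

// Worst-case wait (in seconds) for the last request of a burst:
// the backlog built up during the burst divided by the drain rate.
function worstCaseWaitSeconds(burstRps: number, burstSeconds: number): number {
  const backlog = Math.max(0, burstRps - PROCESSING_RPS) * burstSeconds;
  return backlog / PROCESSING_RPS;
}

console.log(worstCaseWaitSeconds(400, 60)); // 60 s: unacceptable for interactive users
console.log(worstCaseWaitSeconds(400, 3));  // 3 s: roughly the longest tolerable burst
```
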
Example of increasing queue load for a big incoming request rate of 400 rps:

Message queue state over time for a burst load

Example of good and bad incoming request rate spikes:

Different spikes for good and bad UX

Here, users in the blue spike wait for a transaction success message (a promise) for at most 3 seconds. Users from the green diagram can wait up to 10 seconds.

Results for Java

In my experience, pure CPU-bound tasks complete about 3x faster in Java than in Node.js. That comes from a fundamental benefit: bytecode compilation.

A banking/payment application contains a mixture of operations, with a large share of I/O tasks. The most important point in our case: the faster the code, the less time a database connection stays blocked. And this is crucial.

For the straightforward/monolithic setup, the Java service managed to work stably up to 100 rps before deadlocking. The P99 latency was pretty attractive: 96.6 ms on average.

For the “right” setup, Java was used only for the REST API in my tests. On the processing side, I ran 8 Node.js processing services, which gave a stable 200 operations per second even with a large number of transactions in the DB. The Java REST API service demonstrated an incredible boost, up to 890 rps! And it seems that is not the limit.

Latencies during an increasing load:

P99 latency per rps for Java setup

As you can see, the latency is almost constant; it practically does not increase with growing load. During the tests I could not push the rps any higher, but evidently the system can hold more.

Probably I hit the traffic capacity of my dedicated AWS internal network. Anyway, the Java REST API is at least 2.2x faster than the Node.js solution for this sample banking functionality. I assume that after some tuning, Java could easily hit the 1000 rps mark.

The cost of the production system

Here is the production system cost calculation. Banking regulations affect the system architecture, the software complexity, and the final cost of the infrastructure. The most crucial factor is the high availability (HA) compliance requirement: basically, the system should be replicated and survive point failures or burst load. The compute nodes for the services are the same ones used in the tests; they have enough resources to run much more complicated logic than the one used for the performance measurements.

Components and their cost in AWS cloud:

  • routing, Route53, 1 zone, 50Mln queries per month, $20
  • AWS shield standard, $0
  • Cloudfront, 50Mln requests, $50
  • S3 standard, 100Gb, $3
  • Load balancer, $46
  • Secrets manager, $5
  • NAT, $33
  • SES, $60
  • x4 EC2, t4g.small (2 vCPU, 2GB memory, 20Gb disk) for authentication, REST API, processing and backoffice services — 4 x $14 = $56
  • RabbitMQ 1 node (mq.m3.micro), $30
  • x2 PostgreSQL; db.r5.large, 2 CPU, 16GB memory, 20GB of general purpose SSD (gp2), 2 x $171 = $342
  • x1 Monitoring instance for Prometheus and Grafana, $14
  • x1 EC2 for each external connector/integration = $14

The total monthly infrastructure cost: $673.

The database is the most expensive component. Since the system needs 2 instances (1 active, 1 passive failover replica), we pay twice the price of a single database instance ($171). Could a cheaper DB be used? Yes, db.m6g.large (2 vCPUs, 8 GiB memory, $120) could be used. But with less memory, a big set of transactions will cause rps to drop to 100, if not lower, because the database will not be able to flush data from memory to the SSD disks fast enough. So a cheaper DB is possible when there are fewer transactions. But we want to earn more on more transactions, right? 😀

A note about Kubernetes. A setup where the nodes are managed by Kubernetes is a very good option: it can speed up infrastructure operations and monitoring, and it can decrease maintenance costs. What would the production system cost if it were based on Kubernetes? Here it is:

  • Routing, Route53, 1 zone, 50Mln queries per month, $20
  • AWS shield standard, $0
  • Cloudfront, 50Mln requests, $50
  • S3 standard, 100Gb, $3
  • Secrets manager, $5
  • NAT, $33
  • SES, $60
  • RabbitMQ 1 node (mq.m3.micro), $30
  • x2 PostgreSQL; db.r5.large, 2 CPU, 16GB memory 2 x $171 = $342
  • x1 Monitoring instance for Prometheus and Grafana, $14
  • Kubernetes EKS cluster cost: $74
  • x2 EC2, a1.xlarge (4 vCPU, 8GB memory, 50Gb disk) as multi-purpose Kubernetes nodes — 2 x $79 = $158

The total monthly K8s-based infrastructure cost: $789.

Resolution

In this article you have seen a banking application test stand that is very close to a real one:

  • a more or less exact list of operations performed for each transaction creation;
  • infrastructure capable of interacting with any number of integrations;
  • a system capable of serving a decent number of daily active users;
  • a system tuned to be HA and compatible with PCI DSS and banking regulations.

Such a system is capable of creating 200 new transactions per second. It can hold load spikes up to 890 create requests per second.

Keeping the 80%/20% read/write request ratio in mind, we can expect roughly this performance:

  • ~650 rps of mixed read/write operations in normal mode;
  • ~2800 rps in high peaks.

Such a performant system, compliant and HA ready, costs $673 per month in AWS.

And Java still proves that it’s a highly performant language for a backend.

What income could such a system generate?

Some fantasies about the income such a system could generate.

Let’s assume users are mostly active during the daytime, 16 hours out of 24, and the average rate of new transactions is 16 per second. This rate generates roughly 30M new records per month. The average transaction fee is $0.50.

So that’s $15M 💵. Hm, not bad for a system that costs $673 a month!
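
A quick back-of-envelope check of these figures (all of them assumptions, not measurements):

```typescript
// Rough monthly volume and fee income under the assumptions above.
const txPerSecond = 16;       // average rate during active hours
const activeHoursPerDay = 16; // users are mostly active 16 hours out of 24
const daysPerMonth = 30;
const feePerTx = 0.5;         // USD

const txPerMonth = txPerSecond * activeHoursPerDay * 3600 * daysPerMonth;
const monthlyFees = txPerMonth * feePerTx;

console.log(txPerMonth);  // ~27.6 million, i.e. roughly the 30M records mentioned above
console.log(monthlyFees); // ~$13.8M, i.e. on the order of $15M per month
```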

If you have any questions, comments, or improvements, do not hesitate to contact me.

Link to the experiment codebase.

Icons by: https://olkeen.com/portfolio/vector-illustrations/.

Photo by Ferran Fusalba Roselló on Unsplash.
