Using a Buffer Microservice and Amazon SQS to Reduce the Load on a DB

Yedidya Schwartz
WalkMe Engineering
10 min read · Jun 30, 2020

Introduction

Recently we found ourselves in a situation in which we had to prepare for a big load increase on one of our databases. While the easiest solution is to scale vertically (e.g. increase the DB machine's power), it might not be the best or the most cost-effective one.

In this article, I'll walk you through how we used Amazon SQS and a buffer microservice to withstand virtually unlimited load while saving a lot of money in the process. In addition, I'll share a way to load-test your architecture.

The Problem and Possible Solutions

The problem we faced was a significant increase in write operations to the DB. We expected the DB machine to become unavailable under the load, risking downtime.

We were about to sign a contract with a very big customer. Part of the contract was the usage of a certain WalkMe feature that relies on a specific DB and requires a lot of write operations to it.

When we came to solve the problem, we ended up with three possible solutions:

Short-term solution: Set RDS to the "Provisioned IOPS" storage type, increase the RDS instance size, and consider turning off some peripheral DB-exhausting controllers.

Midterm solution: Create a messaging mechanism between the most DB-consuming services using one of the standard messaging technology stacks available.

Permanent solution: Move the relational database workload to a NoSQL solution that is built for high scale and performance.

The short-term solution was rejected because a quick calculation showed we should expect the load to multiply by around 20. Even with autoscaling in place, vertical scaling would very likely not only cost a lot, but also result in downtime.

Because the time and resources required for the refactor involved in the permanent solution were estimated as very high, the permanent solution was also rejected at this point.

We expected the load to be in write operations rather than reads, so we did not have to deal with an immediate response to the client. Hence, we came up with a resolution based on the midterm solution: control the flow of write requests to the DB according to parameters like DB CPU usage and query queue length.

But before we dive into the solution and how we implemented it, let's get some background on the Player DB and the feature the customer was interested in.

Player DB and User Storage

End-users interact with the WalkMe content during their usage of our customers' applications. We can sort the data we save about users' progress into two types:

  1. Data stored in the client (local storage, for instance). This data is not critical to keep track of across machines, so if a user loses it, the customer is not harmed in terms of data loss.
  2. Data stored server-side. This data is critical for our customers, since it is used for BI purposes, certifications, onboarding processes, etc.

The new big customer needed the second type… on steroids.

Figure 1: WalkMe Onboarding Tasks menu. The grayed out tasks are completed tasks. The information about completed tasks per end-user is stored in the Player DB.

The DB that stores the user storage data is called Player DB, and it is separated from our other DB servers (from the Editor DB, for example).

Therefore, the main load threat pointed at the Player DB. Once the new customer's end-users start massively writing their user storage to the Player Server, the Player DB might struggle to handle such a large number of requests. At this point, we understood we had to find a way to guarantee the DB's 100% uptime in any situation.

The Architecture Before the Change

Figure 2: Player server old architecture (diagram by Pavel Shklovsky)

As can be seen in figure 2, requests generated by the end user's interaction with WalkMe are sent from the browser to the Player Server load balancer via Route 53. The load balancer redirects all traffic to one of the healthy Player Server EC2 instances. Those instances are governed by the Player Server Auto Scaling Group.

Player Server processes the requests and produces queries which are sent directly to the Player DB.

In the case of a significant amount of requests to the Player Server, the Auto Scaling Group policy will increase the number of Player Server instances. While this is good for server stability, the single Player DB won't be able to handle such a volume of requests.

Because most of the requests are write-requests and not read-requests, we don’t need to get an immediate response back to the client.

Therefore, our solution constraint was more about availability (we want the DB to be available) than consistency (we do not need reads to be accurate in real time, so we can allow for a delay in write operations).

The Final Solution: Messaging Queue and a Buffer Overflow Manager

Figure 3: Player Server new architecture (diagram by Pavel Shklovsky)

As can be seen in figure 3, the Player Server architecture hasn't changed significantly. In the new architecture, instead of sending requests directly to the Player DB, the Player Server produces messages into Player SQS (an AWS message queue).
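For illustration, producing such a message from the Player Server could look roughly like the following sketch with the AWS SDK for Node.js; the queue URL, message shape, and region are assumptions, not the actual WalkMe code:

```typescript
// Minimal sketch (not the actual WalkMe code): instead of writing to the
// Player DB, the Player Server publishes each write operation as a message
// to the Player SQS queue.
import { SQS } from "aws-sdk";

const sqs = new SQS({ region: "us-east-1" }); // region is an assumption
const PLAYER_QUEUE_URL = process.env.PLAYER_QUEUE_URL ?? ""; // assumed env var

// Hypothetical message shape for a user-storage write.
interface UserStorageWrite {
  userId: string;
  taskId: string;
  completedAt: string;
}

export async function enqueueUserStorageWrite(write: UserStorageWrite): Promise<void> {
  await sqs
    .sendMessage({
      QueueUrl: PLAYER_QUEUE_URL,
      MessageBody: JSON.stringify(write),
    })
    .promise();
}
```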

The consumer of Player SQS is a microservice called the Buffer Overflow microservice (the Buffer from now on). This new element in the system digests the messages from the queue and feeds the Player DB without exceeding the DB limits (write IOPS / CPU utilization). The Buffer may have a few instances, as a function of the number of queries it needs to pop from SQS. The number of digested queries is set according to several live parameters taken from the DB machine.

The Buffer

The Buffer's job is to pop messages from Player SQS in a controlled manner. The tempo is decided by several live parameters, which include the DB CPU usage, request queue length, number of Buffer instances, and more. Every message popped by the microservice is sent to the DB; in other words, only the microservice has write access to the DB, and the Player Server never writes directly to it.

This is the key part of the Buffer Overflow microservice. The entry point triggers the health check and monitoring loops. These loops periodically query the relevant stats and decide how many messages will be popped from SQS on each iteration.

Overflow Buffer microservice entry point

The logic is based on a multi-throttle mechanism: only when all the promises resolve correctly does the Buffer keep popping messages from SQS. Each promise represents a throttle condition.
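As a rough illustration of this idea (not the actual WalkMe implementation), such a loop could look like the sketch below; the Throttle interface, popBatch, writeToDb, and decideBatchSize are hypothetical names:

```typescript
// Rough, hypothetical sketch of the Buffer's main loop: keep popping
// messages from SQS only while every throttle condition resolves.
interface Throttle {
  // Resolves when the condition allows popping; otherwise waits.
  check(): Promise<void>;
}

async function runBufferLoop(
  throttles: Throttle[],
  popBatch: (size: number) => Promise<string[]>,
  writeToDb: (messages: string[]) => Promise<void>,
  decideBatchSize: () => Promise<number>
): Promise<void> {
  while (true) {
    // Each promise represents one throttle condition (RDS health, queue
    // length, thread count, and so on). All must pass before the next pop.
    await Promise.all(throttles.map((t) => t.check()));

    const batchSize = await decideBatchSize(); // derived from live DB stats
    const messages = await popBatch(batchSize);
    if (messages.length > 0) {
      await writeToDb(messages);
    }
  }
}
```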

In software, a throttling process, or a throttling controller as it is sometimes called, is a process responsible for regulating the rate at which application processing is conducted, either statically or dynamically. (Wikipedia)

What do our throttlers check?

RdsHealthThrottle (RDS is Amazon Relational Database Service) checks whether the DB machine's CPU is at 90% or above. If so, it does not allow popping messages from SQS (for the implementation, see appendix 2).

DbQueueThrottle checks if there are enough DB connections in the connection pool for the current worker, and if not, waits until there are free connections (Also see Semaphore).

EmptyQueueThrottle checks if the SQS is empty, and if so, checks again after 1 second.

MaxThreadsThrottle checks if there are more than X Node instances running in the microservice, and if so, waits until the number decreases (currently configured to a maximum of 100 threads).

RdsThresholdThrottle controls the pop pace according to DB parameters. For example: 72% CPU will decrease the pace by 5%, 74% by 8%, etc.
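To make one of these concrete, here is a simplified, hypothetical sketch of what a throttle like RdsHealthThrottle could look like, reusing the Throttle interface from the sketch above; getRdsCpuUtilization is assumed to read the CPU metric from CloudWatch (see appendix 2):

```typescript
// Simplified, hypothetical sketch of an RDS health throttle: block message
// popping while the DB machine's CPU is at 90% or above.
class RdsHealthThrottle implements Throttle {
  constructor(
    private getRdsCpuUtilization: () => Promise<number>, // assumed CloudWatch reader
    private retryDelayMs = 5000 // assumed re-check interval
  ) {}

  async check(): Promise<void> {
    while ((await this.getRdsCpuUtilization()) >= 90) {
      // CPU is too high: wait before re-checking instead of allowing a pop.
      await new Promise((resolve) => setTimeout(resolve, this.retryDelayMs));
    }
  }
}
```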

Validating the Architecture

Our new architecture is supposed to withstand a load test. To see whether it does, we'll formalize our hypothesis so we can put it to the test.

Given a peak in traffic equal to 5,000 requests per second, and a DB machine that can handle 250 requests per second, we'd expect to see the following:

  1. An increase in the player requests.
  2. An increase in the DB CPU usage.
  3. Followed by an increase in queue size.

After the peak we’d expect to see a decrease in all parameters:

  1. Player requests.
  2. Followed by a decrease in queue messages.
  3. Followed by a decrease in DB CPU usage.

Most importantly, we expect the CPU usage of the DB machine not to exceed 90%. For this, we need two things: a way to emulate the load, and a way to access and visualize the expected data.

The flood is here

In order to simulate the load, we used a tool called Flood.io. This tool gives us the ability to generate huge traffic to the server with a simple JMeter script.

Flood.io also lets us use EC2 spot instances from our own company AWS account as the agents that send the requests, which keeps the cost of a load-test run low.

With this tool we were able to run a load test of millions of requests for only ~$20 per run, so we could see whether everything works as expected, change values and configurations if needed, and run again and again until we had perfectly fine-tuned the process.

Figure 4: Flood.io interface while running a test

The “full picture” through Grafana

Just throwing traffic at our system would be worthless without actually seeing the effects. For this we used Grafana.

Grafana is an open platform for analytics and monitoring, which is used to visualize all the relevant project metrics in one dashboard.

We added all our microservice metrics, together with the DB machine performance metrics, to one Grafana dashboard, which gathers all the data and gives us the "full picture". We also pushed the queue status to CloudWatch, so Grafana could get all the needed values live from one source.
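As an illustration, pushing a custom queue-length metric to CloudWatch from the Buffer could look roughly like this sketch; the namespace, metric name, and region are made-up examples, not our actual configuration:

```typescript
import { CloudWatch } from "aws-sdk";

const cloudwatch = new CloudWatch({ region: "us-east-1" }); // assumed region

// Publish the approximate number of messages waiting in Player SQS as a
// custom CloudWatch metric, so Grafana can read it from the same source
// as the RDS metrics. "PlayerBuffer" / "QueueLength" are example names.
export async function reportQueueLength(queueLength: number): Promise<void> {
  await cloudwatch
    .putMetricData({
      Namespace: "PlayerBuffer",
      MetricData: [
        {
          MetricName: "QueueLength",
          Value: queueLength,
          Unit: "Count",
          Timestamp: new Date(),
        },
      ],
    })
    .promise();
}
```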

Figure 5: Player Server Grafana dashboard

Figure 5 is taken from our test environment while performing a load test. This screenshot actually tells the whole story through its visualizations.

In the test, we simulated a scenario we expected to have in production. Coming back to our hypothesis, if it is true, we expected to see the following:

  1. Figure 5.D: Player Server gets a lot of requests (~20 million).
  2. Figure 5.E: Player Server Auto Scaling Group increases the Player Server instances in order to handle these requests.
  3. Figure 5.A: Player DB (called Audit DB in the dashboard) CPU jumps to more than 90%.
  4. Figure 5.F: The Buffer Overflow starts throttling the DB writes, and as a result, the message queue grows.
  5. Figure 5.B: RDS IOPS increase moderately, at a rate the DB server can handle.

It can be seen that right after the message queue grows (figure 5.F), the DB CPU starts to decrease (figure 5.A). The Buffer prevented the Player DB from reaching 100% CPU, and thus prevented a probable downtime. Thanks to the throttling, the Player DB successfully handled the "request attack" of 5,000 requests per second. Mission accomplished.

Summary

Scaling a system is not always trivial, but the process is usually the same: find the constraints and fit your solution accordingly.

In the case described in this article, we cared more about availability than consistency. In other words, keeping the database "alive" was more important than it being accurate in real time. If this is your case, this solution might fit your needs.

In describing our solution, we went over technologies like SQS for managing a message queue, and a buffer microservice responsible for controlling the flow of requests to a DB (or to another service).

I also described the dual role of CloudWatch, which helped us get live data regarding the DB's health, as well as store and supply the relevant data to our Grafana dashboard.

Using tools like Flood.io and Grafana, combined with metrics coming from CloudWatch, allowed us to test our architecture. In general, the Flood.io interface was easy to use, the tool worked great, and I highly recommend it.

I believe that the most important method in this article is how to test your design, and to try to improve it iteratively.

Thanks to Yonatan Kra and Minor Ben-David.

Appendixes

1. Some important configurations

Here are a few SQS, RDS, and general configurations that I find important to mention, to illustrate how hard it was to find the perfect fine-tuning that made this orchestra play perfectly. I'm not going to list our specific configuration values here, since they probably won't fit other cases; they worked only for our specific environment conditions.

SQS parameters:

emptyQueueWaitTimeMS: If the queue is empty, wait for X milliseconds before retrying.

visibilityTimeout: Lock a message for X seconds while it is being handled and hasn't finished being handled yet, so the same message won't be handled twice.

maxNumberOfMessages: Max number of messages per SQS request.

waitTimeSeconds: Performance is better when several messages are received together, so this is the time to wait to accumulate the maximum number of messages when there are not enough messages to handle yet.
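To make these parameters concrete, here is a rough sketch of how they might map onto an SQS receive call in the Buffer; the values are placeholders, not our production configuration:

```typescript
import { SQS } from "aws-sdk";

const sqs = new SQS({ region: "us-east-1" }); // assumed region

// Placeholder values; the real configuration is environment-specific.
const config = {
  emptyQueueWaitTimeMS: 1000, // wait before retrying when the queue is empty
  visibilityTimeout: 30,      // seconds a message stays locked while handled
  maxNumberOfMessages: 10,    // max messages per SQS receive request
  waitTimeSeconds: 5,         // long polling: wait to accumulate a fuller batch
};

export async function popMessages(queueUrl: string) {
  const result = await sqs
    .receiveMessage({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: config.maxNumberOfMessages,
      VisibilityTimeout: config.visibilityTimeout,
      WaitTimeSeconds: config.waitTimeSeconds,
    })
    .promise();

  if (!result.Messages || result.Messages.length === 0) {
    // Empty queue: wait before the next attempt (EmptyQueueThrottle behavior).
    await new Promise((resolve) => setTimeout(resolve, config.emptyQueueWaitTimeMS));
    return [];
  }
  return result.Messages;
}
```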

RDS and Buffer Overflow microservice parameters:

RDS pool connection: X

Buffer Overflow microservice pool connection: Y

Max Buffer Overflow microservice instances: fewer than X / Y, because if there are too many instances, RDS won't have enough connections for them.

2. CloudWatch SDK

To read data from CloudWatch and get live data about the DB machine in the Buffer, we used the AWS SDK for Node.js.

We specifically use AWS.CloudWatch.GetMetricDataOutput.
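For illustration, reading the DB machine's CPU utilization with getMetricData could look roughly like this sketch; the region and DB instance identifier are placeholders:

```typescript
import * as AWS from "aws-sdk";

const cloudwatch = new AWS.CloudWatch({ region: "us-east-1" }); // assumed region

// Fetch the average RDS CPU utilization over the last few minutes.
// "player-db" is a placeholder DB instance identifier.
export async function getRdsCpuUtilization(): Promise<number> {
  const now = new Date();
  const output: AWS.CloudWatch.GetMetricDataOutput = await cloudwatch
    .getMetricData({
      StartTime: new Date(now.getTime() - 5 * 60 * 1000),
      EndTime: now,
      MetricDataQueries: [
        {
          Id: "cpu",
          MetricStat: {
            Metric: {
              Namespace: "AWS/RDS",
              MetricName: "CPUUtilization",
              Dimensions: [{ Name: "DBInstanceIdentifier", Value: "player-db" }],
            },
            Period: 60,
            Stat: "Average",
          },
        },
      ],
    })
    .promise();

  const values = output.MetricDataResults?.[0]?.Values ?? [];
  // Results are returned newest-first by default; fall back to 0 if empty.
  return values.length > 0 ? values[0] : 0;
}
```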

