How we saved more than $1M with an in-house EBS auto-scaling mechanism


Eli Vaknin
skai engineering blog
6 min read · Dec 20, 2023


The Problem

We’ve recently migrated our decades-old monolith application, internally codenamed “KS”, from our own datacenter to AWS.

Each KS (of which there are hundreds) is composed of two EC2 servers: one for the database and the other for serving the front-end, with a total of 9 disks.

KS EC2 servers within our infrastructure diagram

While in the datacenter we had our own physical storage and could allocate it as needed at no additional cost, on AWS we pay more for better disk performance characteristics (IOPS and throughput).

When we started the migration project, the initial suggestion was to go with IO2 for our database and other performance-critical disks. Since IO2 is very pricey, we thought we would start with the cheaper GP3 while allocating the maximum IOPS allowed, but we knew this too would eventually be costly overkill for at least some of our servers. To mitigate this, we adjusted the values manually based on our monitoring system as soon as we started the rolling migration to the cloud. Soon enough we realized that the manual approach wouldn't scale: with around 550 instances at 9 disks each, that's roughly 4,950 disks to adjust.

One KS performance characteristics dashboard

It was clear an automated solution was required to optimize our disks’ performance characteristics.

Home-grown vs. off-the-shelf

We started by searching the market for an off-the-shelf solution to tune the IOPS and throughput automatically. There were a few players that started offering solutions for this problem, but they were in the early stages and didn’t meet our requirements, which are summarized below:

  • Dynamically setting the value — the system needs to set the values (up or down) based on the actual usage of each disk
  • No need to replicate (and then maintain) more EBS volumes, as some of the off-the-shelf tools do
  • Accuracy — the system had to be as accurate as possible, as AWS allows only one modification per EBS volume every 6 hours
  • Reliability — setting incorrect values can degrade system performance, which we cannot allow for our clients
  • Controlling the frequency of changes — as AWS allows only one modification every 6 hours, it was important for us to control the frequency of the changes: if we changed the values too frequently, we would not be able to increase disk size when we needed to
  • Tuning the sensitivity of the system — we wanted the ability to control the system manually (e.g. enforcing a lower IOPS/throughput limit) and to set thresholds limiting the number of configuration changes
  • Automated system — manual intervention should be as minimal as possible

After exhausting the search for a solution that provides all of the above, we decided to build our own solution.

Building the solution

In light of the requirements described above, we started to design our solution.

First, we had to decide how and where this system would run. One option was Jenkins, which we use as the orchestrator for many tasks in our organization. Still, we wanted the system to be faster and more responsive than a typical Jenkins job (with Jenkins' queues and delays), and as we're adopting Serverless solutions across the company, we decided to use AWS Lambdas to run this solution.

The Lambdas will be triggered as a chain, each one focusing on a simple task. Although Step Functions might fit this requirement, we chose to stay with pure async Lambda invocations as we already had robust automation for creating and maintaining Lambdas that supported everything we needed.
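
To make the hand-off concrete, here is a minimal sketch (assuming Python on Lambda with boto3) of how one Lambda in such a chain can pass its output to the next via an asynchronous invocation; the function name and payload shape are hypothetical:

```python
# Sketch of the hand-off between two Lambdas in the chain (names are hypothetical).
# Each Lambda does one small task, then asynchronously invokes the next one
# with InvocationType="Event", so the chain never blocks on downstream work.
import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # ... collect this step's data, e.g. current volume settings from AWS ...
    collected = {"volumes": event.get("volumes", []), "source": "aws"}

    # Fire-and-forget invocation of the next Lambda in the chain.
    lambda_client.invoke(
        FunctionName="ebs-autoscaler-calculate-values",  # hypothetical name
        InvocationType="Event",                          # async: returns immediately
        Payload=json.dumps(collected).encode("utf-8"),
    )
    return {"status": "dispatched", "volume_count": len(collected["volumes"])}
```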

Our KS instances in AWS are provisioned with Pulumi, and we already have IOPS/throughput settings for each device in the Pulumi stack. That means the new system would have to override those values whenever it runs, which would cause drift between the Pulumi state and the actual state in AWS.

To avoid such drift, we decided to mark these properties with ignore_changes. This solved another problem: since we're dealing with a sensitive system, we wanted to be able to set a hard lower limit per device, so the Pulumi stack setting now serves as our low limit for each volume.
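
For illustration, a Pulumi (Python) sketch of what such a volume definition might look like; the resource name, size, and values here are examples rather than our real stack:

```python
# Illustrative Pulumi (Python) snippet: the iops/throughput declared here act as the
# hard lower limit, while ignore_changes stops Pulumi from reverting values that the
# auto-scaling Lambdas set later. Names and numbers are examples, not our real stack.
import pulumi
import pulumi_aws as aws

db_data_volume = aws.ebs.Volume(
    "ks-db-data",
    availability_zone="us-east-1a",
    size=500,
    type="gp3",
    iops=4000,          # floor: the system never scales below this
    throughput=250,     # MiB/s floor
    opts=pulumi.ResourceOptions(
        # Pulumi keeps managing the volume but no longer diffs these two fields,
        # so runtime changes made by the Lambdas don't show up as drift.
        ignore_changes=["iops", "throughput"],
    ),
)
```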

The implementation of the chain of Lambdas can be seen in the following diagram:

  1. A few Lambdas collect all the current information from various sources (AWS, Datadog, and Pulumi)
  2. The next Lambda crunches all the data and calculates new values for each volume's IOPS and throughput. It wasn't a simple calculation, so we guarded it with maximal thresholds for increasing or decreasing the values. This helped us avoid hitting that once-in-6-hours change limit imposed by AWS
  3. The last Lambda sets the calculated values for the volumes that need to be changed in AWS and sends a metric to Datadog, providing visibility for changes made by this system (a simplified sketch of steps 2 and 3 appears after the diagram)
System Diagram
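
A simplified sketch of steps 2 and 3, assuming Python with boto3 and the Datadog DogStatsD client; the real calculation is more involved, and the names and thresholds here are purely illustrative:

```python
# Simplified sketch of the "calculate + apply" end of the chain. The real calculation
# is more involved; here we just clamp the target between the Pulumi floor and a
# maximum step size, then apply it. All names and thresholds are illustrative.
import boto3
from datadog import statsd  # assumes a reachable DogStatsD agent

ec2 = boto3.client("ec2")

MAX_STEP_UP = 2000    # never raise IOPS by more than this in one pass
MAX_STEP_DOWN = 1000  # never lower IOPS by more than this in one pass

def next_iops(observed_peak: int, current: int, floor: int) -> int:
    target = int(observed_peak * 1.2)            # 20% headroom over the observed peak
    target = max(target, floor)                  # respect the Pulumi-defined low limit
    target = min(target, current + MAX_STEP_UP)  # bounded increase
    target = max(target, current - MAX_STEP_DOWN)  # bounded decrease
    return target

def apply_change(volume_id: str, iops: int, throughput: int) -> None:
    # AWS allows roughly one modification per volume every 6 hours, so callers
    # should only reach this point when a change is really warranted.
    ec2.modify_volume(VolumeId=volume_id, Iops=iops, Throughput=throughput)
    statsd.gauge("ebs_autoscaler.iops_set", iops, tags=[f"volume_id:{volume_id}"])
```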

To help us detect and resolve issues in this new system, we’ve added a dashboard that visualizes the EBS volume changes that are triggered each day and the status of each change. We’re also monitoring Lambda executions and get alerted if any Lambda fails.

System’s Datadog dashboard

Recent improvements and future plans

The system has been running in production for 10 months now. One notable enhancement was supporting IO2 as well as GP3 — which we had to introduce once some specific instances outgrew the performance GP3 could provide. This required our system to become aware of the disk type, as these two types differ in how their IOPS and throughput values are set.
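
For example, a type-aware helper along these lines (hypothetical, for illustration only) would pass a throughput value only for GP3 volumes, since IO2 throughput scales with its provisioned IOPS and cannot be set directly:

```python
# Hypothetical type-aware helper: gp3 lets you set IOPS and throughput independently,
# while io2 only takes an IOPS value (its throughput scales with provisioned IOPS),
# so passing Throughput for an io2 volume would be rejected by the API.
def build_modify_args(volume_id: str, volume_type: str, iops: int, throughput: int) -> dict:
    args = {"VolumeId": volume_id, "Iops": iops}
    if volume_type == "gp3":
        args["Throughput"] = throughput
    return args

# Usage with boto3's EC2 client:
#   ec2.modify_volume(**build_modify_args("vol-0abc123", "gp3", 6000, 250))
```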

Along the way, we also found that we were relying on inaccurate data: we queried max reads and max writes in two separate queries and summed the two, producing an inflated max IOPS value. To solve this, we introduced a custom metric, sent from the instance itself, that samples read and write activity at the same moment.
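
A minimal sketch of what such an instance-side collector could look like, assuming Linux /proc/diskstats and the Datadog DogStatsD client; the metric and device names are placeholders:

```python
# Sketch of the instance-side collector: sample read+write completions from the same
# /proc/diskstats snapshot so the combined IOPS figure is measured at one point in
# time, instead of summing two independently-measured per-metric maxima.
import time
from datadog import statsd

DEVICE = "nvme1n1"  # placeholder device name

def read_ops(device: str) -> int:
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                # fields[3] = reads completed, fields[7] = writes completed
                return int(fields[3]) + int(fields[7])
    raise ValueError(f"device {device} not found")

def report_combined_iops(interval: float = 10.0) -> None:
    before = read_ops(DEVICE)
    time.sleep(interval)
    after = read_ops(DEVICE)
    statsd.gauge(
        "ks.ebs.combined_iops",               # placeholder metric name
        (after - before) / interval,
        tags=[f"device:{DEVICE}"],
    )
```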

In the future, we plan to migrate this solution to Step Functions so it will be more reliable and easier to manage.

Conclusions

The solution described above lets us automatically align our volumes’ settings via a Serverless system running a tunable algorithm that monitors all EBS performance characteristics in Datadog, and it meets all of our requirements.

As mentioned above, the original plan was to provision the maximum GP3 IOPS (16K) for our DB disks and one more disk on the frontend server, which we knew would be overkill and too costly. Using this system, we can optimize all of our disks and reduce the cost dramatically.

To demonstrate the savings, we’ve calculated the average IOPS on the relevant disks and performed two calculations on the AWS Cost Calculator — one with the original “static” 16K IOPS setting and one with the actual average IOPS — and found out that we’re saving $93K a month, or a whopping $1.116M annually.

AWS cost calculation (average IOPS of 4,785 vs. the static 16K, both at our average throughput of 138 MiB/s, for 1,662 volumes), resulting in ~$93K of savings monthly
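
As a rough back-of-the-envelope check, assuming the public us-east-1 GP3 rate of $0.005 per provisioned IOPS-month above the free 3,000 IOPS baseline (throughput cost is identical in both scenarios, so it cancels out):

```python
# Rough reproduction of the savings figure; assumes the public us-east-1 gp3 rate
# of $0.005 per provisioned IOPS-month above the free 3,000 IOPS baseline.
VOLUMES = 1662
FREE_IOPS = 3000
RATE = 0.005  # USD per provisioned IOPS-month (assumed gp3 rate)

static_cost = VOLUMES * (16000 - FREE_IOPS) * RATE  # ~$108,030 / month
tuned_cost = VOLUMES * (4785 - FREE_IOPS) * RATE    # ~$14,833 / month

print(f"monthly savings: ${static_cost - tuned_cost:,.0f}")         # ~$93,197
print(f"annual savings:  ${(static_cost - tuned_cost) * 12:,.0f}")  # ~$1.12M
```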

Those savings make this project a clear ROI-positive effort: the cost of development combined with the AWS Lambda execution costs is a fraction of the savings.
