Our Journey to Aurora Serverless v2: Part 1

Mastering Cost-Effective and Scalable Databases in Pre-Production Environments

Jovan Katić
Prodigy Engineering

In the tech world, it’s easy to lose sight of running our systems efficiently because we get caught up in delivering new features and keeping everything running smoothly. It was about time to find a way to run our database systems without spending a small fortune.

This article is part one of a two-part story about our journey to Aurora Serverless v2 (below, I will call it “Serverless” for short). I will review our decision-making process and the cost-efficiency and auto-scaling benefits we’ve achieved in our pre-production environments — the production migration warrants a story of its own (which I will go over in Part 2).

Why We Needed to $witch

The way we run our database clusters was set up years ago, when we did our big migration to RDS Aurora. These clusters were initially configured to provide a decent balance of performance and stability. We had wanted to modernize our database cluster management and use features like auto-scaling for years, and now the timing was just right.

It was that time of year when our reserved instance plans were about to expire and we were about to commit to another year of likely over-provisioned resources, so we decided to take another look at AWS’ Aurora Serverless offering.

Evaluation of Aurora Serverless v2

Before jumping into a massive initiative, we needed to decide if it was worth the effort.

Let’s start with a quick look at what Aurora Serverless v2 is.

Aurora Serverless v2 is an on-demand autoscaling configuration for Amazon Aurora. It helps to automate the processes of monitoring the workload and adjusting the capacity for your databases. Capacity is adjusted automatically based on application demand. You’re charged only for the resources that your DB clusters consume. [source]

That sounds enticing!

Prodigy is an education company, so our products and services are used primarily during school hours: we run hot for half of the day and see minimal traffic for the other half. A serverless solution sounded perfect for our use case.

Having the ability to run pre-production environments at the minimum provisioned size is great for obvious reasons — who actually likes burning money unnecessarily? Serverless would let us run lean in pre-production most of the time, while still scaling up to give us realistic database behaviour whenever we want to load test something.

Some additional selling points were:

  • Simpler capacity management than provisioned — at the start of the holidays or the end of the school year, we had to scale our RDS clusters down manually and then back up again. That meant the tedious process of scaling instances down one by one and failing over the writers before scaling them down, too. This maintenance usually had to be performed during off hours, putting additional strain on our team.
  • Better capacity planning for new applications — previously, when shipping a new application, we had to choose the capacity for its DB instances arbitrarily, and if our predictions were off, we had to go through the same tedious process of manually scaling the instances up or down.
  • Faster, more granular, less disruptive scaling than Aurora Serverless v1 — v1's limitations made us skip it altogether. We evaluated it when it came out, but it didn’t look like a production-ready service.

Discovery and Testing

While this offering piqued our interest, it almost sounded too good to be true. It was a perfect fit for our traffic shape and the way our services scale, and the cost benefits of running at a really low scale were too good to pass up. We spoke to the RDS team about this migration, and they confirmed that Serverless is meant for traffic patterns like ours. So, we started testing to see if it truly worked as advertised.

To run the tests in a somewhat real-world environment, we chose some of our critical RDS clusters, created clones, and crafted test queries for the applications that regularly interact with them.

One of the cloned clusters was running Aurora Postgres engine version 12.x on r4 instances, and we immediately noticed the first hurdles:

  • You can only run Serverless on RDS clusters running Aurora Postgres version 13.6 or higher
  • Aurora Postgres version 13 and higher does not support r4 instances
  • The cherry on top: our Terraform module was out of date and, as such, did not support Serverless

Great 😐

Add to the task list: upgrade the Terraform RDS module and ship it out to all RDS clusters, upgrade the Postgres engines to v13.6 or higher on all clusters, and replace the r4 instances with at least the r5 instance family.
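To scope that work, an AWS CLI query along these lines can list the instances still on the r4 family together with their engine versions (a rough sketch, not the exact command we used):

aws rds describe-db-instances \
  --query "DBInstances[?starts_with(DBInstanceClass, 'db.r4')].[DBInstanceIdentifier,DBInstanceClass,EngineVersion]" \
  --output table

A similar describe-db-clusters call with --query "DBClusters[].[DBClusterIdentifier,EngineVersion]" gives the engine version per cluster.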

Load tests were conducted using hey, a lightweight load testing program. We set the maximum ACU to 32 and the minimum to 0.5 on a DB clone for our auth application.

For those new to RDS Serverless, the unit of measure for Aurora Serverless v2 is the Aurora capacity unit (ACU). Aurora Serverless v2 capacity isn’t tied to the DB instance classes you use for provisioned clusters. Each ACU is a combination of approximately 2 gibibytes (GiB) of memory, corresponding CPU, and networking. [source]
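For reference, setting that scaling range comes down to a single call on the cluster. A rough AWS CLI equivalent of what we configured looks like this (the cluster identifier is made up):

aws rds modify-db-cluster \
  --db-cluster-identifier auth-db-clone \
  --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=32 \
  --apply-immediately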

We pointed our auth application to its Serverless DB clone cluster, crafted a query for the application, and sent tens of thousands of requests to see how the DB would react and how this would impact our app’s response times.

The hey command looked something like this:

hey -z 1m -c 450 -H "AUTH: $TOKEN" https://auth-domain.com/ep

This command translates to: for one minute (-z 1m), keep sending requests through 450 concurrent workers (-c 450), with a custom header, to the application endpoint. Note that when a duration is given with -z, hey ignores the request count (which otherwise defaults to 200 unless set with the -n flag) and simply runs until the time is up.

After execution, hey prints a summary of the run, and in this case it looked like this:

Summary:
Total: 60.1616 secs
Slowest: 2.4109 secs
Fastest: 0.0556 secs
Average: 0.1105 secs
Requests/sec: 4062.8607
Latency distribution:
90% in 0.1337 secs
95% in 0.1685 secs
99% in 0.2538 secs
Status code distribution:
[200] 244428 responses

With P90 and P95 well under 200ms, and P99 just above 250ms, during an unrealistically heavy load test equivalent to almost 250 thousand logins within one minute, the results looked extremely promising.

Serverless DB instances were scaling to accommodate the influx of requests, all the way from 0.5 ACU to 32 ACU — that’s from roughly a 1 GiB memory equivalent up to a 64 GiB equivalent in a matter of a minute!
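If you want to watch that scaling yourself, the ServerlessDatabaseCapacity CloudWatch metric tracks the ACUs in use. A query like this one (cluster name and timestamps are placeholders) shows the peak capacity per minute during a test window:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ServerlessDatabaseCapacity \
  --dimensions Name=DBClusterIdentifier,Value=auth-db-clone \
  --start-time 2023-05-01T14:00:00Z \
  --end-time 2023-05-01T14:10:00Z \
  --period 60 \
  --statistics Maximum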

The bottleneck appeared to be our pods on Kubernetes, as they started to crash before we could hit the DB cluster scaling limit. That was excellent news! Until now, the DBs were always the choke point, and we would have to scramble and add more reader instances or change their size if we hit an issue.

With tests complete, we decided to give this initiative the green light and jump into the deep end.

Preparation for the migration

As mentioned above, during the discovery phase, we unearthed a few rocks we needed to move before migrating to Serverless v2. We came up with the following game plan:

  • Make necessary changes to the Terraform module we use to support Serverless instances
  • Switch older r4 instances to r5 to support the Postgres engine upgrade
  • Upgrade the Postgres engine to the latest v13 available

Created by Author — Getting ready for Aurora Serverless v2

Overall, the module upgrade went well, given the vast amount of changes required. Our main priority during this phase was seamlessly introducing new features without downtime or disruption.

The instance replacement went smoothly. All we had to do was add a new instance of the supported family and remove the old one (if it was a reader) or failover (if it was the writer).
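In AWS CLI terms, that swap looked roughly like this for each instance (the identifiers are illustrative, not our real ones):

# Add an instance on the supported family
aws rds create-db-instance \
  --db-instance-identifier my-cluster-r5-1 \
  --db-cluster-identifier my-cluster \
  --db-instance-class db.r5.large \
  --engine aurora-postgresql
aws rds wait db-instance-available --db-instance-identifier my-cluster-r5-1

# If the old r4 instance was the writer, fail over to the new one first
aws rds failover-db-cluster \
  --db-cluster-identifier my-cluster \
  --target-db-instance-identifier my-cluster-r5-1

# Then remove the old r4 instance
aws rds delete-db-instance --db-instance-identifier my-cluster-r4-1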

With these changes in place, we were ready to do the disruptive part of the preparation — upgrade the Postgres engine version. This change requires the DBs to be offline for some time, which is fine in pre-production environments but not for production. In order to avoid disruption to our users in production, we planned a few off-hours maintenance windows.

To speed up the process, we wrote a script that helped us upgrade all pre-production DB clusters in one go. The script would take the list of RDS clusters and their configurations, take a pre-upgrade snapshot of each (always remember to take your backups!), then send an API call to AWS to start the engine upgrade, and finally, after all engine upgrades completed, run ANALYZE on the DB(s) so that performance does not take a hit.
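Stripped of error handling, the core of that script looked something like this for each cluster (the identifiers, target version, and connection details are placeholders, and the real script polled the upgrade status more carefully than a simple wait):

CLUSTER="some-pre-prod-cluster"
TARGET_VERSION="13.12"   # whatever the latest 13.x was at the time

# 1. Take a pre-upgrade snapshot
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier "$CLUSTER" \
  --db-cluster-snapshot-identifier "${CLUSTER}-pre-v13-upgrade"
aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier "${CLUSTER}-pre-v13-upgrade"

# 2. Kick off the major version upgrade
aws rds modify-db-cluster \
  --db-cluster-identifier "$CLUSTER" \
  --engine-version "$TARGET_VERSION" \
  --allow-major-version-upgrade \
  --apply-immediately

# 3. Once the cluster is back, refresh the planner statistics
aws rds wait db-cluster-available --db-cluster-identifier "$CLUSTER"
psql -h "$CLUSTER_ENDPOINT" -U "$DB_USER" -d "$DB_NAME" -c "ANALYZE;"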

If you forget to run ANALYZE, your databases will struggle to run the simplest of queries, and the CPU load will skyrocket, making the database so sluggish that users would be better off using carrier pigeons to log into the application than waiting for the database to respond.

ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries. [source]

For production, we were a bit more cautious, doing a few cluster upgrades at a time, but still managed to do everything in about a week’s time.

And now, the switch.

Switching our pre-production environments

The moment to start switching our database clusters in pre-production to Serverless finally came, and we couldn’t have been more excited.

We had fun writing another script so we could go through this process with our hands behind our heads as much as possible while the automation took care of everything.

The migration script would go through these steps (a simplified sketch follows the list):

  • It would first take an RDS cluster name and verify that the cluster exists
  • Then it would adjust the Serverless scaling configuration (a fancy way of saying it would set the min and max ACU to whatever we tell it to)
  • Next, it would confirm whether the writer was already configured as db.serverless — a sanity check, since we don’t have to do anything disruptive (a failover) if the writer is already serverless
  • It would then create a serverless instance for each existing instance in the cluster — if the cluster had a writer and three readers, it would create four new serverless instances before proceeding
  • After confirming that all the new instances were available, it would perform a failover to the first serverless instance
  • Finally, after the failover, it would ask for confirmation that removing the provisioned instances is safe
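Condensed into AWS CLI calls for a single instance, those steps look something like this (identifiers and the ACU range are illustrative; the real script looped over every instance and waited for confirmation before deleting anything):

CLUSTER="some-pre-prod-cluster"

# Verify the cluster exists
aws rds describe-db-clusters --db-cluster-identifier "$CLUSTER" > /dev/null

# Set the Serverless v2 scaling range
aws rds modify-db-cluster \
  --db-cluster-identifier "$CLUSTER" \
  --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=16 \
  --apply-immediately

# Create a db.serverless counterpart for an existing instance
aws rds create-db-instance \
  --db-instance-identifier "${CLUSTER}-serverless-1" \
  --db-cluster-identifier "$CLUSTER" \
  --db-instance-class db.serverless \
  --engine aurora-postgresql
aws rds wait db-instance-available --db-instance-identifier "${CLUSTER}-serverless-1"

# Fail over so a serverless instance becomes the writer
aws rds failover-db-cluster \
  --db-cluster-identifier "$CLUSTER" \
  --target-db-instance-identifier "${CLUSTER}-serverless-1"

# After confirmation, remove the old provisioned instance
aws rds delete-db-instance --db-instance-identifier "${CLUSTER}-provisioned-1"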

Wrap-up and next steps

Voilà! Mission complete! The pre-production environments were sequentially switched, with a one-week interval for monitoring and testing between each transition. With none of the teams complaining and no operational issues, we were done.

This was a very rewarding experience; it took us a bit to pave the road, but all the hard work paid off once the switch was complete. The cost savings were immediately visible after the switch:

Source: AWS Cost Explorer for RDS in Pre-production (Dev) environment after migration by Author
Source: AWS Cost Explorer for RDS in Pre-production (Staging) environment after migration by Author

The actual results were astonishing. We did expect some cost to be cut, but not this much; it blew our minds. With a cost reduction of about 70% in the dev environment and about 50% in staging, we could not have been more hyped about the upcoming migration in production.

Given that there were not many hurdles, we expected that migrating the production environment would be a walk in the park. However, regardless of our preparation, production migrations tend to be a tad more challenging. But that’s a story for Part 2, which will come soon, so stay tuned!

For now, let us celebrate our accomplishments. We’ve majorly cut the operating costs of two of our test environments and have everything ready for production, with plenty of time before the “Back To School” peak traffic season starts. We were on top of the world 🎉
