Our Journey to Aurora Serverless v2: Part 2

Expectations, hope and reality of running Aurora Serverless v2 in production

Jovan Katić
Prodigy Engineering
7 min read · May 21, 2024


Created by Author with DALL-E 3

This is Part 2 of Prodigy’s journey from provisioned to Aurora Serverless v2 DB instances, and it solely focuses on our production DB migration. If you are interested in our investigation and pre-production DB migration, you can read about it in Part 1.

We ended our pre-production migration on a high. There weren’t many challenging obstacles, and the path to production looked promising. However, one thing I like to keep in mind when things go smoothly in pre-production environments is that Magellan named the ocean Pacific (meaning ‘peaceful’) because it was calm and pleasant when he entered its waters.

Reality check

Production started well: the migration of all the Tier 2 database clusters went according to plan, with no negative impact on latency. We were still very optimistic going into the Tier 1 migrations, and the few issues that arose were cleared up quickly. Back-to-school season was nearing (which, for an EdTech company like Prodigy, is peak time), and we needed to make sure everything would operate smoothly.

After our first Tier 1 database migration, we discovered that production databases can operate in mysterious ways!

Things looked ok when comparing the readers on total query duration:

Source: Datadog DBM Query Metrics — First Tier 1 DB reader serverless instance (blue) operating faster than its provisioned predecessor (purple), by Author
Source: Datadog DBM Query Metrics — Comparing the readers in more detail: the first row is the old reader, the second the new one, by Author

However, the writer did not look good:

Source: Datadog DBM Query Metrics — First Tier 1 DB writer serverless instance (blue) running much slower than its provisioned predecessor (purple), by Author
Source: Datadog DBM Query Metrics — INSERT comparison in numbers: the first row is the new writer, the second the old one, by Author

To put it lightly, this was subpar performance. We kept waiting for the writer to scale up, but that never happened. The algorithm that governs the serverless instances apparently thought this was fine: after all, there were no errors, the CPU wasn't running hot, and there were no network spikes on the database, so why scale?

We decided to revert the change and try migrating a different DB cluster.

Source: Datadog DBM — Second Tier 1 DB writer serverless instance (blue) running slower than its provisioned predecessor (purple), by Author
Source: Datadog DBM — Comparison of INSERT queries: the first row is the new writer, the second the old one, by Author

Well, this cluster appeared to be slow, too. Needless to say, these latencies cascaded down to the services using these databases.

The only way we managed to keep latency down was to manually scale the serverless instances up. And not just a little: they needed to be scaled well above their provisioned counterparts to bring latency to a similar level. At that point, running the cluster on Aurora Serverless v2 was no longer financially responsible.

For comparison, running a serverless instance at the same scale as its provisioned counterpart costs much more. For example:

  • db.r5.2xlarge runs with 64 GiB of RAM and costs $0.504/hr
  • 32 ACU equivalent (32 * 2 GiB = 64 GiB) costs 32 * $0.12/hr = $3.84/hr
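
To put that gap in perspective, here is a minimal back-of-the-envelope sketch in Python, using only the prices quoted above (treat the numbers as illustrative; actual rates vary by region and change over time):

```python
# Rough cost comparison using the prices quoted above (illustrative only;
# actual rates vary by region and change over time).
PROVISIONED_HOURLY = 0.504   # db.r5.2xlarge (64 GiB RAM)
ACU_HOURLY = 0.12            # Aurora Serverless v2, per ACU (~2 GiB RAM)

def serverless_hourly(acus: float) -> float:
    """Hourly cost of a Serverless v2 instance running at a given ACU level."""
    return acus * ACU_HOURLY

# Matching the provisioned instance's 64 GiB of memory takes 32 ACUs.
print(f"32 ACU: ${serverless_hourly(32):.2f}/hr vs provisioned ${PROVISIONED_HOURLY:.3f}/hr")

# Break-even: serverless only wins if the *average* capacity stays below this.
print(f"Break-even capacity: {PROVISIONED_HOURLY / ACU_HOURLY:.1f} ACUs")
```

At the quoted rates, a serverless instance only comes out cheaper if it averages roughly 4 ACUs (about 8 GiB) or less over time, a long way below the 64 GiB provisioned box.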

On top of that, we have clusters that run with two, three, or four instances, and keeping them at such a large scale even when traffic is low is simply not acceptable.

The search for answers

After several attempts to figure things out ourselves, we turned to AWS Support. We were initially told that the buffer cache might be the issue and that it takes time to warm up; however, our data showed that the buffer cache warmed up fairly quickly after the cutover. Yes, smaller instances do have a smaller cache available, but regardless of instance size, populating it doesn't take hours; it takes minutes.

Source: Datadog DBM Query Metrics — Cutover from provisioned to serverless: cache hit ratio of 99.9% provisioned vs 99.7% serverless moments after the switch, by Author
Source: Datadog DBM Query Metrics — Cutover from provisioned to serverless: cache hit ratio of 96.7% provisioned vs 94.7% serverless an hour after the switch, by Author

We still had questions: why doesn't the scaling algorithm kick in once the buffer cache is warm and latency is clearly impacted? Scaling up would have helped, but it simply never happened.

Ultimately, we managed to talk to the developers who maintain and work on Serverless v2, and we got more information about how the algorithm works. In short, it looks at more factors than just CPU, memory, and network, but it does not factor in latency. Of course, we only got the gist of it, not the actual secret sauce that runs under the hood. Much to our surprise, the developers were also baffled by the performance we were experiencing. According to them, our traffic shape made us a perfect fit for Serverless v2: clear peaks and valleys, exactly what the serverless DB service was designed for.

In the end, we didn't get answers to the problems we were facing, and with back-to-school season fast approaching, we were still running in a hybrid state: some clusters ran entirely on serverless instances, others mixed serverless and provisioned instances, and some had to stay fully provisioned.

Our solution

We were not happy with how the system was running, and after so much time invested in this project, leaving it in a partially integrated state was unacceptable. We had to find a satisfactory solution.

As the old saying goes, “If you want something done right, you have to do it yourself.”

If the serverless clusters don’t want to scale on their own, we’ll make them scale on our terms. So, we wrote a cronjob scaler that would scale clusters multiple times a day throughout the week. Yes, we ended up writing a scaler for an autoscaling service…
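
We haven't shared the actual implementation, but conceptually the scaler is just a scheduled job that rewrites each cluster's Serverless v2 capacity range ahead of known traffic windows. A minimal sketch of the idea using Python and boto3 might look like this (the cluster names, windows, and ACU values are hypothetical, not our real configuration):

```python
import sys

import boto3

# Hypothetical schedule: raise the ACU floor ahead of peak hours and lower it
# again off-peak. In practice, a cronjob invokes this once per scaling window.
SCHEDULE = {
    "tier1-game-cluster": {"peak": (16.0, 128.0), "off_peak": (2.0, 128.0)},
    "tier2-reporting-cluster": {"peak": (4.0, 32.0), "off_peak": (1.0, 32.0)},
}

rds = boto3.client("rds")

def scale_cluster(cluster_id: str, min_acu: float, max_acu: float) -> None:
    """Update the Aurora Serverless v2 capacity range for one cluster."""
    rds.modify_db_cluster(
        DBClusterIdentifier=cluster_id,
        ServerlessV2ScalingConfiguration={
            "MinCapacity": min_acu,
            "MaxCapacity": max_acu,
        },
        ApplyImmediately=True,
    )

if __name__ == "__main__":
    # e.g. run `python scaler.py peak` from a cron schedule like `0 7 * * 1-5`
    window = sys.argv[1] if len(sys.argv) > 1 else "off_peak"
    for cluster_id, windows in SCHEDULE.items():
        min_acu, max_acu = windows[window]
        scale_cluster(cluster_id, min_acu, max_acu)
```

Raising MinCapacity before a peak forces the cluster onto a higher baseline even when the built-in algorithm sees nothing it considers worth scaling for, which was exactly the knob we were missing.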

This is how things looked at the start, without the scaling cronjob in place (and with not all clusters migrated to Serverless v2 yet):

Source: Datadog DBM — First iteration with a handful of clusters on Serverless v2, by Author

And this graph shows how it's going with all clusters migrated and the scaler cronjob in place:

Source: Datadog DBM — After the full serverless switch in production with RDS scaler cronjob, over the course of a few months, by Author

With our solution in place, we let the scaler cronjob do its thing, making minor adjustments when the holiday season started. The question is, did our hard work pay off in terms of the RDS budget?

Well, kind of:

Source: AWS Cost Explorer for RDS in Production environment after migration, by Author

The monthly spending graph does not have a clear “cliff” after we switched to Serverless v2 like we saw in our pre-production environments, but the monthly averages are still lower than what we saw in previous years.

Journey’s end

The road to where we are now was long and at times tedious, but insightful. As a team, we learned a lot about our databases and their quirks, and how to take better care of them in the future. During this migration, we also fully integrated Datadog's Database Monitoring, which helped us visualize issues that would otherwise have flown under the radar. But I digress.

The main point of this initiative was to have auto-scaling databases and reduce our RDS spending. So, how did we do?

  • Auto-scaling databases? Yes.
  • Reduced RDS spending? Yes*.

In terms of auto-scaling DBs, we achieved our goal, even though we had to implement our own solution in the production environment.

In terms of savings (the asterisk), the majority of them happened in our pre-production environments; you can read more about that in Part 1. Savings in production, where spending is an order of magnitude bigger than in all pre-production environments combined, were inconsistent: we would save one month, be on par with pre-migration spending the next, then go back to saving, and so on. However, the most significant savings in production came during the holiday season, and we expect to see significant savings during the summer break as well.

We remain optimistic about the future. We expect AWS engineers to keep improving the Serverless v2 scaling algorithm, which could eventually let us fully transition away from our in-house scaling solution, confident that our databases can adapt to traffic fluctuations seamlessly.

More importantly, we have (mostly) eliminated the manual scaling work, which used to take us a few days to do when teams requested it. Now, changing the scaling configuration for the few databases that are on scheduled scaling takes about half an hour.

Conclusion

At this point, you may be asking yourself: is it worth the effort to migrate off provisioned instances and onto Serverless v2? As always, the answer is: it depends.

This is a no-brainer for pre-production environments that don't get much traffic and are used for load testing applications. That means yes, by the way 😅

For production, it depends on how you answer these questions:

  • How often do you have to manually change your cluster size?
  • Are you on a multi-year reserved instance savings plan?
  • Does your application traffic have clear peaks and valleys?

If you are just starting out and don't see much traffic on the database side, you would probably benefit from switching to Serverless v2.

If you run a service without peaks and valleys, you are probably better off sticking to the provisioned instances and purchasing a reserved instance savings plan.

Whether you choose to migrate to Serverless v2 or not, I hope that reading this and my previous blog post will help you make the right choice for your use case!

In the end, our journey to running our DB clusters fully on Aurora Serverless v2 was not as difficult as Magellan’s voyage through the Pacific. Still, just as Magellan learned that the peaceful waters of the Pacific could conceal unforeseen hurdles, we, too, discovered that even a promising path could have its own hidden complexities. Yet, through teamwork and innovation, we managed to reach our destination!

Thanks for reading!
