How Wise reduced AWS RDS maintenance downtimes from 10 minutes to 100 milliseconds

Kristo Kuuskull · Wise Engineering · Dec 2, 2021 · 12 min read

Terms used in this article

  • Error window
    A time period during which the service returns errors because the database is unavailable. Used as a synonym for downtime.
  • Primary
    The database instance the service is currently using and that we are switching over from.
  • Secondary
    A database instance that will become the new primary after the switchover. A synonym for replica.
  • Split brain
    The service writes to both the primary and the secondary at the same time, resulting in data loss or many hours of manual work for the DBAs to fix.

The problem

At Wise we follow a microservices architecture: we have about 500 microservices, but also around 300 production databases. Half of those run on MariaDB and the other half on Postgres.

The decision to split all our data into 300 databases has been a really good one, because it allows us to use a simple primary ➡ replica topology on all of them without running into the soft and hard limits of MariaDB and Postgres. Databases with a single-node primary are still the most cost-efficient way to run database workloads, allowing us to keep our customers’ fees lower than with, say, “NewSQL” databases.

For the sake of simplicity and clarity, this article will only focus on MariaDB, though we have the same problems and a solution for Postgres as well.

Needless to say, with 300+ databases we need a platform that allows us to handle them as cattle, not pets. We have chosen Multi-AZ AWS RDS to be that platform. So far we have been satisfied with it, but as usual, everything has its problems.

One of those problems is that maintenance operations, for example rightsizing or upgrades, create error windows. Those can be much longer than our customers can tolerate: a major upgrade can easily create a 10-minute error window.

We try to keep microservices independent and have fallbacks, but not everything is perfect, so many direct interdependencies are still present. Our main critical customer-facing flow depends on 20+ microservices being available at the same time. This interdependence amplifies the maintenance issue greatly: if one database is down for 10 minutes, that critical flow is unavailable for 10 minutes. If all of those databases were upgraded at different times, those customers could end up seeing 200 minutes of errors.

Of course, we try to upgrade all those databases at the same time. But due to the limited number of DBAs, the need to coordinate with the service teams and the fact that AWS operations don’t execute at predictable times, this 20-database upgrade procedure usually takes 1–1.5 hours.

1.5 hours of downtime for customers, in 2021, as a global business, is just unacceptable and inexcusable.

So we put together a working group of SREs, as microservices specialists, and DBAs, as database experts, to figure out how to solve it. The group had a very clear goal: bring the downtime to under 1 second, without moving to more costly platforms.

You will see below why it was crucial to have both microservices and database experts in that working group: the solution is a cross-discipline one that no single expertise group would have come up with on its own.

But first, let’s look at how we have been doing the so-called “minimal downtime” major upgrades so far.

How we have done major upgrades so far

An RDS in-place major upgrade was out of the question, as it can take a very long time on larger databases, such as some of our 5 TB ones.

First we needed a new database instance to switch over to.

  1. Created an RDS read replica for the primary.
  2. Upgraded the read replica to a new major version.
  3. Warmed up the secondary by running OPTIMIZE on all tables, to get all S3 blocks (a long story) onto RDS disks.
  4. Waited until it fully caught up to the primary.

After that, the service team created a new version of the service, pointing all database connections to the replica.

A switch meeting was then scheduled, having service team members and DBAs present, to finally do the switch.

  1. Stop the service completely, i.e. all of its Kubernetes pods.
    This is needed because the AWS promotion stops all replication, so any further changes on the primary would not make it to the replica.
    Our services shut down gracefully, so this step usually takes 1 minute.
    The error window starts.
  2. DBAs wait until connections to the primary have disappeared and then lock the service users, for example by changing their passwords.
  3. DBAs execute the AWS RDS “promote replica” command.
    This typically takes around 5 minutes, sometimes 10 minutes or even more on very large databases. The main time sink is that RDS needs to create a new snapshot of the replica, on which point-in-time recovery will depend.
  4. Service team releases a new version pointing to the secondary.
    Spring Boot services can take some time to fully initialize, so this step can take 30 seconds to 1 minute as well.
    The error window ends.

Not ideal by far. So let’s look at what we came up with.

New switchover system

We realized that we needed to make the service smarter, as we cannot rely on releasing new code or configuration during the switchover.

We also realized that we can’t use AWS RDS replica promotion during the switchover itself, as it is simply too slow. So it became clear that we need another RDS instance, one not tightly coupled to the primary, ready before the switchover starts.

At the same time, we had to make sure that whatever we came up with kept the data protected. Any kind of split-brain scenario has to be avoided at all costs, to the point where multiple mistakes on both the database and the service side would be needed for it to even be possible.

Let’s look at the service side first.

Service side

We created a wrapper library around HikariCP that can switch the database URL from the primary to the secondary at an exact moment in time. But how does it know when that moment is?

For that, we have a custom metadata table in both the primary and the secondary, which contains the host the service should connect to. Initially, that value is of course the primary’s host.

The library constantly polls those tables, and as soon as it sees a new host there, it changes the database URL(s) and evicts all connections from the pool(s).

So, an extremely KISS system on the service side. The implementation of course has its complexities: covering all the different corner cases, applying safety measures, supporting tons of different Spring Boot versions and so on.
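
To make this concrete, here is a heavily simplified sketch of such a wrapper. It is not the actual tw-reliable-jdbc implementation; the table, schema and user names (switchover_metadata, connect_host, service_db, switchover_reader) and the polling interval are illustrative assumptions.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SwitchoverAwareDataSource {

  private final AtomicReference<HikariDataSource> pool = new AtomicReference<>();
  private final ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
  private volatile String currentHost;

  public SwitchoverAwareDataSource(String initialHost) {
    currentHost = initialHost;
    pool.set(createPool(initialHost));
    // Poll the routing metadata every 10 ms; in the real library the interval is configurable.
    poller.scheduleWithFixedDelay(this::checkForSwitchover, 0, 10, TimeUnit.MILLISECONDS);
  }

  public Connection getConnection() throws SQLException {
    return pool.get().getConnection();
  }

  private void checkForSwitchover() {
    // The metadata is read with a dedicated low-privilege user, so polling keeps working
    // even while the service users' privileges are being revoked during the switchover.
    try (Connection c = DriverManager.getConnection(
             "jdbc:mariadb://" + currentHost + ":3306/service_db", "switchover_reader", "secret");
         Statement s = c.createStatement();
         ResultSet rs = s.executeQuery("SELECT connect_host FROM switchover_metadata")) {
      if (rs.next()) {
        String newHost = rs.getString(1);
        if (!newHost.equals(currentHost)) {
          // Point new connections at the new host and drop every connection to the old one.
          HikariDataSource old = pool.getAndSet(createPool(newHost));
          currentHost = newHost;
          old.close();
        }
      }
    } catch (Exception e) {
      // The real library wraps this in retries and additional safety checks.
    }
  }

  private HikariDataSource createPool(String host) {
    HikariConfig config = new HikariConfig();
    config.setJdbcUrl("jdbc:mariadb://" + host + ":3306/service_db");
    config.setUsername("service_user"); // illustrative credentials
    config.setPassword("secret");
    return new HikariDataSource(config);
  }
}
```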

Now let’s check how to create a secondary database instance to switch over to.

Creating and preparing the secondary

There are many ways to do it, including:

  • Via classical mysqldump
  • RDS read replica
  • Launching a secondary from the primary’s latest snapshot and enabling replication from the GTID position given by “SELECT @@gtid_binlog_state”

We successfully tested all those variants, but prefer to use the “RDS read replica”, because this one usually has the smallest replication lag after launching.

So, first we create an RDS read replica and then prepare it for the switchover.

  1. We stop the replication on the secondary via “CALL mysql.rds_stop_replication”.
  2. We note down the secondary’s replicated binary log position with “SHOW SLAVE STATUS”, from “Master_Log_File” and “Exec_Master_Log_Pos”.
  3. We promote the secondary via the AWS “promote_read_replica” API.
    This step can take some time, but we need to wait for it to fully finish before continuing.
  4. We resume replication from where it left off.
    For that, we execute “CALL mysql.rds_set_external_master”, followed by “CALL mysql.rds_start_replication”.
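
A rough sketch of those four steps as a JDBC snippet. The host names, credentials and the wrapper class are illustrative, the AWS promotion call itself is left out, and the exact argument list of the RDS procedure should be checked against the RDS documentation.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class PrepareReplicaSketch {

  static void prepare(String replicaJdbcUrl) throws SQLException {
    try (Connection replica = DriverManager.getConnection(replicaJdbcUrl, "admin", "secret")) {

      // 1. Stop replication on the secondary.
      try (Statement s = replica.createStatement()) {
        s.execute("CALL mysql.rds_stop_replication");
      }

      // 2. Note down the replicated binlog position.
      String masterLogFile;
      long execMasterLogPos;
      try (Statement s = replica.createStatement();
           ResultSet rs = s.executeQuery("SHOW SLAVE STATUS")) {
        rs.next();
        masterLogFile = rs.getString("Master_Log_File");
        execMasterLogPos = rs.getLong("Exec_Master_Log_Pos");
      }

      // 3. Promote the read replica via the AWS "promote_read_replica" API (not shown here)
      //    and wait for the promotion to fully finish before continuing.

      // 4. Resume replication from the noted position, with the old primary as an
      //    "external" master.
      try (CallableStatement cs = replica.prepareCall(
          "{CALL mysql.rds_set_external_master(?, ?, ?, ?, ?, ?, ?)}")) {
        cs.setString(1, "primary.example.internal"); // primary host (illustrative)
        cs.setInt(2, 3306);                          // port
        cs.setString(3, "repl_user");                // replication user (illustrative)
        cs.setString(4, "repl_password");            // replication password (illustrative)
        cs.setString(5, masterLogFile);              // binlog file from step 2
        cs.setLong(6, execMasterLogPos);             // binlog position from step 2
        cs.setInt(7, 0);                             // ssl_encryption flag
        cs.execute();
      }
      try (Statement s = replica.createStatement()) {
        s.execute("CALL mysql.rds_start_replication");
      }
    }
  }
}
```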

At this point, we have two independent database instances using classic native MariaDB replication: the primary and the secondary.

We can now do any necessary maintenance operations on the secondary, for example a major version upgrade. MariaDB supports replication between major versions, so the data flow will happily continue.

Behind the scenes, when we create an RDS read replica, it is essentially launched from an EBS snapshot that uses the S3 lazy loading system, meaning that not all blocks are on the database disk yet, i.e. the database is cold. Switching over to it in that state would create huge latency spikes and result in errors for customers.

We want to make sure that all the blocks are on disk before we actually switch over. For that we run a “COUNT(*)” or an “OPTIMIZE” on every table.
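
A minimal sketch of that warm-up loop, assuming a JDBC connection to the secondary and an illustrative schema name ‘service_db’:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class WarmupSketch {

  static void warmUp(Connection secondary) throws SQLException {
    // Collect all tables of the service schema.
    List<String> tables = new ArrayList<>();
    try (Statement s = secondary.createStatement();
         ResultSet rs = s.executeQuery(
             "SELECT table_name FROM information_schema.tables"
                 + " WHERE table_schema = 'service_db'")) {
      while (rs.next()) {
        tables.add(rs.getString(1));
      }
    }
    // Read every table once so its lazily loaded blocks end up on the RDS disk.
    try (Statement s = secondary.createStatement()) {
      for (String table : tables) {
        // A full COUNT(*) scan pulls all pages in; OPTIMIZE TABLE works as well and
        // additionally rebuilds the table.
        s.executeQuery("SELECT COUNT(*) FROM service_db.`" + table + "`").close();
      }
    }
  }
}
```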

Now we make sure that the routing metadata system is set up on the database side: we create the custom metadata table and a separate user that can SELECT from it. The connection host value in that table points to the primary.

It is very important to avoid any split-brain scenario, where the service writes to both the primary and the secondary at the same time.

So, before the service side enables the switchover system, we need to protect the secondary against changes as follows.

  1. Revoke all privileges from all service users.
  2. Set the database as read-only.
    This mostly guards against human error, where a DBA accidentally performs modifications on the wrong database instance before the switchover actually happens.
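
A sketch of both the metadata setup and the first protection step, with the same illustrative names as in the earlier snippets (the metadata table and reader user also have to exist on the primary). The read-only flag itself is flipped through the RDS parameter group, not through SQL.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class ProtectSecondarySketch {

  static void setUpMetadataAndProtect(Connection secondary, String primaryHost) throws SQLException {
    try (Statement s = secondary.createStatement()) {
      // Routing metadata that the service-side library polls; initially it points to the primary.
      s.execute("CREATE TABLE IF NOT EXISTS switchover_metadata"
          + " (connect_host VARCHAR(255) NOT NULL)");
      s.execute("INSERT INTO switchover_metadata (connect_host) VALUES ('" + primaryHost + "')");

      // A dedicated user that is only allowed to read the metadata table.
      s.execute("CREATE USER 'switchover_reader'@'%' IDENTIFIED BY 'secret'");
      s.execute("GRANT SELECT ON service_db.switchover_metadata TO 'switchover_reader'@'%'");

      // Protection step 1: the service users lose all privileges on the secondary until
      // the switchover explicitly grants them back.
      s.execute("REVOKE ALL PRIVILEGES ON service_db.* FROM 'service_user'@'%'");

      // Protection step 2, setting read_only=1, is done via the RDS parameter group.
    }
  }
}
```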

When the secondary has been prepared and the service side has the library incorporated, we are ready for the switchover.

Of course, all of the steps to create and prepare the secondary have to be fully automated: some of them take a lot of time, and otherwise human errors are possible. For example, we can’t have a DBA copy-pasting the wrong binlog position when resuming replication after the promotion.

The switchover

At this point, we have:

  • prepared and protected the secondary
  • a service running the special library, with the switchover system enabled

The switchover happens via an automated script whose main steps are as follows.

First, the script validates the state; we need to be absolutely sure that the switchover will succeed. About 40 different validation rules are covered, and we will most likely add more over time. For example:

  • Is the secondary closed for writes?
  • Is replication ok?
  • Is replication lag close to zero?
  • Is the service side working properly? Do we see it polling the metadata table?
  • Are there any unexpected users or sessions in either database?
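
A couple of these rules, sketched against the secondary with plain JDBC; the real script checks far more, including the primary and the service side, and the threshold values here are illustrative.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class SwitchoverValidationSketch {

  static void validate(Connection secondary) throws SQLException {
    // Is the secondary still closed for writes?
    try (Statement s = secondary.createStatement();
         ResultSet rs = s.executeQuery("SELECT @@read_only")) {
      rs.next();
      check(rs.getInt(1) == 1, "secondary must still be read-only before the switchover starts");
    }

    // Is replication healthy and the lag close to zero?
    try (Statement s = secondary.createStatement();
         ResultSet rs = s.executeQuery("SHOW SLAVE STATUS")) {
      check(rs.next(), "secondary is not configured as a replica at all");
      check("Yes".equals(rs.getString("Slave_IO_Running"))
          && "Yes".equals(rs.getString("Slave_SQL_Running")), "replication threads are not running");
      check(rs.getLong("Seconds_Behind_Master") <= 1, "replication lag is too high");
    }
  }

  static void check(boolean ok, String message) {
    if (!ok) {
      throw new IllegalStateException("Switchover validation failed: " + message);
    }
  }
}
```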

If all is fine, then the actual fun starts.

  1. Make the secondary crash-safe and open it for writes in general.
    Often, to let the secondary catch up faster, we set ‘sync_binlog=0’ and ‘innodb_flush_log_at_trx_commit=0’ on it, but of course we cannot risk switching over like that.

    So the script makes an RDS ‘modify_db_parameter_group’ API call setting ‘read_only=0, sync_binlog=1, innodb_flush_log_at_trx_commit=1’.

    The ‘modify_db_parameter_group’ API call can take a long time to execute, sometimes even 50 seconds. But fortunately it does not start the error window yet.

    Notice that even though we turn off the read-only flag, we are still protected against split brain, because the service users have no privileges on the secondary.
  2. Waitloop until we see on the secondary that ‘read_only=0, sync_binlog=1, innodb_flush_log_at_trx_commit=1’.
  3. Revoke all grants from the service users on the primary.

    Here the error window starts.
  4. Kill all of the services’ connections on the primary.
    Database user privilege changes do not apply to existing connections, so in order to stop all writes to the primary, we need to kill the existing connections.
    We run “CALL mysql.rds_kill” for every service connection found in the ‘information_schema.processlist’ view. Some of our databases have hundreds of connections, so to fight latency we do this over multiple threads and connections (see the sketch after this list).
  5. Waitloop until we see no more services’ sessions in ‘information_schema.processlist’ view.
    Just in case, to protect against bugs or changes in future MariaDB versions, we also wait until the ‘information_schema.innodb_trx’ view is empty.
  6. Waitloop until the secondary has fully caught up with the primary.
    For that we query the primary’s last applied GTID and check whether it has been applied on the secondary as well.
    And just in case, we also check the replication lag in seconds and the overall replication state on the secondary.
  7. On the secondary, grant the service users all of the privileges we revoked on the primary in step 3.

    From this point on we risk getting writes to the secondary and can no longer roll back to the primary.
  8. Change the connection host value in the secondary’s metadata table to the secondary’s host.
  9. Change the connection host value in the primary’s metadata table to point to the secondary host.

    Here the service side, polling that table, sees that it needs to point all connections to the new host, and that is what it does, usually within a 10 ms window; the polling can be configured to be even more aggressive.

    The error window now ends.

    We can stop here, but we execute some additional safety measures.
  10. Stop replication on the secondary.
    We could have done this earlier, but it is a somewhat slow command and we neither wanted nor needed to prolong the error window with it.
  11. Set the primary as read-only.
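
To make the core of this sequence more concrete, here is a rough sketch of steps 4, 5, 6, 8 and 9 using plain JDBC. The ~40 validations, the grant handling and the AWS parameter group call are left out, and names such as ‘service_user’ and ‘switchover_metadata’, as well as the small helper methods, are illustrative assumptions.

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Supplier;

class SwitchoverSketch {

  static void switchover(Connection primary, Connection secondary,
                         String primaryJdbcUrl, String secondaryHost) throws Exception {
    // Step 4: kill all service sessions on the primary, in parallel to fight latency.
    List<Long> sessionIds = new ArrayList<>();
    try (Statement s = primary.createStatement();
         ResultSet rs = s.executeQuery(
             "SELECT id FROM information_schema.processlist WHERE user = 'service_user'")) {
      while (rs.next()) {
        sessionIds.add(rs.getLong(1));
      }
    }
    ExecutorService killers = Executors.newFixedThreadPool(16);
    List<Future<?>> kills = new ArrayList<>();
    for (long id : sessionIds) {
      kills.add(killers.submit(() -> {
        try (Connection c = DriverManager.getConnection(primaryJdbcUrl, "admin", "secret");
             Statement ks = c.createStatement()) {
          ks.execute("CALL mysql.rds_kill(" + id + ")");
        } catch (SQLException ignored) {
          // The session may already be gone; the waitloop below is the source of truth.
        }
      }));
    }
    for (Future<?> kill : kills) {
      kill.get();
    }
    killers.shutdown();

    // Step 5: wait until no service sessions remain on the primary.
    waitUntil(() -> queryLong(primary,
        "SELECT count(*) FROM information_schema.processlist WHERE user = 'service_user'") == 0);

    // Step 6: wait until the secondary has applied the primary's last GTID position.
    String gtid = queryString(primary, "SELECT @@gtid_current_pos");
    // MASTER_GTID_WAIT returns 0 once the given GTID position has been reached.
    waitUntil(() -> queryLong(secondary, "SELECT MASTER_GTID_WAIT('" + gtid + "', 1)") == 0);

    // Steps 8 and 9: flip the connection host in both metadata tables; the service-side
    // library sees this on its next poll and swaps its connection pools.
    try (Statement s = secondary.createStatement()) {
      s.execute("UPDATE switchover_metadata SET connect_host = '" + secondaryHost + "'");
    }
    try (Statement s = primary.createStatement()) {
      s.execute("UPDATE switchover_metadata SET connect_host = '" + secondaryHost + "'");
    }
  }

  static void waitUntil(Supplier<Boolean> condition) throws InterruptedException {
    while (!condition.get()) {
      Thread.sleep(10);
    }
  }

  static long queryLong(Connection c, String sql) {
    try (Statement s = c.createStatement(); ResultSet rs = s.executeQuery(sql)) {
      rs.next();
      return rs.getLong(1);
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }

  static String queryString(Connection c, String sql) {
    try (Statement s = c.createStatement(); ResultSet rs = s.executeQuery(sql)) {
      rs.next();
      return rs.getString(1);
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }
}
```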

Now we could also stop the primary, but we do have some other related things like ETL switchover to take care of.

With this system, even with this many steps, we are seeing the error window average “only” 100 ms. Sometimes it is even as short as 50 ms.

An error window this short already opens up other possibilities, like making the service fully capable of handling it without returning any errors.

For example, Java’s garbage collection often creates 100 ms stop-the-world pauses, and no error window is created because of that.

The error window length depends mainly on the following factors.

  • How write-heavy the primary is, i.e. how large the replication lag spikes it can generate are.

    Really large transactions can also have a negative effect, but at Wise we try hard to keep those small for customer-facing services.

    For some databases, just to mitigate risks, it makes sense to do the switchover at a quiet time of the week.
  • How many connections we have to kill.
    We have to send each process’s kill command over the wire as a separate statement, so this is mainly affected by the network latency between our switchover script and the primary database.

    For this reason, it is beneficial to run the switchover script in the same availability zone as the primary database.
  • We hope that AWS implements support for MariaDB’s “KILL USER <user>” statement, which would allow killing all of a service user’s connections with a single wire call. For example, AWS could implement a ‘mysql.rds_kill_user(user)’ procedure next to the already existing ‘mysql.rds_kill(thread_id)’ one.

New opportunities

Lowering the error window to 100 ms or below is definitely a big level-up for our databases and opens up some nice opportunities.

  • Less disruptive upgrades.
    For both major and minor versions.
  • More rightsizings.
    As instance type changes have so far created painful 1–4 minute error windows, we have been forced to over-provision our databases’ compute quite a lot, so that they have enough buffer for a year’s growth and any other unexpected scenarios.
    Often we pay 2–4 times more for database compute than we should.
    By eliminating the need for such large over-provisioning, we will be able to reflect that in reduced fees for our customers.
  • Less need for working on weekends or late nights, especially for DBAs.
  • No need for service teams to be present during the switchovers.
    It saves a lot of working hours. It is not even about the time spent during the switchover itself, but mostly about preparing and organizing everyone to be present at that time.
  • Leave all those 1–1.5 hour site-wide maintenance downtimes in the past.
    We already cancelled a full-site maintenance that was planned for the 28th of November, as we now have the new system available.

Future improvements

In the future, we would like to make those switchovers even less disruptive.

For that, we would like AWS to implement an RDS procedure for changing the MariaDB ‘read_only’ flag instantly. Then, instead of revoking the service users’ privileges and killing their connections, we could just set the primary to read-only mode and be done with it.

The MariaDB read-only flag applies instantly, even to ongoing transactions, which are safely rolled back. Currently the read-only flag can only be flipped through an AWS RDS parameter group change, which sometimes takes as long as 50 seconds.

Of course, if we allow read-only workloads on the primary during the switchover, the service has to tolerate some read-after-write inconsistencies, but in many cases this can be achieved.

Why service libraries and not proxies?

We already had lots of magic on the service side via our ‘tw-reliable-jdbc’ library, which creates a protection and fault-tolerance layer on top of databases. For example, it sets a dynamic query timeout for each query based on an end-to-end deadline.
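
For a flavour of the kind of logic that library carries, a deadline-based query timeout could look roughly like this; a sketch of the idea, not the actual tw-reliable-jdbc code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import java.time.Instant;

class DeadlineAwareQueries {

  static ResultSet query(Connection connection, String sql, Instant deadline) throws SQLException {
    long millisLeft = Duration.between(Instant.now(), deadline).toMillis();
    if (millisLeft <= 0) {
      throw new IllegalStateException("End-to-end deadline already exceeded, not querying the database");
    }
    PreparedStatement statement = connection.prepareStatement(sql);
    // JDBC query timeouts are whole seconds; never let a short remaining deadline
    // round down to 0, which would mean "no timeout".
    int timeoutSeconds = (int) Math.max(1, Math.min(millisLeft / 1000, Integer.MAX_VALUE));
    statement.setQueryTimeout(timeoutSeconds);
    return statement.executeQuery();
  }
}
```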

It was trivial to add the switchover logic to it as well and get it upgraded in the services that needed a switchover.

Also, our current understanding is that a service-side solution minimizes the error window and allows more control, with enough room for smarter logic in the future to make switchovers have no impact at all.

But we have been talking about trying out a proxy solution in the future, for example using Envoy or writing something of our own. It is clear that having the whole switchover logic outside of the service has some appealing benefits as well.

For comparison, some of our largest databases, which run outside of RDS on EC2, use an Envoy + Orchestrator setup, and in those cases we see 5–10 second error windows.

Summary

Wise can now do database major upgrades and other maintenance operations within a 100 ms error-window, down from 10 minutes.

Overall, working out this system has been a lot of fun and has clearly shown us why interdisciplinary coordination is so important.
