Service Resilience — part 2: The (Zero Downtime) Database Switch

Martina Alilovic Rojnic
ReversingLabs Engineering
8 min read · Jan 23, 2023

The challenge

“Spectacular achievement is always preceded by unspectacular preparation.” — Robert H. Schuller

400 services, 100 different tables, and more than 300 TB of data need to be migrated to a completely different database. As we discussed in a previous post, we need to migrate from our in-house key-value store to ScyllaDB with enterprise support. And we have to do it with no downtime and no data loss. Services using multiple tables for reading, writing, deleting, and even running transactions need to go through a fast and seamless switch. And time isn't our friend: the risk of maintaining an in-house database rises as our data grows, and it grows every day.

The preparation for the switch lasted almost a whole year. We spent a lot of our time tuning the new database for performance. There were many issues along the way: a failed node here and there, a couple of disk outages, cluster crashes… you name it. It took time to master ScyllaDB, as it would for any new technology meant to replace the core of our cloud system. However, these are issues we would eventually have encountered with any database as our load grew; the migration just forced us to face them all at once.

One year still doesn't seem like much, considering the amount of data and the number of services involved. So how did we do it? By focusing only on the things we absolutely needed: simplicity first. Sometimes, the hardest part is ignoring the shiny features and possibilities and focusing on the must-haves.

Services migration

“The ability to simplify means to eliminate the unnecessary so that the necessary may speak.” — Hans Hofmann

With the services migration, we got lucky, or better said, doing things the right way at the beginning paid off. Changing the database access code of more than 400 microservices would probably have taken a couple of years to complete, since, unfortunately, ScyllaDB didn't support our in-house database client.

Luckily, years ago, when the first backend services were written at ReversingLabs and the whole system was designed, the team made solid, forward-thinking choices:

  1. code for direct communication with the database was a part of a separate library with a simple, generic interface for the services
  2. details of database models were a part of a separate library with a simple interface for the services
  3. allowed queries were restricted to a small set: put, get, get_range, and delete

What that meant was that the code for fetching and writing data inside the service was database agnostic. Furthermore, all the details about the model logic were hidden behind the scenes, in a separate library.
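
To make that concrete, here is a minimal sketch, with hypothetical names rather than our actual library, of what such a database-agnostic interface can look like. Services only ever call these four operations, so the implementation behind them can be swapped without touching service code.

```python
# Hypothetical interface sketch; not the actual ReversingLabs library.
from abc import ABC, abstractmethod
from typing import Iterator, Optional, Tuple


class KeyValueStore(ABC):
    """The only database surface a service ever sees."""

    @abstractmethod
    def put(self, collection: str, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, collection: str, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def get_range(
        self, collection: str, key: bytes, page_start: bytes = b""
    ) -> Iterator[Tuple[bytes, bytes]]: ...

    @abstractmethod
    def delete(self, collection: str, key: bytes) -> None: ...


# Both the old in-house client and the new ScyllaDB client implement this
# interface; which one a service gets is decided purely by configuration.
```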

We realized that if we played this smartly, all we had to do to prepare the services for the switch was upgrade the libraries and make a couple of configuration changes (like changing the connection string). The code in the services didn't have to be changed at all! What a relief! And the migration procedure was so simple.

When working with a huge number of microservices, separating the technology from the code is truly the best idea you can have. This is just one of the examples where it saved us a bunch of time and repetitive work.

Data model migration

“The only way to go fast, is to go well.” — Uncle Bob

Changing the code in the library while keeping the model optimal was the main challenge. Can it even be done?

To migrate the models, we needed to get to know the new database inside and out. The final solution was shaped by the many tests we had to run.

We were determined to stick with the simplicity-first principle. We couldn't afford to unpack the key-value model into the columns ScyllaDB provides out of the box for each of our services. We kept our values packed into Google protobufs and our keys as they were, with the same generic names in all databases. We consciously decided to use the feature-rich database as a key-value store and to postpone optimization until after the switch, if necessary. And if not, to keep it like this forever.
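
A small sketch of that packing convention, with made-up helper names and message types coming from the collection's model library: values travel as serialized Google protobufs and are stored as opaque blobs, so no protobuf field ever becomes a ScyllaDB column.

```python
# Hypothetical helpers illustrating the "keep values packed" decision.
from google.protobuf.message import Message


def pack_value(msg: Message) -> bytes:
    # The same serialized form the old database stored; nothing is unpacked.
    return msg.SerializeToString()


def unpack_value(raw: bytes, message_cls: type) -> Message:
    # The caller (the model library) knows which message class to expect.
    msg = message_cls()
    msg.ParseFromString(raw)
    return msg
```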

Our old database was a key-value store that split all the data into a fixed number of enormous partitions based on a key. Of course, we first tested the same model in Scylla (simplicity-first principle!). While performing tests, we quickly realized that our new database was not very keen on large partitions. When a partition exceeded a certain size threshold, data retrieval would become awfully slow and clusters would start acting funny, failing nodes and all. Regardless of that, we still wanted to find a way to migrate the models so that queries in services would remain the same.

There are two main types of collections in our system that we had to worry about.

The first type is a collection meant to be read with a single key, retrieving a single value. As you can imagine, migrating this kind of data access was pretty simple. Instead of the old partitioning scheme with a fixed number of partitions, we made the whole key the partition key. That way every key was easily retrievable, and it worked like a charm.
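
A sketch of what that looks like, assuming a hypothetical table, keyspace, and node address with the generic column names described above:

```python
# Hypothetical single-key collection: the whole key is the partition key.
from typing import Optional

from cassandra.cluster import Cluster  # ScyllaDB is compatible with the Cassandra Python driver

session = Cluster(["scylla-node1"]).connect("storage")

session.execute("""
    CREATE TABLE IF NOT EXISTS kv_collection (
        key   blob,
        value blob,
        PRIMARY KEY (key)
    )
""")

get_stmt = session.prepare("SELECT value FROM kv_collection WHERE key = ?")


def get(key: bytes) -> Optional[bytes]:
    # A single-key read is always a single-partition read.
    row = session.execute(get_stmt, [key]).one()
    return row.value if row is not None else None
```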

The second type is collections meant to be read in ranges, using a key and a page start. This is where the partition size problem kicked in.

Some of these collections had keys with a lot of entries under them. If we had used just the key as the partition key, we would have exceeded the calculated stability threshold for partition size. Even though the problem existed only for some collections, we still wanted a generic interface for all collections of this type. So we introduced additional sharding to all of them.

What we did was split what used to be a single primary key into three parts: the main key, the sharding key, and the clustering key. The idea was to partition the data using both the main key and the sharding key. The main key is the key used for querying the data, while the clustering key is used for paging and sorting it. The sharding key let us partition the data further, using the first part of the clustering key. How big a part? That was decided per collection, as part of its model. For some collections, it was zero.
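
Here is a sketch of that layout, again with a hypothetical table and a made-up shard prefix length; the real per-collection settings are part of each model.

```python
# Hypothetical range collection: partitioned by (main_key, shard_key),
# paged and sorted by clustering_key.
from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect("storage")

session.execute("""
    CREATE TABLE IF NOT EXISTS range_collection (
        main_key       blob,
        shard_key      text,
        clustering_key blob,
        value          blob,
        PRIMARY KEY ((main_key, shard_key), clustering_key)
    )
""")

SHARD_PREFIX_LEN = 2  # per-collection model setting; 0 means no extra sharding


def shard_of(clustering_key: bytes) -> str:
    # The shard is derived from the first few bytes of the clustering key.
    return clustering_key[:SHARD_PREFIX_LEN].hex()


# A range read (get_range) with a page start stays a single-partition query;
# continuing past the current shard means querying the next shard value (omitted here).
page_stmt = session.prepare("""
    SELECT clustering_key, value FROM range_collection
    WHERE main_key = ? AND shard_key = ? AND clustering_key >= ?
""")


def get_range_page(main_key: bytes, page_start: bytes, limit: int = 1000):
    bound = page_stmt.bind([main_key, shard_of(page_start), page_start])
    bound.fetch_size = limit
    return session.execute(bound)
```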

Once the models were defined, we used them to create the collections in our new database, and we were ready for the data migration.

Data migration

“The only way to eat an elephant is one bite at a time.” — Desmond Tutu

Some of the collections we had to migrate were huge. Dumping and loading them would take a couple of days, and we were aiming for zero downtime. Additionally, the order of the streamed data had to be preserved to keep it consistent. So we built a pipeline for data migration using Kafka as a helper streaming service.

We created a service that tracked all database updates and continuously sent them to Kafka. Collection by collection, the procedure was the following:

  1. we started the service for streaming collection updates to Kafka (Kafka writer)
  2. we dumped collection data from the old database and loaded it to the new one
  3. we started writers that read data from Kafka and wrote it to ScyllaDB (the ScyllaDB writer; see the sketch below)
  4. we waited until accumulated updates were all consumed by our writer service (new updates were still continuously arriving)
  5. we switched the services that read from the collection to the new DB following the procedure described above (in the Services migration section), seamlessly, with no downtime, and no data loss

Eventually, all data and all services that were just reading the data were migrated one by one.
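
For illustration, here is a minimal sketch of a ScyllaDB writer like the one in step 3, assuming a hypothetical topic name, table, and update format (the message key is the row key; an empty value means delete). Ordering is preserved as long as updates for the same key land on the same Kafka partition.

```python
# Hypothetical ScyllaDB writer: replays collection updates from Kafka into
# ScyllaDB in the order they were produced.
from cassandra.cluster import Cluster
from kafka import KafkaConsumer

session = Cluster(["scylla-node1"]).connect("storage")
put_stmt = session.prepare("INSERT INTO kv_collection (key, value) VALUES (?, ?)")
del_stmt = session.prepare("DELETE FROM kv_collection WHERE key = ?")

consumer = KafkaConsumer(
    "kv_collection.updates",            # hypothetical update topic
    bootstrap_servers=["kafka:9092"],
    group_id="scylla-writer",
    enable_auto_commit=False,           # commit only after the write succeeds
)

for msg in consumer:
    if msg.value:                       # new or updated value
        session.execute(put_stmt, [msg.key, msg.value])
    else:                               # empty value marks a delete
        session.execute(del_stmt, [msg.key])
    consumer.commit()
```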

Services that write data, or both write and read it, also had to go through a very simple migration procedure. But since the services in our microservice architecture are heavily intertwined, their migration had to be done all at once.

All the needed preparation was finished, and the migration date was set.

First, we stopped the services at the beginning of the pipeline. Then, we waited and monitored. For each service down the pipeline, we waited for its input data queue to empty as the service consumed the data. We also had to wait for the last Kafka update to be picked up by our ScyllaDB writer service. After that, we would redeploy the service with the new database and model libraries, writing to ScyllaDB instead of our old database.
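
A rough sketch of that "has the writer caught up?" check, assuming the kafka-python client and hypothetical topic and consumer-group names: compare the writer group's committed offsets with the end offsets of the update topic and wait for the lag to reach zero.

```python
# Hypothetical lag check for the ScyllaDB writer's consumer group.
from kafka import KafkaConsumer, TopicPartition


def total_lag(topic: str, group_id: str) -> int:
    consumer = KafkaConsumer(bootstrap_servers=["kafka:9092"], group_id=group_id)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += end_offsets[tp] - committed
    consumer.close()
    return lag


if total_lag("kv_collection.updates", "scylla-writer") == 0:
    print("writer has caught up; safe to switch the next service")
```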

The whole migration took a couple of hours before the services at the beginning of the pipeline could be started again, now writing to the new database.

The (almost) shutdown

“A challenge is a chance to rise to the occasion.” — Unknown

After all the data was migrated and all services were using ScyllaDB, we made sure there were no more reads or writes hitting our old database. After a safe period of time, our old database's data nodes were shut down one by one. All of our APIs and most of our writing services were now highly available and dependent on ScyllaDB alone.

However, we still weren't quite done. Our transaction locking mechanism still relied on the master node of our old database. So, some services still had to hold a connection to our old database, with just that one master node left standing, alone and fragile.

Unfortunately, this kind of locking mechanism couldn’t be replaced with Scylla’s lightweight transactions. At least, not directly.
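
For context, here is roughly what a lock built on Scylla's lightweight transactions (conditional writes) looks like, sketched with a hypothetical locks table; as the paragraph above says, this kind of conditional insert was not a direct replacement for our mechanism.

```python
# Hypothetical LWT-based lock attempt: the insert only succeeds if no one
# else currently holds the lock row; the TTL is a safety net against crashes.
from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect("storage")

session.execute("""
    CREATE TABLE IF NOT EXISTS locks (
        name  text PRIMARY KEY,
        owner text
    )
""")


def try_lock(name: str, owner: str) -> bool:
    result = session.execute(
        "INSERT INTO locks (name, owner) VALUES (%s, %s) IF NOT EXISTS USING TTL 30",
        (name, owner),
    )
    return result.one()[0]  # the first result column is the [applied] flag
```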

Coming up with a solution for replacing our current transaction mechanism was the biggest challenge we had to face in the whole migration process. We tried different out-of-the-box solutions and none worked for our use case. Once again, we needed something custom. Read about it in part 3.
