Large Scale NoSQL Database Migration Under Fire
This post describes how we migrated a large NoSQL database from one vendor to another in production, without any downtime or data loss. The methodology described here worked for us and proved to be safe and simple. I am sure there are several techniques for migrating NoSQL databases, but here are some guidelines for doing so with minimal risk, while maintaining the ability to roll back at almost any point in the process. This post is based on a talk I gave at DevopsDays in Nov 2017.
At AppsFlyer, we have many NoSQL databases. AppsFlyer's data processing pipeline updates an internal state as it processes the events we receive from mobile devices. This state is mission critical and is processed at high throughput.
One of those databases is our app-installs database. It takes a write hit for roughly every new install that AppsFlyer attributes, and an additional read hit whenever AppsFlyer receives any launch or in-app event.
Some numbers (true at the time of writing):
- ~2,000 write IOPS
- ~180,000 read IOPS
- ~2 billion records stored with a replication factor of 3
And, yes, it is mission critical.
But the time had come to replace it. It brought us very far, and for that it deserves credit. But lately, as traffic and the resulting throughput demand have risen, it just has not been able to deliver the stability we need.
Here are a few examples of that instability:
- Scaling out went out of control: we were using 45 r3.4xl instances.
- Daily backups could not keep up and took longer than a day to complete.
- XDCR stopped working.
- Losing a node (and adding one back) came at a performance hit and caused system instability (and with so many nodes on AWS, the probability that something will go wrong gets very high).
- We were not happy with the answers we got from the paid support.
These were all issues we could no longer live with.
Migrating data is always hard. It is hard enough with relational databases, and of course harder to do in production, while the database(s) are in use. But doing so with NoSQL databases is even more complex, because among the dozens of NoSQL technologies out there, there is nothing in common: no common way to serialize or represent data, no standard query language, no schema, and when it comes to out-of-the-box migration tools, you are all alone.
So after some research and trials we chose Aerospike. We had a few months until our enterprise plan with Couchbase was due to expire, and we started planning. Since most vendors, for obvious reasons, do not encourage migrating data out of their database, we were left with two basic options: write a migration tool ourselves ('out-of-band migration') or use the application to migrate the data for us ('in-band migration'). In many cases, one of these methods alone is sufficient, for reasons I will enumerate shortly; in other cases, you will have to use a combination of the two, as we did.
First phase (migrating inserts and updates):
The basic idea here is to use your application code to duplicate both inserts and updates ('upserts') to your new database. In other words, every write or update performed on the old database is also performed on the new one. This effectively starts the migration of all of your new data to the new cluster going forward. In many cases, a delete operation counts as a write here, at least from the application's point of view.
This has a few implications:
- You are changing your application code. This may impact the application performance, introduce new bugs, and complicate existing code. Take those risks into account.
- The data being written into the new database is completely redundant. If (when) you discover a bug, it is okay to just wipe the entire database and start over.
- The data is still being served from the good, old database, so the risk of data loss/corruption is low.
- You can start with a small cluster and grow (scale out) as you go. This is a good opportunity to practice scaling out.
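The dual-write idea above can be sketched roughly as follows. This is a minimal illustration, not our actual code: `DualWriteStore`, `old_db`, and `new_db` are hypothetical names, and plain dicts stand in for the Couchbase and Aerospike clients:

```python
class DualWriteStore:
    """Sketch of the first phase: writes go to both databases,
    reads still come only from the old one."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def upsert(self, key, record):
        # The old database remains the source of truth.
        self.old_db[key] = record
        # Best effort: a failure here must never break the main write
        # path, since the new database's data is still fully redundant.
        try:
            self.new_db[key] = record
        except Exception:
            pass  # in a real system: log and emit a metric

    def delete(self, key):
        # A delete counts as a write: it must be mirrored, or the new
        # database will keep records the old one has already dropped.
        self.old_db.pop(key, None)
        try:
            self.new_db.pop(key, None)
        except Exception:
            pass

    def get(self, key):
        # During this phase, serving is unchanged.
        return self.old_db.get(key)
```

Note that the new-database write is deliberately swallowed on failure: during this phase its data is disposable, so the safest trade-off is to never let it affect the production path.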
Testing the first phase:
This is the time to do some data sanity tests:
Compare the migrated data between the two databases and verify it meets your expectations. Note that many databases have different semantics for record TTLs, if you use them.
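One way to run such a sanity test is a sampled comparison between the two databases. A minimal sketch with made-up names, where dicts mapping keys to (value, TTL) pairs stand in for the real clients; the TTL tolerance hedges against the differing TTL semantics mentioned above:

```python
import random

def compare_sample(old_db, new_db, keys, sample_size=100, ttl_tolerance=5):
    """Spot-check that migrated records match across the two databases.

    old_db/new_db map key -> (value, ttl_seconds). Records absent from
    the new database are skipped: in the first phase, only new writes
    have been migrated.
    """
    mismatches = []
    for key in random.sample(list(keys), min(sample_size, len(keys))):
        old = old_db.get(key)
        new = new_db.get(key)
        if old is None or new is None:
            continue  # not migrated yet, nothing to compare
        old_val, old_ttl = old
        new_val, new_ttl = new
        # Allow some TTL drift to absorb clock skew and vendor semantics.
        if old_val != new_val or abs(old_ttl - new_ttl) > ttl_tolerance:
            mismatches.append(key)
    return mismatches
```

Running this periodically, and alerting when the mismatch list is non-empty, catches serialization and TTL-conversion bugs while the new database is still disposable.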
Get to know the new technology:
- Look at the logs
- Set up automatic backups and restore from them!
- Inspect the actual database RAM and disk usage and see that it meets your initial sizing expectations.
- Time to wire up your metrics system and configure the proper alerting.
- And yes, you guessed it: test your alerts. Remember, after all, this is soon to be your shiny new production database.
Second phase (serving from the new database):
This phase is a bit trickier, but still very simple. Once you have enough confidence in the new database, it is time to start serving from it. The idea here is to serve data from the new database and, if you fail to find a record there, fall back to the old database and serve from it.
An important note: configure a metric that counts the instances where a record was not found in the new database but was found in the old one. Such a miss means the record has not yet been migrated by the application. It also means that once this metric stops rising for a long enough time, the migration is over.
This phase has a few implications:
- If your new database crashes, the application code should quickly fail over to the old cluster.
- Note that the old database is still updated with all data. This means that it is still safe to go back and start from scratch if (when) you find that something went wrong.
- This mechanism might have latency and other performance implications. Nothing is for free, you see.
- This is the first true combat test of your new database. For safety, if possible, the new database should be rolled out gradually. Try to serve any non-VIP data from it first, such as your beta users, or even generated test data. The important thing is that the birth pains of your new database will only affect your less important data.
- If you are in a hurry, trigger an explicit migration for every such miss-and-hit: for every miss-and-hit event, copy the missing record from the application into the new database. This speeds up the migration process. Make sure you can support the additional I/O load, both on the databases and on the application servers.
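Putting the serve-with-fallback logic, the miss-and-hit metric, and the optional explicit migration together, a rough sketch might look like this (illustrative names only; dicts stand in for the database clients, and a plain counter stands in for a real StatsD metric):

```python
class FallbackReader:
    """Second phase: serve from the new database, fall back to the old."""

    def __init__(self, old_db, new_db, migrate_on_miss=False):
        self.old_db = old_db
        self.new_db = new_db
        self.migrate_on_miss = migrate_on_miss
        # In production this would be a StatsD counter, not an attribute.
        self.miss_and_hit = 0

    def get(self, key):
        record = self.new_db.get(key)
        if record is not None:
            return record
        record = self.old_db.get(key)
        if record is not None:
            # Missed in the new database but found in the old one:
            # this counter flat-lining is the "migration is over" signal.
            self.miss_and_hit += 1
            if self.migrate_on_miss:
                # Explicit migration on miss-and-hit, to speed things up.
                # setdefault mimics a create-only write: never overwrite
                # a record the dual-write path already migrated.
                self.new_db.setdefault(key, record)
        return record
```

A real implementation would also wrap the new-database read in a try/except so that a crash of the new cluster degrades into serving everything from the old one.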
Testing time again:
Whatever you tested after the first migration phase, reiterate. The load should be much higher, and the database should be larger as well, in order to support the new read-load capacity. If you can afford it, it is highly recommended to leave it working for a few days, or even weeks. Traffic may vary, and some errors might appear only after some mileage.
Is that it?
For quite a lot of use cases this is enough. If your records are short-lived and/or the application naturally hits every record frequently, you just need to wait it out. Sit back and watch the miss-and-hit metric diminish to zero. To follow the frequency of that metric firing, we used our metrics stack, which includes StatsD, Graphite and Grafana. The stack itself does not matter, as long as you can track the frequency of this metric over time.
A record-by-record explicit migration:
For more sensitive or complex use cases, you will need to make sure that the entire set of records has been migrated to the new database before the sun sets on the old one. What we had to do now was enumerate the entire set of keys in the old database and make sure we migrated the missing ones.
Detour: NoSQL database views
Similar to views in relational databases, most NoSQL databases offer some way to index and query the data. Since the underlying data structure is usually hash-table-like, they let the user create an efficient index over it. To create the index, most APIs let you write some code that is executed on each key and its corresponding record and emits some value, which eventually becomes the queryable data. For example, suppose you have a table of people in your database; you could create an index over a person's last name simply by writing a small function that extracts the last name from a record. You can then query the database for all people with a specific last name.
It is important to understand here a few points:
- The index is an additional data structure that needs to reside somewhere, usually in RAM.
- The index may or may not be updated during each new write/delete/update after it is created.
This implies that creating an index might have a non-trivial performance effect on your database. Creating one just to dump the set of keys, if done recklessly, can easily bring a database to its knees. This is also why, although possible, you should never include the entire record in the index: it could effectively duplicate your entire database into RAM.
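To make the emit-style indexing concrete, here is a toy in-memory sketch (not any vendor's actual API; all names are invented). It also shows why the index lives in RAM, and why you should emit only the fields you need rather than the whole record:

```python
def build_index(db, emit_fn):
    """Build a view-style index by running emit_fn over every record.

    emit_fn(key, record) yields the values to index by; the result maps
    each emitted value to the list of keys that produced it. The whole
    index is held in memory here, which is exactly the cost discussed
    above: emit small values, never the full record.
    """
    index = {}
    for key, record in db.items():
        for value in emit_fn(key, record):
            index.setdefault(value, []).append(key)
    return index

# Toy data set for the "index people by last name" example.
people = {
    "p1": {"first": "Ada", "last": "Lovelace"},
    "p2": {"first": "Grace", "last": "Hopper"},
    "p3": {"first": "Alan", "last": "Lovelace"},
}

# Emit only the last name, never the whole record.
by_last_name = build_index(people, lambda key, rec: [rec["last"]])
```

Querying `by_last_name["Lovelace"]` then returns all keys sharing that last name, which is the same shape of query a real database view answers.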
End of detour.
We (ab)used this feature to create an index with the record TTL as the key and the record id (key) as the data. This gave us a simple way to run a "select *"-style query over the index, returning the entire set of keys (ids) in the database. As a bonus, Couchbase sorted the set by TTL, which, in our case, was useful.
The next phase was to create a simple setup composed of two services: a dumper and a loader. Decoupling these two services with Kafka as a buffer achieves two important goals:
- Resumability: the option to start and stop each part of the migration process whenever you like.
- Independent and controllable scalability: if some part of the migration process needs to be scaled (e.g., it is lagging in processing, or it creates too much load), just add or remove machines.
Algorithm for the dumper:
- While there are more keys in the index:
  - Take the next key (and any needed record metadata, if required) and publish it to Kafka.
Algorithm for the loader:
- Consume a key + TTL from Kafka.
- Query Couchbase for the record.
- Convert the TTL into a format appropriate for Aerospike.
- Insert the record into Aerospike (but do not overwrite existing keys).
Basically, the dumper just queried the index and streamed the set of keys + TTLs to an Apache Kafka topic. The loader consumed the Kafka topic, did a simple transformation on the TTLs to adapt them to Aerospike's format, queried Couchbase for each key, and loaded the record into Aerospike with a CreateOnly policy, making sure we did not overwrite records that had already been migrated by the application.
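The dumper/loader pair can be sketched as follows. This is an illustration with invented names: an in-memory deque stands in for the Kafka topic, dicts stand in for the Couchbase and Aerospike clients, `setdefault` mimics the CreateOnly write policy, and the real TTL conversion is elided:

```python
from collections import deque

def dump_keys(ttl_index, topic):
    """Dumper: stream (key, ttl) pairs from the TTL index to the topic.

    ttl_index is the (ttl, key) view described above; topic is a deque
    standing in for the Kafka topic.
    """
    for ttl, key in ttl_index:
        topic.append((key, ttl))

def convert_ttl(couchbase_ttl):
    # Placeholder: the real transformation adapts Couchbase TTL
    # semantics to Aerospike's format; it is elided here.
    return couchbase_ttl

def load_records(topic, old_db, new_db):
    """Loader: consume keys, fetch each record from the old database,
    and insert into the new one with create-only semantics."""
    while topic:
        key, ttl = topic.popleft()
        record = old_db.get(key)
        if record is None:
            continue  # expired or deleted since the dump was taken
        # CreateOnly: records already migrated by the application win,
        # so we never clobber fresher data with a stale copy.
        new_db.setdefault(key, (record, convert_ttl(ttl)))
```

Because the topic buffers the key stream, either side can be stopped, restarted, or scaled independently, which is exactly the resumability and scalability argument above.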
All that was left was to sit back and watch the miss-and-hit metric diminish to zero. After a few days, the lights went down on the mighty Couchbase cluster.
This project took us about six weeks end to end. We made many mistakes along the way and had to restart it several times. One of the most important takeaways here is that this is a software project. It is no different from any other software project. It is important that someone owns it end to end, and that person should preferably be an engineer who knows the involved application code well. A dedicated devops engineer will probably also be required, especially when adopting a new database technology into the software stack. Additional stakeholders should be involved if the risks taken here affect them, too.
To sum it all up:
- Listed above is a safe methodology for migrating NoSQL databases from one technology to another, or even between two clusters of the same technology.
- This process requires time and patience, but is completely doable.
- What worked for us is surely not the only way to do it, or even necessarily the best one, but we wanted to share it so that others will not be afraid of eternal vendor lock-in.