SPOF: Deactivating a future problem, now

Emiliano Perez
Published in etermax technology
Nov 22, 2019 · 7 min read

What if your games needed to use a single instance MariaDB database?

A database that you can neither restart, nor perform maintenance operations on. A database with so much data that restoring it if a failure occurs can take up to 6 hours. A database running on a server which hasn’t been updated for years.

How can a Software Engineer work effectively under these parameters? Well, the short answer is they can’t. We need to remove this single point of failure.

After a lot of hard work, we were able to tackle this problem. Now we want to tell you how we managed to transform our Single Point of Failure (SPOF) into a high-availability service. The best part is that we did it without any downtime at all.

First off, a short introduction

We are Etermax, an international company that independently develops social games for mobile platforms. With a team of over 350 employees in our offices in Buenos Aires, Berlin, Montevideo, Sao Paulo and Mexico, we are Latin America’s fastest growing game development company.

Trivia Crack is a world-renowned game, and the brand’s licensed products are available in the top markets. Five years after the original’s launch, Trivia Crack 2 arrived and quickly climbed to the top of the download charts as well.

While conceptually simple and straightforward, this migration had a deep business impact: it involved our biggest questions database, the one behind these games (among others). To put its scale in numbers:

  • 500M+ downloads on mobile app stores
  • 26M+ questions with thousands more added on a daily basis
  • 170B+ questions answered to date, and counting!

We’ve had all kinds of issues with our self-managed database

We were stuck on AWS’s ancient paravirtual (PV) virtualization, since we weren’t able to upgrade it. This meant using previous-generation machines, which only get more expensive over time.

Database software maintenance worked on a “look but don’t touch” basis: our day-to-day operations were plagued by manual procedures designed to avoid reboots.

Since a non-clustered database can’t be scaled horizontally, the VM supporting it was oversized. This also meant generating manual copies from snapshots before performing intensive operations.

Backups? Yes, we had them, fully automated, but the recovery process was slow and brittle. It could also involve up to 6 hours of downtime in some cases, which was unacceptable to us.

Every journey begins with a single step

We had a lot of work ahead of us, so we needed to plan our steps to do it right and on time. Our goal was to get rid of the SPOF, but we couldn’t afford to stop working on our products and adding value to them. We needed to migrate to a managed (DBaaS) solution as soon as possible.

First things first, we needed to communicate the importance of this task to all teams involved, so we could align efforts and know how to organize and prioritize our work.

One of the first things we did once we started this challenge was to define what we needed for a successful migration. Our constraints for the database migration were clear at this point:

  • Zero downtime
  • No new code in the applications involved
  • The old and new databases must stay in sync
  • Consistency checks on both ends

We work primarily on AWS, so we had to choose one of their DBaaS solutions: Aurora or MariaDB on RDS. Then we needed a tool for the migration itself, and we came across AWS Database Migration Service (DMS). This service had everything we needed: ongoing replication, MySQL compatibility, and validation of the migrated data.
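For illustration, here’s a minimal sketch of such a migration task using boto3 (the AWS SDK for Python). The region, ARNs, identifiers and schema wildcard below are placeholders, and this isn’t necessarily how we drove DMS ourselves; it just shows the three pieces we relied on: a full load, ongoing replication (“full-load-and-cdc”), and data validation enabled in the task settings.

import json

import boto3  # AWS SDK for Python

dms = boto3.client("dms", region_name="us-east-1")  # region is a placeholder

# Enable data validation so DMS compares rows between source and target.
task_settings = {"ValidationSettings": {"EnableValidation": True}}

# Wildcard selection rule: include every schema and table.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

# The endpoint and replication-instance ARNs are placeholders for resources
# created beforehand.
dms.create_replication_task(
    ReplicationTaskIdentifier="questions-db-migration",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy + ongoing replication
    TableMappings=json.dumps(table_mappings),
    ReplicationTaskSettings=json.dumps(task_settings),
)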

Having decided the technology and migration tools, we established a roadmap to test and verify these new tools before moving to production:

  • Define how to change the database endpoint for the different projects that access the production database without downtime
  • Migrate our development environment to Aurora/RDS to verify that our current use of the database is compatible
  • Migrate the staging questions database to test a bigger dataset
  • Use DMS to migrate a replica of the questions database
  • Define the migration process for the production database
  • Define a rollback strategy
  • Migrate the production database

We estimated about two weeks for these steps to be completed; however, reality always knows how to mess up your plans.

Set sail, follow the path and… crash

We had to discard Amazon Aurora, since our MariaDB version was too old for a direct migration. Instead, we deployed a classic Amazon RDS for MariaDB instance in a Multi-AZ configuration. This was an instant fix for our availability issue.
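For those curious, standing up a Multi-AZ RDS for MariaDB instance programmatically looks roughly like the boto3 sketch below. The identifier, instance class, storage size and credentials are placeholders, not our production values.

import boto3

rds = boto3.client("rds", region_name="us-east-1")  # region is a placeholder

rds.create_db_instance(
    DBInstanceIdentifier="questions-db",   # hypothetical name
    Engine="mariadb",
    DBInstanceClass="db.r5.2xlarge",       # illustrative instance class
    AllocatedStorage=500,                  # GiB, illustrative
    MultiAZ=True,                          # synchronous standby in another AZ
    MasterUsername="admin",
    MasterUserPassword="<secret>",
    BackupRetentionPeriod=7,               # automatic backups, in days
)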

We had to temporarily remove foreign-key constraints, since they aren’t compatible with DMS: the initial copy stage doesn’t process tables in any particular order, so inserts into tables with constraints fail when the referenced table doesn’t have matching records yet. Having removed the constraints (but not the keys), the initial copy succeeded; we would add them back at the end.
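In MariaDB/MySQL this boils down to ALTER TABLE statements. The snippet below is only a sketch, assuming pymysql as the client and hypothetical host, table and constraint names; DROP FOREIGN KEY removes the constraint while the underlying index (the key) stays in place.

import pymysql  # assumption: any MySQL/MariaDB client would do

# Connection details and object names below are hypothetical.
conn = pymysql.connect(host="target-rds.example.com", user="admin",
                       password="<secret>", database="questions_db")

with conn.cursor() as cur:
    # Drop the constraint on the target before the full load...
    cur.execute("ALTER TABLE answers DROP FOREIGN KEY fk_answers_question")

    # ...and once the migration has finished, add it back:
    # ALTER TABLE answers
    #   ADD CONSTRAINT fk_answers_question
    #   FOREIGN KEY (question_id) REFERENCES questions (id);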

The ongoing replication stage allowed us to stream changes to the new database in real time. What we hadn’t anticipated, though, was that the huge tables would never catch up. Some tables with almost a billion records were stuck in the “pending records” state: DMS couldn’t keep up with the new records being added, so the validation stage couldn’t start.

We had to purge all non-critical information from the database. We got rid of any unused data and moved the statistical information to a separate database. Some of these statistical tables were very heavy, so this also saved a lot of time on the initial copy.

Now the ongoing stage was up to date, but the tables were stuck in the “pending validation” state. At this point we were blocked for a while. We tried to validate them manually, but the tables were too big for the mysqldbcompare tool. We also tried scaling up the migration and destination instances as much as we could, with no luck. We had to reach out to AWS Support about this issue, and they helped us get back on track.

Long story short: there was a data validation issue in DMS v3.1.2. We switched to a newer version and, ta-dah, the validation stage started moving. However, even on the biggest machines available, it was excruciatingly slow. The support team explained that the number of validation threads was not modifiable… unless we updated the task settings through the AWS CLI. Learning and implementing this change sped up the process slightly.
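The setting in question lives in the task’s ValidationSettings block. The sketch below shows the equivalent change through boto3 rather than the CLI we used, with a placeholder task ARN and an illustrative thread count; note that a task typically has to be stopped before its settings can be modified.

import json

import boto3

dms = boto3.client("dms", region_name="us-east-1")  # region is a placeholder

# Bump the number of validation threads; 8 is purely illustrative.
settings = {"ValidationSettings": {"EnableValidation": True, "ThreadCount": 8}}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:...:task:QUESTIONS",  # placeholder ARN
    ReplicationTaskSettings=json.dumps(settings),
)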

We outlined an SLO, time budgets and acceptable error rates for the entire process. To stay within the time budget, we could only validate the 3 critical tables (which were also some of the biggest). Since it’s not possible to disable validation per table, we had to create a separate DMS task for these tables.
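Restricting a task to specific tables is done through its table-mapping rules. Below is a sketch of the mappings such a dedicated task would use; the schema and table names are hypothetical stand-ins, not our real ones.

import json

# Hypothetical stand-ins for the three critical tables.
critical_tables = ["questions", "answers", "translations"]

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": str(i + 1),
            "rule-name": f"include-{table}",
            "object-locator": {"schema-name": "questions_db", "table-name": table},
            "rule-action": "include",
        }
        for i, table in enumerate(critical_tables)
    ]
}

# Passed as the TableMappings parameter when creating the dedicated task.
print(json.dumps(table_mappings, indent=2))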

At this point, finally, both migrations were completed. We now needed to point our APIs to the new database endpoint. Since we still wanted to avoid downtime, we began a Blue/Green deployment. The updated APIs needed to coexist with the old ones, adding rows on both ends. This was our first point of no return.

Something to notice about auto-increment keys is that they only go up. So if your auto-increment value is 10, and you insert a row with id=3, your auto-increment value will still be 10. If you insert a row with id=10 or an automatic id, your auto-increment will move up to 11.

With this in mind, we decided to bump the auto-increment value on all the RDS tables to create a gap. We could now receive rows from the origin and insert new values on RDS at the same time, as long as the origin’s counter didn’t catch up with the gap. We launched the deployment immediately and deprecated the old APIs gradually.
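Concretely, that gap is just a raised AUTO_INCREMENT value on each RDS table. The sketch below shows the idea, assuming pymysql as the client and a hypothetical host, table name and gap size.

import pymysql  # assumption: any MySQL/MariaDB client would do

conn = pymysql.connect(host="questions-db.rds.example.com", user="admin",
                       password="<secret>", database="questions_db")

GAP = 10_000_000  # illustrative; must exceed the rows still coming from origin

with conn.cursor() as cur:
    cur.execute("SELECT MAX(id) FROM answers")            # hypothetical table
    next_id = (cur.fetchone()[0] or 0) + GAP
    # New rows written through the updated APIs start above the gap, while
    # rows replicated from the origin keep their lower ids.
    cur.execute(f"ALTER TABLE answers AUTO_INCREMENT = {next_id}")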

Here’s what we ended up with:

  • Our MariaDB database is now hosted on RDS, managed by the Amazon team
  • It’s hosted using Multi-AZ, which offers us high availability and zero-downtime updates
  • Automatic backups and synced replicas on-demand
  • Lower costs thanks to easier resource management and newer compute families
  • More visibility on the network, query throughput, and I/O usage thanks to CloudWatch
  • Extra scalability by the usage of read replicas, storage expansion up to 64 TB, and the ability to reserve IOPS
  • Expertise with DMS, which has made it much easier to migrate more databases to either EC2 or RDS

This whole process allowed us to migrate and validate over 200 GiB of raw data, with zero downtime, in less than 48 hours.

However, it actually took us about a month, accounting for all the attempts and issues we had to go through. We hope that our experience can help you save valuable time.

Authors: Victor Rodriguez, Emiliano Perez and Santiago Salvatore
