Migrating Airtable to MySQL 8.0

Andrew Wang
The Airtable Engineering Blog
16 min read · Jun 1, 2022

By Doug Forster, Bin Gao, Brian Larson, Martin Michelsen, Doci Mou, Alexander Sorokin, and Andrew Wang

Introduction

Last year, we upgraded Airtable’s MySQL databases from 5.6 to 8.0. A MySQL major version upgrade always requires careful planning, testing, and deployment, and this one was even harder because we jumped two major versions in one go. It was a massive undertaking by our infrastructure team, and it took the better part of a year to pull off.

As part of this project, we enhanced our MySQL infrastructure with powerful new testing and operational capabilities, and learned a lot about MySQL internal implementation details along the way. We’re sharing our experiences and learnings in this blog post.

Background

MySQL is Airtable’s primary storage system. We like MySQL since it’s a reliable, high-quality, and stable technology. We run MySQL using AWS RDS, and have made significant investments in our MySQL infrastructure to make it easier to provision, manage, and monitor our databases.

We decided to upgrade to MySQL 8.0 because it contained a number of attractive new features and enhancements over 5.6:

  • Improved support for online schema changes with ALGORITHM=INSTANT. This allows performing some types of schema changes (including adding a column, our most common type of schema change) as a fast metadata-only operation (see the example after this list).
  • The JSON data type. Historically, we’ve used text columns containing JSON blobs to work around MySQL’s lack of online schema changes, particularly for large tables.
  • Other new features like common table expressions, window functions, SKIP LOCKED, and more types of indexes, which are situationally useful.
  • General performance improvements. Oracle reports that MySQL 8.0 is up to 2x faster for some workloads.
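To make the first two items concrete, here is a minimal SQL sketch of an instant schema change and a native JSON column. The table and column names are illustrative, not our actual schema:

    -- Add a column as a metadata-only operation (no table rebuild or copy).
    ALTER TABLE records
      ADD COLUMN archived_at DATETIME NULL,
      ALGORITHM=INSTANT;

    -- A native JSON column instead of a TEXT column holding a serialized blob.
    CREATE TABLE record_metadata (
      record_id VARCHAR(32) NOT NULL PRIMARY KEY,
      metadata  JSON NOT NULL
    );

    -- JSON values can then be queried server-side.
    SELECT record_id, metadata->>'$.color' AS color FROM record_metadata;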

Going into this project, we knew it would be a major undertaking. Any significant infrastructural change comes with significant risk, and the fact that we were upgrading our primary storage system raised the stakes even further. It was paramount that we avoid any issues in terms of correctness, performance, or availability that could affect our customer-facing workloads.

Automated and pre-production testing

The first phase of the project focused on running our existing automated test suites against MySQL 8.0. We configured our CI system to run all our tests against both MySQL 5.6 and 8.0 for every pull request. A failure with MySQL 8.0 would issue a warning, but not block merging. We also upgraded some of our non-production test environments.

This testing identified a number of syntactic or compatibility issues which were relatively straightforward to address:

  • A number of new keywords and reserved words were added since 5.6. One of our tables had a column named grouping, but GROUPING became a reserved word in 8.0.1. We instrumented our code to identify all the affected queries and added backticks to escape the column name, as shown in the example after this list.
  • User management in 8.0 is incompatible with 5.6. This was a problem of our own making: we skipped 5.7 (which acts as a bridge release) and were trying to replicate directly from a 5.6 primary to an 8.0 replica. We made some progress by limiting ourselves to a shared subset of functionality, but ultimately punted on solving this problem, since adding new MySQL users is a relatively rare operation.
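For example, a query touching the hypothetical grouping column needed its identifier escaped once it ran against 8.0 (the table name below is made up):

    -- Parses on 5.6, fails on 8.0 because GROUPING is now a reserved word:
    --   SELECT id, grouping FROM view_configs;

    -- Escaping the column name with backticks works on both versions:
    SELECT id, `grouping` FROM view_configs WHERE `grouping` IS NOT NULL;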

During this phase of the project, we also spent significant time building dashboards, monitoring, and operational tooling for provisioning and upgrading MySQL 8 instances.

Testing with production workloads

After a few months of successful automated and pre-production testing, we were confident that there were no basic issues with our application and MySQL 8. However, we still needed to understand how MySQL 8 would perform under a production workload with higher scale and a wider variety of queries.

Our two most important MySQL clusters are the (1) base shards, which store base-scoped data related to Airtable bases like tables and cells, and (2) our main shard, which stores non-base-scoped data like user and workspace information. For historical reasons, we also call base shards application live shards or live shards internally.

It’s fairly safe for us to test new base shards in production. Our internal shard management system lets us control how new bases are assigned to shards, and we can also transfer existing bases between shards in response to load, hardware issues, or other system events. This means we can slowly ramp traffic to a single shard, and if something goes wrong, we can safely transfer the bases elsewhere and decommission the ailing shard.

Using this functionality, we spun up a new base shard running MySQL 8 and transferred a few internal-only bases. After a week of no anomalies (based on our monitoring and user reports), we started assigning new customer bases on the MySQL 8 shard as well.

As traffic started ramping up on our MySQL 8 base shard, we encountered two major problems which we describe below.

Problem 1: Large WHERE IN clauses

We discovered a performance issue where queries with large WHERE IN clauses were optimized differently in MySQL 8. Some of our queries use WHERE IN clauses with lists of hundreds or even thousands of values. On MySQL 5.6, the query optimizer would use an appropriate index for filtering, but on MySQL 8, the optimizer would instead choose a full table scan. Our theory is that the MySQL 8 range optimizer uses more memory than 5.6’s, causing these queries to exceed the range_optimizer_max_mem_size threshold. The fix was simple: we increased range_optimizer_max_mem_size, and our queries were once again using an index.
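A sketch of the tuning follows; the value is illustrative (the post doesn’t prescribe a specific number), and on RDS the change is made through the parameter group rather than SET GLOBAL:

    -- Default is 8388608 bytes (8 MiB); raise the ceiling so large IN lists
    -- still fit in the range optimizer's memory budget. 64 MiB is illustrative.
    SET GLOBAL range_optimizer_max_mem_size = 67108864;

    -- Confirm the index is chosen again for a large IN list (hypothetical table):
    EXPLAIN SELECT * FROM cells
    WHERE row_id IN (1, 2, 3 /* ...hundreds or thousands of values... */);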

Problem 2: The tale of the upserts

This upsert issue was the single most difficult bug we found during our upgrade process.

We noticed a significantly higher rate of deadlocks and slow queries on our MySQL 8 shard. This would happen unpredictably for a few minutes at a time, during which the server would become very slow. Most queries, even innocuous-looking ones, would take 30+ seconds to complete. The performance metrics and InnoDB status monitor output (from SHOW ENGINE INNODB STATUS) showed a large number of transactions being rolled back, but it wasn’t clear if this was the root cause or just a symptom of an underlying issue.

We added instrumentation to sample InnoDB monitor output, and analyzed the lock wait graphs implied by the running transactions during the periods of slowness. After investigating multiple dead-ends, we made the critical observation that most of the blocked queries were INSERT … ON DUPLICATE KEY UPDATE queries on tables that had a unique secondary index, and the key conflicts were occurring on the unique secondary index (not the primary key).
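Our instrumentation sampled the output of SHOW ENGINE INNODB STATUS; on MySQL 8, the performance_schema lock tables offer another view of the same wait graph. The query below is a generic way to see who is blocking whom, not our exact tooling:

    -- The classic monitor output:
    SHOW ENGINE INNODB STATUS\G

    -- MySQL 8.0: which transactions are waiting on which locks
    -- (replaces information_schema.innodb_lock_waits from 5.x).
    SELECT
      w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
      w.BLOCKING_ENGINE_TRANSACTION_ID   AS blocking_trx,
      l.OBJECT_NAME,
      l.INDEX_NAME,
      l.LOCK_MODE
    FROM performance_schema.data_lock_waits AS w
    JOIN performance_schema.data_locks      AS l
      ON l.ENGINE_LOCK_ID = w.BLOCKING_ENGINE_LOCK_ID;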

Additional digging revealed a MySQL bug report describing a fix, released in 5.7.26, that changed the locking behavior of INSERT … ON DUPLICATE KEY UPDATE queries. The fix reduced the lock scope in the case of key conflicts, which had the side effect of allowing transactions to interleave more closely. This meant a higher likelihood of lock waits and deadlocks for workloads such as ours.

In our case, the InnoDB monitor output samples showed that the problem was contention on the next-key locks on unique indexes. When certain pathological sequences of deadlocks occurred, some very large transactions would roll back, effectively blocking other transactions while they did so.

A comment by Jakub Lopuszanski on the MySQL bug report outlined some possible workarounds. We considered two of these options: restructuring our application workload to not use AUTO_INCREMENT primary key columns, and reducing the isolation level from REPEATABLE READ to READ COMMITTED.

At Airtable we typically use application-generated random IDs for the primary key, but we still have legacy tables that have an AUTO_INCREMENT primary key (along with a unique index on the application-generated ID). We considered fully migrating off of AUTO_INCREMENT primary keys to solve gap lock contention, but this would have required a lot of application-level changes (and many schema changes), and would still not fully solve the problem for tables with multiple unique indexes.
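To illustrate the pattern with a made-up table, the problematic shape is an AUTO_INCREMENT primary key combined with a unique secondary index that the upserts actually conflict on:

    -- Hypothetical legacy table: surrogate AUTO_INCREMENT primary key plus a
    -- unique index on the application-generated random ID.
    CREATE TABLE collaborators (
      id              BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      collaborator_id VARCHAR(32) NOT NULL,
      role            VARCHAR(32) NOT NULL,
      UNIQUE KEY uniq_collaborator_id (collaborator_id)
    );

    -- Concurrent upserts whose duplicate-key conflicts land on the UNIQUE
    -- secondary index (not the primary key) are the pattern that deadlocked:
    INSERT INTO collaborators (collaborator_id, role)
    VALUES ('colXXXXXXXXXXX', 'editor')
    ON DUPLICATE KEY UPDATE role = VALUES(role);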

Instead, we decided to reduce our transaction isolation level from REPEATABLE READ to READ COMMITTED. In READ COMMITTED, most operations do not use gap locks at all, making deadlocks far less likely. The main drawback of switching is that we lose the snapshot isolation provided by REPEATABLE READ.
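Lowering the isolation level is a one-line change at the session or server level. The post doesn’t detail exactly how we rolled it out, and on RDS the server-wide setting lives in the parameter group, but in plain SQL it looks like this:

    -- Per connection:
    SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

    -- Server-wide (affects new connections only):
    SET GLOBAL transaction_isolation = 'READ-COMMITTED';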

Fortunately, we were able to exploit a unique property of our base shard workload. Base shards are responsible for storing base-scoped data, and each base’s operations are serialized through a single NodeJS server process. Conceptually, this means we should never have multiple concurrent read or write operations for base data, so we don’t need the stronger snapshot isolation properties of REPEATABLE READ.

To verify our understanding, we implemented a client-side check in our database access layer to generate a warning if we issued a query that was not scoped to a single application (i.e. missing a WHERE applicationId = ? clause). This check caught about ten cases in the wild, none of which were real bugs; they only required cosmetic fixes.

After fixing these warnings, we enabled READ COMMITTED for our base shards in all environments, and saw no more upsert-related performance issues.

Record / replay

For our main shard, we had to take a different strategy for production testing. In general, making changes to main is very risky. Since we only have a single main shard, we can’t gradually ramp traffic up and down like we did with the base shards. This means if something goes wrong, it can take down the entire site.

We decided to build a record/replay framework so we could perform offline testing against a snapshot of our main shard with a recorded trace of our real production workload. We briefly considered other options as well:

  • Synthetic testing. It’s difficult to construct realistic synthetic workloads, and we were interested in the diversity of a real production workload. We also had already run our existing synthetic workloads (our automated test suites) against MySQL 8.
  • Double writing to an old and new database, then switching reads over to the new database. We decided against this approach because our application makes heavy use of transactions and ACID consistency guarantees. Also, although this scheme provides many nice operational properties, we felt it would be substantially more work than record/replay.

Recording and replaying real SQL traffic is a fascinating exercise because it’s fundamentally unsound. Slight perturbations and imprecision during either recording or replaying can result in different inter-arrival times or ordering of queries. This means the DB state will be different, the exact time when the queries hit the server will be different, and the order in which concurrent queries get locks will be different. Some inserts that worked in real life may fail in the replay and subsequent updates will then be no-ops.

Despite these limitations, we still expected record/replay to reveal any major issues with our main workload. Based on our experience with the base shards, we felt that we were already aware of the problematic query patterns (large WHERE IN clauses, concurrent upserts), and knew to keep an eye on metrics like query latency and error rate.

We had a few options for recording the traffic:

  • Recording at the MySQL level (e.g. via the slow query logger) is the most accurate since it captures every query that runs against the database. However, the downside is that it adds significant load to the database instance.
  • Recording at the ProxySQL layer is also very accurate, but it lacks application-level information about transactions and connection state that were important for replay fidelity. ProxySQL changes are also difficult for us to incrementally deploy and test.
  • Recording at the application layer would add extra overhead to running queries, but it’s a lot easier to test and roll out gradually.

Ultimately, we landed on recording at the application layer because of the ease of testing and gradual rollout. We augmented our MySQL client wrapper to log queries to local disk. Then, we shipped the query logs to S3 with logrotate.

Our application does a lot more reads than writes, but most of the reads are simple joins and point queries. We decided that we’d be okay with recording 100% of writes and just 10% of reads (to reduce overhead). We slowly ramped up recording over the course of a few weeks. We watched our server-side and application performance metrics during the rollout, and didn’t see any noticeable overhead.

To replay the traffic, we developed a simple replay client that could read a recorded query log and replay the queries against a snapshot of the main database. To drive enough load, we used Kubernetes to run multiple concurrent replay clients at the same time. The entire replay process was automated end to end: creating a new MySQL instance from a snapshot, downloading query logs to EFS, and running and coordinating the replay clients on Kubernetes.

The results from our full-scale replay test were illuminating:

  • We saw a small number of write errors (less than 1 in a million). After debugging, we determined these were artifacts introduced by the replay process (e.g. an INSERT followed by an UPDATE for the same record was recorded, but during replay the INSERT statement might take longer to finish, causing the UPDATE statement to be replayed before the INSERT statement finished).
  • We found and fixed a few more instances of the issue with large WHERE IN clauses we first encountered on the base shards. We increased the value of range_optimizer_max_mem_size on our main shard as well, and also took the opportunity to paginate some of these queries into smaller batches.
  • We also identified a few other inefficient queries during our performance investigation and collaborated with our application developers to optimize them. These optimizations included forcing the use of a specific index, issuing expensive queries less often, and restructuring queries to scan less data.

Throughout, the replay tooling made it easy to reproduce the above issues and also verify our fixes. We also added support for replaying at faster speeds, and load tested our main database with 3x and 5x speed replays.

While record/replay had its limitations, it still proved extremely useful. We identified and fixed real problems with inefficient query plans, and gained a lot of confidence that our write workload would execute correctly.

Downgrade

No infrastructure upgrade initiative would be complete without a backup plan. Even with all of our testing and validation efforts, we still wanted the ability to safely downgrade main to MySQL 5.6 in the event of a worst-case scenario.

MySQL replication chains before and after the upgrade

Our high-level plan was to set up a replication topology like the figure above. We wanted to replicate from our current MySQL 5.6 primary [A] to the new MySQL 8 replica [B], and then to the MySQL 5.6 fallback replica [C]. We then wanted to run consistency checks to make sure that the database state on all three hosts was identical. On upgrade, we would promote the MySQL 8 replica [B] to be the new primary. To downgrade, we would promote the MySQL 5.6 fallback replica [C].

This sounds simple, but MySQL only supports replication between consecutive major versions, and only from lower to higher versions. We needed to overcome both of these limitations.

To get the lay of the land, we started by trying to configure this replication topology in RDS. RDS (correctly) recognized this configuration as unsupported, but we were able to do this manually by using mysql.rds_set_external_master. After switching to row-based replication, upward replication from MySQL 5.6 to MySQL 8.0 seemed to work, but downward replication back to 5.6 did not.

We started researching and found a helpful Percona blog post on replicating from MySQL 8.0 to MySQL 5.7. MySQL 8.0 uses a new default character set and collation that are not supported by MySQL 5.6, so we updated the character set and collation parameters on our MySQL 8 instance to match the MySQL 5.6 defaults (a sketch of these settings follows).
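The exact values depend on the existing 5.6 configuration, so treat the following as an illustration rather than our actual settings (on RDS they are applied through the parameter group):

    -- MySQL 8.0 defaults to utf8mb4 / utf8mb4_0900_ai_ci, which 5.6 cannot apply.
    -- Pin the 8.0 instance to 5.6-compatible values (illustrative):
    SET GLOBAL character_set_server = 'latin1';                       -- 5.6 default
    SET GLOBAL collation_server = 'latin1_swedish_ci';                -- 5.6 default
    SET GLOBAL default_collation_for_utf8mb4 = 'utf8mb4_general_ci';  -- keep utf8mb4_0900_* out of replicated DDL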

This resolved most of our errors, but we would still intermittently see issues. After additional debugging, we realized that they were related to some RDS-specific tables used for internal bookkeeping (mysql.rds_sysinfo, mysql.rds_configuration, mysql.rds_replication_status, mysql.rds_history, and mysql.rds_heartbeat2). We weren’t able to find a way to reconfigure these RDS tables, so we wrote a script to automatically skip the offending binlog entries using this sequence of operations:

  • CALL mysql.rds_stop_replication
  • CALL mysql.rds_reset_external_master
  • CALL mysql.rds_set_external_master(…) (see the expanded call below)
  • CALL mysql.rds_start_replication
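Expanded with arguments, the re-point step looks roughly like the following; every value is a placeholder, and the binlog coordinates are whatever position sits just past the offending entry:

    CALL mysql.rds_set_external_master(
      'mysql8-replica-b.internal.example.com',  -- source host
      3306,                                     -- source port
      'repl_user',                              -- replication user
      '********',                               -- replication password
      'mysql-bin-changelog.000123',             -- binlog file to resume from
      120,                                      -- position just past the skipped entry
      0                                         -- 0 = no SSL
    );
    CALL mysql.rds_start_replication;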

Altogether, this allowed us to replicate smoothly from A, to B, to C, in near real time.

After achieving reliable replication to database C, our next task was to verify that its data was consistent with the upstream databases. To do this, we wrote another script which a) paused replication on database A, b) waited for database C to catch up, and c) ran MySQL’s CHECKSUM TABLE on each table to verify that data was successfully replicated from MySQL 8 to 5.6. This gave us confidence that if we needed to fall back to database C, there would be no data loss or corruption.
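The core of the check is MySQL’s built-in table checksum, run against each host once replication has fully caught up, with the results compared across A, B, and C (table names are illustrative):

    -- Run the same statement on each host and diff the output.
    CHECKSUM TABLE users, workspaces, applications;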

The ability to downgrade back to MySQL 5.6 if needed gave us the confidence to proceed with the upgrade to MySQL 8.

Upgrade procedure

Conceptually, the upgrade process is simple:

  • Provision a new MySQL replica running 8.0
  • Establish replication from the 5.6 primary to the 8.0 replica
  • Fail over from the old primary to the new replica

The first two steps are relatively safe. They can be performed at any time (or even rolled back) without affecting production traffic. The last step, failing over to the new primary, is where all the danger lies.

The first problem we considered was the dreaded split-brain scenario, where an error during failover leads to both MySQL instances thinking they are the primary. As the two instances accept client writes, their state will diverge, leading to data inconsistencies. Fixing a split brain is a costly manual process, so we wanted assurances that an error midway through our failover process would not lead to split brain.

One way to prevent split brain is through fencing: preventing writes to the old primary. A typical fencing method is to use MySQL’s read_only parameter to put the old primary into read-only mode before changing the new primary to read-write. As an extra level of assurance, we also used an “isolation” AWS VPC security group to block network traffic from our application, and orchestrated it in the same way as the read_only parameter: disable traffic to the old primary, enable traffic to the new primary, then perform the DNS switchover. Security group changes propagate quickly, so this did not add any additional time to our failover procedure.
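Conceptually, the fencing order looks like the following. On RDS, read_only is toggled through the parameter group rather than SET GLOBAL, and the security-group changes go through the AWS API; plain SQL is shown here for clarity:

    -- 1. Fence the old primary before anything else:
    SET GLOBAL read_only = ON;
    --    ...and block application traffic to it via the isolation security group.

    -- 2. Only then open up the new primary:
    SET GLOBAL read_only = OFF;
    --    ...allow application traffic to it, and finally flip DNS.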

Fencing by setting the old master to read-only and blocking network traffic

The second problem we considered was an error during the failover procedure leaving us with no available primary. For instance, if we fail to enable traffic to the new primary, that leaves us with no reachable database instance and the entire site would go down.

To ameliorate this concern, we developed a series of scripts to fully automate the entire upgrade procedure. This let us extensively test the upgrade in non-production environments, which helped us gain confidence. We also condensed all of the steps of the failover process into a single script. This was an important operational simplification, since it meant no manual copy-pasting of multiple commands or switching between multiple terminal windows.

As a final check, we also asked ourselves: even with all of our preparations and testing, what else could go wrong? We decided the biggest remaining risk was hitting a new, previously-unknown performance problem after going live with the upgrade. To address this, we prepared a series of techniques for shedding database load:

  • We built the ability to dynamically throttle different actions that our customers may be taking.
  • We built the ability to force specific transactions to use READ COMMITTED in case some workload unexpectedly runs into the upsert issue described above.
  • We tuned our policy for killing long-running queries (see the sketch after this list).
  • We prepared a list of heavy user workloads that could be disabled if necessary.
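For the query-killing lever in particular, here is a generic way to find and kill runaway queries; our actual policy, thresholds, and tooling aren’t described above:

    -- Find queries that have been running longer than 30 seconds (illustrative threshold):
    SELECT id, user, time, LEFT(info, 100) AS query
    FROM information_schema.processlist
    WHERE command = 'Query' AND time > 30;

    -- Kill one of them; on RDS, use the rds_kill_query procedure:
    CALL mysql.rds_kill_query(12345);  -- placeholder thread id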

Day of the upgrade

All of our preparation culminated in the day of the upgrade. It was certainly high stakes: the MySQL 8 upgrade was by far the largest big-bang infrastructure change in the history of Airtable. With the procedure set in stone in our docs, there wasn’t actually anything to do until upgrade time T, which we scheduled for after hours to minimize potential customer impact.

As an additional precaution, all the engineers participating in the upgrade started working late in the day, with the unofficial recommendation to take a nap in the afternoon. It sounds a bit silly, but we wanted the entire team rested and ready in the case of an emergency.

At time T-2 hours we started up all necessary connections and prepared for the upgrade.

At time T-1 hour we performed replication checks.

At time T-30 minutes we assembled the upgrade team on a Zoom and verified connectivity to our production environment.

At time T-15 minutes we disabled background processing services.

At time T we executed the prescribed upgrade command and waited. The tension was palpable. Early signs were positive: the site loaded, we were able to load bases, and no massive wave of errors flooded our monitoring systems. So far, so good.

After that, we started bringing up our background services. This is where we hit our first snag: one of our scheduled cron jobs was timing out. We re-disabled it and dug in. We realized that it was running some large scan queries which were normally served out of cache, but our new primary was still cold and warming up its caches. After waiting a bit, we were able to re-enable the cron job and let it resume processing.

At T+2 hours, with no further errors, we adjourned for the night. We continued to actively monitor site health over the next few days. After two weeks with no further problems, we officially declared the upgrade a success.

Conclusion

Our journey from MySQL 5.6 to 8.0 was truly a saga. It took the combined efforts of engineers from multiple teams to pull off the project. We had to overcome a number of deep technical challenges, and also develop a number of new testing and operational capabilities related to our MySQL infrastructure.

One of the big takeaways from the experience is that all of our time spent on planning, preparation, and testing really paid off. As an example, having a safe downgrade capability (even though we didn’t use it) was absolutely essential to making the go/no-go decision on the upgrade. The record/replay framework we developed is also already paying dividends, as we’ve used it to test a number of subsequent MySQL infrastructural changes. Additionally, the load-shedding techniques and runbooks we developed for the day of the upgrade have proven useful during other site incidents.

On a final note, it’s extremely gratifying that the end result matched the effort we put in. Despite all of our preparation, there’s always the possibility that something goes sideways on the day of a big upgrade. We were mentally prepared for a variety of potential outcomes, and what we got was pretty close to the best case scenario.

If these problems are interesting to you, Airtable’s Storage team is hiring! We’re looking for both Database Reliability Engineers and Software Engineers to help improve the reliability, scalability, and performance of our storage infrastructure.

Acknowledgements

In addition to the authors of this blog post, special thanks go out to Greg Leclercq, Jayden Navarro, Emmett Nicholas, Matt Steinfeld, and Keyhan Vakil who also contributed to the upgrade effort.
