How We Partitioned Airbnb’s Main Database in Two Weeks
by Willie Yao
“Scaling = replacing all components of a car while driving it at 100mph” – Mike Krieger, Instagram Co-founder @ Airbnb OpenAir 2015
Heading into the 2015 summer travel season, the infrastructure team at Airbnb was hard at work scaling our databases to handle the expected record summer traffic. One particularly impactful project aimed to partition certain tables by application function onto their own database, which typically would require a significant engineering investment in the form of application layer changes, data migration, and robust testing to guarantee data consistency with minimal downtime. In an attempt to save weeks of engineering time, one of our brilliant engineers proposed the intriguing idea of leveraging MySQL replication to do the hard part of guaranteeing data consistency. (This idea is independently listed an explicit use cases of Amazon RDS’s “Read Replica Promotion” functionality.) By tolerating a brief and limited downtime during the database promotion, we were able to perform this operation without writing a single line of bookkeeping or migration code. In this blog post, we will share some of our work and what we learned in the process.
First, some context
We tend to agree with our friends at Asana and Percona that horizontal sharding is bitter medicine, and so we prefer vertical partitions by application function for spreading load and isolating failures. For instance, we have dedicated databases, each running on its own dedicated RDS instance, that map one-to-one to our independent Java and Rails services. However for historical reasons, much of our core application data still live in the original database from when Airbnb was a single monolithic Rails app.
Using a client side query profiler that we built in-house (it’s client side due to the limitations of RDS) to analyze our database access pattern, we discovered that Airbnb’s message inbox feature, which allows guests and hosts to communicate, accounted for nearly 1/3 of the writes on our main database. Furthermore, this write pattern grows linearly with traffic, so partitioning it out would be a particularly big win for the stability of our main database. Since it is an independent application function, we were also confident that all cross-table joins and transactions could be eliminated, so we began prioritizing this project.
In examining our options for this project, two realities influenced our decision making. First, the last time we partitioned a database was three years ago in 2012, so pursuing this operation at our current size was a new challenge for us and we were open to minimizing engineering complexity at the expense of planned downtime. Second, as we entered 2015 with around 130 software engineers, our teams were spread across a large surface area of products–ranging from personalized search, customer service tools, trust and safety, global payments, to reliable mobile apps that assume limited connectivity–leaving only a small fraction of engineering dedicated to infrastructure. With these considerations in mind, we opted to make use of MySQL replication in order to minimize the engineering complexity and investment needed.
The decision to use MySQL’s built-in replication to migrate the data for us meant that we no longer had to build the most challenging pieces to guarantee data consistency ourselves as replication was a proven quantity. We run MySQL on Amazon RDS, so creating new read replicas and promoting a replica to a standalone master is easy. Our setup resembled the following:
We created a new replica (message-master) from our main master database that would serve as the new independent master after its promotion. We then attached a second-tier replica (message-replica) that would serve as the message-master’s replica. The catch is that the promotion process can take several minutes or longer to complete, during which time we have to intentionally fail writes to the relevant tables to maintain data consistency. Given that a site-wide downtime from an overwhelmed database would be much more costly than a localized and controlled message inbox downtime, the team was willing to make this tradeoff to cut weeks of development time. It is worth mentioning that for those who run their own database, replication filters could be used to avoid replicating unrelated tables and potentially reduce the promotion period.
Phase one: preplanning
Moving message inbox tables to a new database could render existing queries with cross-table joins invalid after the migration. Because a database promotion cannot be reverted, the success of this operation depended on our ability to identify all such cases and deprecate them or replace them with in-app joins. Fortunately, our internal query analyzer allowed us to easily identify such queries for most of our main services, and we were able to revoke relevant database permission grants for the remaining services to gain full coverage. One of the architectural tenets that we are working towards at Airbnb is that services should own their own data, which would have greatly simplified the work here. While technically straightforward, this was the most time consuming phase of the project as it required a well-communicated cross-team effort.
Next, we have a very extensive data pipeline that powers both offline data analytics and downstream production services. So the next step in the preplanning was to move all of our relevant pipelines to consume the data exports of message-replica to ensure that we consume the newest data after the promotion. One side effect of our migration plan was that the new database would have the same name as our existing database (not to be confused with the name of our RDS instances, e.g. message-master and message-replica) even though the data will diverge after the promotion. However, this actually allowed us to keep our naming convention consistent in our data pipelines, so we opted not to pursue a database rename.
Lastly, because our main Airbnb Rails app held exclusive write access to these tables, we were able to swap all relevant service traffic to the new message database replica to reduce the complexity of the main operation.
Phase two: the operation
Once all the preplanning work was done, the actual operation was performed as follows:
- Communicate the planned sub-10 minute message inbox downtime with our customer service team. We are very sensitive to the fact that any downtime could leave guests stranded in a foreign country as they try to check-in to their Airbnb, so it was important to keep all relevant functions in the loop and perform the op during the lowest weekly traffic.
- Deploy change for message inbox queries to use the new message database user grants and database connections. At this stage, we still point the writes to the main master while reads go to the message replica, and so this should have no outward impact yet. However we delay this step until the op began because it doubles the connection to main master, so we want this stage to be as brief as possible. Swapping the database host in the next step does not require a deploy as we have configuration tools to update the database host entries in Zookeeper, where they can be discovered by SmartStack.
- Swap all message inbox write traffic to the message master. Because it has not been promoted yet, all writes on the new master fail and we start clocking our downtime. While reads queries will succeed, in practice nearly all of messaging is down during this phase because marking a message as read requires a db write.
- Kill all database connections on the main master with the message database user introduced in step 2. By killing connections directly, as opposed to doing a deploy or cluster restart, we minimize the time it takes to move all writes to the replica that will serve as the new master, a prerequisite for replication to catch up.
- Verify that replication has caught up by inspecting: a) The newest entries in all the message inbox tables on message master and message replica. b) All message connections on the main master are gone. c) New connections on the message master are made
- Promote message master. From our experience, the database is completely down for about 30 seconds during a promotion on RDS and in this time reads on the master fail. However, writes will fail for nearly 4 minutes as it takes about 3.5 minutes before the promotion kicks in after it is initiated.
- Enable Multi-AZ deployment on the newly-promoted message master before the next RDS automated backup window. In addition to improved failover support, Multi-AZ minimizes latency spikes during RDS snapshots and backups.
- Once all the metrics look good and databases stable, drop irrelevant tables on the respective databases. This wrap-up step is important to ensure that no service consumes stale data.
Should the op have failed, we would have reverted the database host entries in Zookeeper and the message inbox functionality would have been restored almost immediately. However, we would have lost any writes that made it to the now-independent message databases. Theoretically it would be possible to backfill to restore the lost messages, but it would be a nontrivial endeavor and confusing for our users. Thus, we robustly tested each of the above steps before pursing the op.
End-to-end, this project took about two weeks to complete and incurred just under 7 1/2 minutes of message inbox downtime and reduced the size of our main database by 20%. Most significantly, this project brought us significant database stability gains by reducing the write queries on our main master database by 33%. These offloaded queries were projected to grow by another 50% in coming months, which would certainly have overwhelmed our main database, so this project bought us valuable time to pursue longer-term database stability and scalability investments.
One surprise: RDS snapshots can significantly elevate latency
According to the RDS documentation:
Unlike Single-AZ deployments, I/O activity is not suspended on your primary during backup for Multi-AZ deployments for the MySQL, Oracle, and PostgreSQL engines, because the backup is taken from the standby. However, note that you may still experience elevated latencies for a few minutes during backups for Multi-AZ deployments.
We generally have Multi-AZ deployment enabled on all master instances of RDS to take full advantage of RDS’s high availability and failover support. During this project, we observed that given a sufficiently heavy database load, the latency experienced during an RDS snapshot even with Multi-AZ deployment can be significant enough to create a backlog of our queries and bring down our database. We were always aware that snapshots lead to increased latency, but prior to this project we had not been aware of the possibility of full downtime from nonlinear increases in latency relative to database load.
This is significant given that RDS snapshots is a core RDS functionality that we depend on for daily automated backups. Previous unbeknownst to us, as the load on our main database increases, so did the likelihood of RDS snapshots causing site instability. Thus in pursuing this project, we realized that it had been more urgent than we initially anticipated.
Acknowledgements: Xinyao Hu led the project while I wrote the initial plan with guidance from Ben Hughes and Sonic Wang. Brian Morearty and Eric Levine helped refactor the code to eliminate cross-table joins. The Production Infrastructure team enjoyed a fun afternoon running the operation.
Checkout more past projects from the Production Infrastructure team:
Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData
Originally published at nerds.airbnb.com on October 6, 2015.