Expedia Group Technology — Software

Switching Production DB Seamlessly from RDS to Aurora MySQL with TBs of Data

How & why we switched our production database to AWS Aurora

Rinky Goyal
Expedia Group Technology

--

real light bulb on chalk board with lines emanating from it to ellipses
Photo by Pixabay on Pexels

Trips Platform is the backend stack that powers the shiny post booking trips experience on Expedia® for both mobile app and website. This system processes all the bookings made on Expedia. As you can imagine, it processes numerous bookings every day. The platform is read & write heavy where reads are more than double the number of writes. We were using 4 Amazon RDS MySQL databases. We ran a spike and found Amazon Aurora to be better suited to our needs in many aspects.

Why Switch to AWS Aurora?

rocket blasting through sky
Photo by Bill Jelen on Unsplash

Improved Performance

Amazon Aurora provides 5x the throughput of standard MySQL

Improved Replication

With Aurora, you can provision up to 15 replicas, and replication is performed in milliseconds. By contrast, RDS allows only 5 replicas and the replication process is slower.

High Availability

Amazon Aurora automatically maintains 6 copies of data across 3 Availability Zones (AZs) and will automatically attempt to recover the database in a healthy AZ with no data loss.

These are just a few of the features provided by Aurora. More details can be found here.

Improvements & Impacts after Moving to Aurora

Performance

  1. On some DB queries, we saw 3–5x improvements in DB response time. In certain cases with RDS, there were spikes ranging in seconds which no longer happen with Aurora. Most calls are under ~200ms (Avg TP99) when moved to Aurora.
Image showing that on some DB queries, we saw 3–5x improvements in DB response time. In certain cases with RDS, there were spikes ranging in seconds which no longer happen with Aurora. Most calls are now under ~200ms (Avg TP99) when moved to Aurora.
Performance Improvement after the switch to Aurora

2. When wrapping a stored procedure in a transaction, many deadlocks & high response times (TP99 of up to 30 seconds & TP95 of up to 4 seconds) were seen on RDS. On Aurora, the same change had little to no impact on deadlocks & DB response time of TP99 stayed at under 1 second & TP95 under 50ms.

Image showing that when wrapping a stored procedure in a transaction, many deadlocks & high response times (TP99 of up to 30 seconds & TP95 of up to 4 seconds) were seen on RDS. On Aurora, the same change had little to no impact on deadlocks & DB response time of TP99 stayed at under 1 second & TP95 under 50ms.
TP95 & TP99 difference between RDS & MySQL Aurora

Cost

Tall stack of currency coins
Photo by Pixabay on Pexels
  1. IOPS/Storage
    Provisioned IOPS in RDS cost way more than on demand costing in Aurora. This helped us save 5x the spend on RDS based on our current usage.
  2. Instance Cost
    Even after over provisioning instances in Aurora, we’re saving 25% of the overall cost spent on RDS with better performance, throughput and resiliency

Do all these enhancements sound interesting to you? If yes, please go on reading further down this article to know how we did this.

Traveler with suitcase stepping into open country side
Photo by Pixabay from Pexels

How did we Achieve this?

Existing DB statistics

  • 4 MySQL RDS databases with provisioned IOPS
  • Total data to migrate was 1+ TB

Stress testing

Before going ahead with the effort of switching databases, we wanted to validate the expected performance improvements. We did stress testing and indeed saw performance improvements with mean response time dropping by more than 50%.

Flyway

Our team uses Flyway to perform & manage DB changes (like updating or adding a stored procedure, editing table schemas, etc.) to enable us to have finer control & knowledge over all changes going into the DB. We already had all our database scripts for schema creation and stored procedures in these Flyway SQL scripts which allowed us to easily recreate our DB schema in an Aurora cluster.

AWS Database Migration Service

  • AWS DMS allows to transfer data between almost any combination of databases. We used this to migrate TBs worth of data from RDS to a new Aurora cluster. DMS supports full load i.e. to copy existing data to newer DB plus it also allows replicating on-going incremental changes. This was very useful since this allowed us to have minimal production downtime since the Aurora cluster had the latest changes replicating in real time. The DMS replication lag was of the order of few seconds.
  • Another useful thing in DMS is that it automatically validates the data between two data sources. This was very helpful since it helped us track down few issues like data encoding issues as encoding was set-up differently between our two databases.
  • There are a few tables created by DMS tasks which help tracking & debugging any errors/failures that might happen while migrating data.
  • Sample table stats via DMS to monitor are shown below:
Image showing sample table stats via DMS which can be monitored to know the progress of migration
Table statistics from DMS
  • DMS provides detailed dashboards to see if something is going wrong like if replication latency is high etc. We used it to our advantage and were able to uncover & fix a few hiccups.
Image showing DMS dashboard with various graphs like CDC latency at source & target, CDC incoming changes
DMS dashboard

Additional automated validation testing

To give us even more confidence that data was correctly transferred, we also added validation testing which automatically compared data between our old RDS instance and new Aurora Cluster.

Monitoring dashboards & alerts

Prior to starting the actual DB migration, we added granular Grafana metrics and dashboards to give us a detailed picture of our apps and databases and also added alerts if something goes wrong.

What if Something Went Wrong When Switching Traffic in Production?​

Things go wrong when we least expect them to

  • We had a rollback strategy in place in case things went wrong. Since this platform is a Kafka based event driven system, the easiest way to rollback was to do a Kafka offset reset for the writer apps in the ecosystem.
  • AWS DMS provides good features like bidirectional replication to make the rollback a bit easier. We decided not to use it since it had some limitations & it didn’t serve our use case. We highly recommend evaluating this and other strategies listed here when preparing for rollback.

Minimal production down-time

We use Kubernetes to manage our deployments in different environments. We deployed a parallel stack running on Aurora cluster but not taking any live traffic. This allowed us to:

  • Quickly switch read traffic to the new stack and minimise downtime for the reads/writes during the maintenance window.
  • Reduce the amount of manual steps required to proceed with the maintenance.
  • Simplify the rollback step in case of issue by switching the traffic back to the RDS based deployment.

The Big Day: Cutover Production Traffic to New Aurora Cluster​

  • Stopped all write traffic. Waited for a minute for replication lag of DMS to come down to 0.
  • Switched all relevant apps (including readers) to Aurora flavored apps by editing Kubernetes service definition of those apps & changing app selector. Read more about label selectors here

Monitored via Grafana dashboards & Linkerd to make sure things look good

  • Start writer apps.
  • Monitored more & all was green!!

Caveats

  1. Some settings need to be enabled on existing databases to allow DMS to transfer data like enabling binary logging etc.
  2. Aurora supports MySQL 5.7 and we had MySQL 8.x running on RDS so we had to tweak some of our stored procedures to exclude functions only available in higher versions.
  • Json_table was replaced with json_extract to be MySQL 5.7 compatible.
  • Json_arrayagg was replaced with group_concat. Along with this change, group_concat_max_len had to be bumped according to our needs from default value of 1KB since this gets reached pretty easily when using group_concat.

Key Learnings & Improvements

Work area with laptop, notebook, pencils, paper clips, and plant
Photo by Monica Sauro on Unsplash

I’m still learning. — Michelangelo

After going through the migration, there are a few key learnings and improvements that were made which are worthy of being looked at.

  • Additional index creation to help with dev analytical queries since we couldn’t do that on already running production DB
  • Automation testing to increase confidence in migration
  • SQL scripts in place to create DB schema from scratch in case such a migration is needed
  • Improved availability/resiliency with native Aurora failover support
  • Looking beyond the MySQL version Aurora supports and considering Aurora’s underlying technology and its pros and cons

With this switch the platform database has more capacity, is better performing, resilient & scalable for less cost!

Huge shoutout to my colleagues Patrick Blanchette, Rolland Mewanou who were instrumental in planning, building and helping drive this project through the finish line with me.

Learn more about technology at Expedia Group

Amazon Web Services, the “Powered by AWS” logo, RDS, Aurora, and DMS are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.

--

--