ProxySQL Migration with Zero Downtime
Previously, we described how we set up ProxySQL as a Kubernetes service. In this second installment, we discuss the steps taken to complete the migration to ProxySQL-handled database connections without incurring any downtime.
Data reliability is vital to Quizlet. Any operations that alter the flow of data between applications and databases must be handled with care. We put a lot of effort into building safety harnesses and designing a back-out plan to prevent downtime during the ProxySQL migration.
New Database Configuration
The web application consumes a JSON file for database configuration. The first step is to replicate this file, modifying the host IP so connections go to ProxySQL instead. Note that the ipaddress value in the JSON file is not an actual IP address; it is a cluster IP referenced through the Kubernetes resource namespace.
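A minimal sketch of what the ProxySQL variant of such a file could look like is below. Only the ipaddress key is taken from the text; the database name, user, namespace, and port (6033 is ProxySQL's default MySQL port) are illustrative assumptions:

```json
{
  "main_db": {
    "ipaddress": "proxysql.proxysql.svc.cluster.local",
    "port": 6033,
    "user": "app_user"
  }
}
```

Because the host is a Kubernetes service name rather than a literal IP, the application keeps working even when ProxySQL pods are rescheduled and their pod IPs change.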
Global Break Glass Switch and Dials
The second step is to add application logic that picks either the original database configuration or the one that uses ProxySQL. This “break glass” switch is our way to back out of ProxySQL completely. Additionally, there is a separate dial for each database, which lets us adjust the percentage of traffic flowing through ProxySQL from 0 to 100.
Load Test with ProxySQL
Load testing was the most time-consuming part, but it was a good investment. First, it allowed us to thoroughly test the new database configuration file as well as the global break-glass switch and dial logic. Most importantly, it gave us an idea of the impact (in particular, network latency) to expect from adding ProxySQL. We performed two types of load tests before the migration.
Test Types
Warm Up URLs: These are the addresses of site pages we use for the Kubernetes pod readiness check. Getting a successful response verifies that a pod can successfully access the OLTP and cache layers.
Synthetic load: We profiled the production workload on a database and regenerated the queries with synthesized parameters.
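One way to regenerate queries with synthesized parameters is sketched below. The templates, weights, and parameter ranges are hypothetical stand-ins for what production profiling would produce:

```python
import random

# (query template, relative weight observed in the production profile)
QUERY_TEMPLATES = [
    ("SELECT * FROM sets WHERE id = {id}", 70),
    ("SELECT * FROM terms WHERE set_id = {id} LIMIT {limit}", 30),
]

def synthesize_query():
    """Pick a template proportionally to its production frequency and
    fill it with randomly synthesized parameters."""
    templates, weights = zip(*QUERY_TEMPLATES)
    template = random.choices(templates, weights=weights, k=1)[0]
    return template.format(id=random.randint(1, 1_000_000),
                           limit=random.choice([10, 50, 100]))
```

Replaying such queries in a loop exercises the same query shapes as production without replaying real user data.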
ProxySQL MySQL Threads Variable
In the first load test, we noticed that requests per second (RPS) and response time were not stable (graph below). RPS rose, stabilized for a few minutes, and then dropped off. The same cyclic pattern was observed in the response time. This was caused by a mismatch between the ProxySQL mysql-threads variable and the number of CPU cores assigned to a ProxySQL pod.
Once the mysql-threads variable was corrected, RPS and response time stabilized in subsequent load tests.
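For reference, a fix along these lines can be applied through the ProxySQL admin interface; the value 4 here is an assumed CPU allocation, and in Kubernetes it should match the pod's CPU request/limit:

```sql
-- Run against the ProxySQL admin interface (port 6032 by default).
UPDATE global_variables
   SET variable_value = '4'
 WHERE variable_name = 'mysql-threads';

SAVE MYSQL VARIABLES TO DISK;
-- mysql-threads cannot be changed at runtime; ProxySQL must be restarted
-- (e.g. by recycling the pod) for the new value to take effect.
```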
Additional Network Latency
With ProxySQL as the middleware, we expected network latency to go up.
Interestingly, we saw a 4.3% (p50) and 15% (p99) increase in network latency when calling warm up URLs, but not in the synthetic load test. One explanation is that a warm up URL page makes multiple calls to databases, while a synthetic load test does not: the test unit of the synthetic load is a single query, whereas the warm up URL's is a whole page. Using a page as the test unit lets us see the aggregated effect.
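The aggregation effect amounts to simple arithmetic: a fixed per-query overhead is paid once per query, so a page that issues many sequential queries multiplies it. The numbers below are illustrative, not Quizlet's measurements:

```python
def page_overhead_ms(queries_per_page, per_query_overhead_ms):
    """Added page latency when each query pays a fixed proxy overhead
    (assuming the queries run sequentially)."""
    return queries_per_page * per_query_overhead_ms

# A single synthetic query barely notices a 0.2 ms extra hop...
single_query = page_overhead_ms(1, 0.2)
# ...but a page issuing 20 sequential queries pays that hop 20 times.
whole_page = page_overhead_ms(20, 0.2)
```

This is why a per-query benchmark can report negligible overhead while a per-page benchmark shows a measurable latency increase.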
In the next part of the series, we will discuss our findings from running ProxySQL through Back to School (BTS) 2021.