Riding the Back to School Wave with ProxySQL

Mason Leung
Tech @ Quizlet
Nov 29, 2021

In the first and second installments of this mini-series, we discussed how Quizlet implemented ProxySQL in Kubernetes and re-routed all database traffic through it with no downtime. In this final installment, we share our observations from running ProxySQL through Back to School (BTS) 2021, when we experienced our highest traffic levels.

Lower Active Database Connections

Before ProxySQL, we faced the risks associated with excessively high database connection counts. Our load tests revealed that at 4,000 connections per database, both queries per second (QPS) and response time began to destabilize. The picture below shows a production database with a connection count between 700 and 1,000 in steady state. With our blue/green deployment strategy, this count can spike to 1,500 during a code deployment. The spike subsides a few minutes after the switchover, once the old pods are terminated. The situation worsens during a code rollback, when three sets of pods run concurrently.

The ProxySQL feature that helps here is connection multiplexing, which reduces the number of backend connections by reusing them across client sessions. Through BTS 2021, we saw a 78% to 90% reduction in active connections on our top five busiest databases. We also observed that the connection spikes during code deployment were not as sharp, especially on the replicas, because ProxySQL multiplexes read connections across a thread pool.
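For readers unfamiliar with the feature, multiplexing is controlled through ProxySQL's admin interface. The statements below are a generic sketch using standard ProxySQL variables and tables, not our production configuration; the username is hypothetical:

```sql
-- On the ProxySQL admin interface (default port 6032):
-- enable connection multiplexing globally.
SET mysql-multiplexing = 'true';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;

-- With transaction_persistent set (the safe default), multiplexing is
-- suspended for a session while it has an open transaction, so
-- transactional state is never split across backend connections.
UPDATE mysql_users SET transaction_persistent = 1
WHERE username = 'app_user';  -- hypothetical user
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL USERS TO DISK;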

The total number of database connections dropped sharply after the cutover was completed.

Non-impactful Rapid Ramp Up/Down of Traffic Through ProxySQL

As part of the migration and back-out plan, we implemented a global “break glass” switch and per-database dials that route anywhere from 0% to 100% of database traffic through ProxySQL. Pre-flight load tests showed that rapid traffic switching did not cause service degradation.
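To make the dial mechanism concrete, here is a minimal sketch of percentage-based routing with a global kill switch. All names and the routing logic are illustrative assumptions, not Quizlet's actual implementation:

```python
import random

# Hypothetical per-database dials: percentage of connection attempts
# that should be routed through ProxySQL (0-100).
DIALS = {"users_db": 100, "terms_db": 50}

# Global "break glass" switch: when True, all traffic bypasses
# ProxySQL and connects to the databases directly.
BREAK_GLASS = False

def route_via_proxysql(db_name: str) -> bool:
    """Decide, per connection attempt, whether to go through ProxySQL."""
    if BREAK_GLASS:
        return False
    dial = DIALS.get(db_name, 0)  # unknown databases default to direct
    return random.randrange(100) < dial

# Example: pick a connection target based on the dial decision.
host = "proxysql.svc" if route_via_proxysql("users_db") else "users-db.svc"
```

Because each connection attempt is decided independently, turning a dial ramps traffic up or down gradually rather than severing existing connections, which is consistent with the non-impactful switching we observed.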

After ProxySQL had been in production for a few days, we needed to refresh all the ProxySQL pods for a configuration change. It was a good opportunity to validate that the global switch and dials worked as expected in a rapid traffic ramp-down. We re-routed traffic for four databases away from ProxySQL using their individual dials, then used the global switch to re-route the remaining four databases simultaneously.

Once the ProxySQL pods were refreshed, we reverted the process to test rapid traffic ramp up. In neither case was our service delivery impacted.

We used the global “break glass” switch a few times during this BTS. It was important to validate switch functionality and that rapid traffic ramp up/down did not impact service delivery.

Errors and Warnings

After the cutover, we saw fewer “user aborted connection” errors on the database. This error occurs when an application disconnects without first closing its connection. With ProxySQL managing database connections, we see far fewer of these than before.

On the other hand, we noticed a new connection-attribute warning. ProxySQL sends more than 512 bytes of connection-attribute metadata to the database, but the database's buffer for these attributes defaults to 512 bytes, so the excess is truncated. This does not affect database performance.
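For a MySQL backend, the 512-byte default likely corresponds to the performance schema's connection-attribute buffer (our reading; the post does not name the variable). It can be inspected, along with a counter of truncated bytes, like so:

```sql
-- Buffer size for per-session connection attributes (defaults to 512
-- bytes on older MySQL versions; a startup option, not settable at
-- runtime).
SHOW GLOBAL VARIABLES
  LIKE 'performance_schema_session_connect_attrs_size';

-- Cumulative count of attribute bytes dropped due to truncation.
SHOW GLOBAL STATUS
  LIKE 'Performance_schema_session_connect_attrs_lost';
```

Raising the buffer in the server configuration would silence the warning, but as noted above the truncation is harmless to performance.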

Added Network Latency

The biggest question in this project was how much latency ProxySQL would add. We expected roughly 5% added latency at p50 and 15% at p99 for table access. We recorded the top 15 tables, ordered by highest access latency, over three periods:

  • Before ProxySQL went live
  • Two days after ProxySQL went live but before BTS traffic
  • A month after ProxySQL went live with high BTS traffic

We saw increases ranging from a few milliseconds at p50 to less than 30 ms at p99 in access latency. The red (latency increase) and green (latency decrease) boxes represent outliers, which were caused by changes in table access patterns or one-off jobs that skewed the numbers.

Table Latency

Conclusion

ProxySQL has performed well for us through BTS 2021. We accept the cost of running additional infrastructure and a small increase in database access latency in exchange for mitigating the serious, known risk of running out of database connections. As a bonus, ProxySQL offers new capabilities such as tagging and re-routing traffic according to query rules.

Acknowledgements

Multiple Quizlet engineers (Cooper Benson, Dan Cepeda, Roger Goldfinger, James Ilse, Tom Lancaster, Matt Lanier and Terrac Skiens) contributed to various parts of the project to make it successful. If you are interested in helping us build new infrastructure and run it at massive scale, please check out our careers page.

Posts in This Series

  1. Running ProxySQL as a Kubernetes Service.
  2. ProxySQL Migration with Zero Downtime.
  3. Riding the Back to School Wave with ProxySQL.
