How ProxySQL saved The Hotels Network?

Published in

Hotel Tech Stories

4 min readJan 26, 2023

How ProxySQL saved The Hotels Network?

A year before the COVID pandemic started, I started my adventure at The Hotels Network (THN) because they didn’t have much experience with the only MySQL Aurora cluster that was in production. I was really curious about the case, a new company that does analytics with MySQL and had the second to last largest instance offered by AWS, the 64-core, 512 Gb RAM db.r5.16xlarge with a very high CPU load, about 3TB of data, and very high response times. It was clear that there was a challenge ahead.

Without going into too much detail, we are talking about a typical software development where everything is on a single MySQL Aurora cluster with a very basic configuration: a master (primary) and a slave (secondary / replica). The primary had almost the entire load, with a 70% write radius and a remaining 30% read radius. The only replica with very little use, but just as big as the primary, constantly alternated between one and the other because every week an automatic failover of the RDS was executed due to an OOM (Out of Memory).

Typically, when you embark on these adventures, you find yourself with a lack of monitoring of the database servers to know what is going on internally. One of the curiosities when I started collecting metrics is that there were only about 1000 QPS (Queries per seconds) and a very high CPU of around 80%, and the disk throughput high but not saturated. My experience told me that the lack of indexes could not be the reason. There was no other option but to go to the next level and activate the performance schema and the slow log, regardless of the slight degradation that this implied. It was the only way to collect more details. All this while we were learning the MySQL connected ecosystem and business logic, to know how to scale with as little change as possible.

After several weeks of analysis and testing, we started to see the light between the shadows. This was the first Internet oriented company where I could observe the use of stored procedures, and on the other hand a very high write radius, a large volume of data where indexes no longer made sense. I know it sounds strange, but let me finish telling you with simple numbers the reason to appreciate the magnitude of work that was involved. Of the 1000 stored procedures that were executed per second, each of these executed an average of 10 QPS, so we have about 1000 x 10 = 10,000 QPS, and a certain number of queries that went through a 1T table returning 1 Million records to do their calculations. We are talking about a very high level, then a volume saturated by insertions both by quantity and size. To conclude this point, we are not talking about the typical queries that you usually see, but very complex ones with dozens of JOIN, Sub Queries and everything you can imagine about SQL.

When we raised these issues with the solutions to the team, everyone collaborated without hesitation, with the intention of scaling to be able to maintain the monolith until the day that the refactor is made. By moving to microservices we managed to migrate to another more optimal technology for analytics. The latter we leave for another post.

The action plan that allowed us to scale the cluster was the following:

Certainly, we had to remove all of the stored procedures from the server and embed them in the application code, this way we will be able to unlink a majority of the traffic from the Primary and be able to distribute it among the replicas using the ProxySQL, without the need to drastically change the application.
We then set up a ProxySQL cluster and redirected all applications to pass through it.
We applied a series of rules to detect certain queries and indicated which should be executed on the primary and which on the secondary. From there, we were able to have up to 6 replicas and free resources from the primary down to a db.r5.2xlarge instance, avoid OOM, and improve performance and stability.
We still had very heavy queries that returned 1 million records, so we began to apply a series of tricks that we will discuss below to reduce time and consumption without altering the final result. Among the most important: remove grouping and sorting without need, limit by date range, and limit a certain number of records in the sub queries. All of this was to improve cardinality.

The magnitude of changes in the queries that had to be made in the code was high, and the whole process lasted several weeks, but the collaboration of all the developers, the CTO, and the CEO allowed us to achieve this change and save time for new projects.

Currently, we continue to use ProxySQL in both old and new developments, by removing the complexity in the code to divide between primary and secondary. On the other hand, it is a shield that protects and facilitates the administration of the servers, how a controlled rotation of the cluster or its nodes, for example, helped us a lot when we migrated from MySQL 5.7 to the new version of MySQL 8.0 in a controlled way.

How ProxySQL saved The Hotels Network?

Published in Hotel Tech Stories

Written by Nicola Strappazzon C