TL;DR — If you are hosting your database server in GCE, then you have to be very cautious with capacity planning. The number of CPU cores has an impact on IOPS you get. Even if your workload is not CPU intensive, you might want to provision enough depending on the kind of IOPS you are going to need. For more details, read through one of the problems we faced and the RCA.
We had a master-slave setup for one of the MySQL database servers. The MySQL server’s performance was good, we didn’t notice any performance related issues initially. But we recently noticed replication lag on the slave. and we figured out that the slave thread was very slow.
Both our master-slave servers are of the same size.
- 4 Core CPU
- 16GB Memory
- 2 databases are hosted
- 20GB is the database size.
- 700GB SSD disk for Data (15000–21000 IOPS, we have very high write IOPS transactions.)
In the process of eliminating possible causes, we started with the network. We have PMM monitoring enabled on the VMs. But nothing was off here, the bandwidth consumed was very less. It can’t be network.
Then we checked the overall operating system’s performance of the Master node. We got to know that the CPU usage is 40% but more
iowait which confirms that something wrong with the disk. There was one more metric that suggests that a lot of processes are being blocked. Something seems to be off with the disk then.
The PMM shows we are using 1.5k IOPS at 80% utilization, something is missing here, 1.5k is not 80% . Then we used a utility called
atop with 1 secto show the realtime OS performance. There we got that the disk IOPS is going between 15k to 17k.
We had 700G which supports 21k IOPS, but still, it's not moving more than 15k. This made us take a look at the GCP’s disk behavior. Then we increased the disk space to 1TB which will give us 30k IOPS.
But even after that, we were running into the same issue, the IOPS is not going more than 15k. Then we checked in the GCP console’s disk section. There we can get an estimate IOPS for this disk.
This triggered another question in our mind that something is blocking, it may something related to the hard limit in GCP. Here is the documentation which explains the limitation of the IOPS per CPU core. The CPU usage was fine, but due to the IOPS, we had to increase it to 16core CPU. The database is the write-heavy database, so we need more IOPS. Even though we have less CPU consumption, we had to increase utilize the complete IOPS of the SSD disk.
If you are hosting any write-heavy application, please plan our capacity with the combination of CPU + IOPS in GCP. They are related. After this change, the IOPS is fine. There is no Lag on the slave. We already use a separate disk for binlogs and log files. But it’s an important area to keep the IOPS under control in any production database system.