Evaluating MySQL Parallel Replication Part 3: Benchmarks in Production

Parallel replication is a much-expected feature of MySQL. It is available in MariaDB 10.0 and in MySQL 5.7. In this 3rd post of the series, we present benchmark results from Booking.com production environments.

Photo by Master Wen on Unsplash

Note: this post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.

Parallel replication is on its way and with it comes the hope that more transactions can be run on a master without introducing slave lag. But it remains to be seen whether this dream will come true — will parallel replication hold its promise or will there be surprises after its deployment? We would like to know whether or not we can count on that feature in the future.

To get answers, nothing is better than experimenting with the technology. Benchmark results have already been published (MariaDB 10.0: 10 times improvement; and MySQL 5.7: 3 to 6 times improvement) but results might be different in our production environments. The best test would be to run parallel replication on our real workloads but this is not trivial. To be able to run transactions in parallel, a slave needs parallelism information from the master (for more details, see Part 1). With a master in an older version (MySQL 5.6 in our case), slaves do not have this information.

Luckily, we can use slave group commit on an intermediate master to identify transactions that can be run in parallel. The trick is to execute transactions sequentially on a slave, but to delay their commit. While the commit is delayed, the next transaction is started. If the two transactions are non-conflicting, the second could complete and the two transactions would commit together. In this scenario, grouping has succeeded (the two transactions committed in a single group) and parallelism is identified (as they commit together, the transactions are non-conflicting, thus can be run in parallel on slaves). For more details on slave group commit, see Part 2.

Our benchmarks are established within four production environments: E1, E2, E3, and E4. Each of these environments is composed of one production master, with some intermediate masters, and a leaf slave. A complete description of those environments can be found in the annex.

Slave group commit identifies less parallelism than a parallel execution on a master would identify (details in the Group Commit: Slave vs. Master section of the annex). Even so, decent group sizes are obtained, as shown in the group commit size graphs in the annex. Usually we have group sizes of at least 5, most of the time they are larger than 10, and sometimes they are as big as 15 or even 20. These will allow us to test parallel replication with real production workload.

Before closing this long introduction, let’s talk a little about our expectations. InnoDB is getting better at using many cores (RO and RW benchmark results) but single-threaded replication gets in the way of pushing more writes in a replicating environment. Without parallel replication, a single core can be used on a master to perform writes without incurring slave lag. This is disappointing as servers come with 12 and more cores (only one can be used for writes). Ideally, we would like to use a significant percentage of the cores of a server for writes (25% could be a good start). So speedups of 3 would be good results at this point (12 cores), and speedups of 6 and 10 would be needed in the near future (24 and 40 cores).

The Test: Catching Up with 24 Hours of Transactions

Our benchmark scenario is as follows: after restoring a backup of the database, starting MariaDB, waiting for the buffer pool to be loaded, and running the slave for at least 12 hours, we measure the length of time it has taken to process 24 hours of production transactions.

The tests are run in the following binary log configurations:

  • Intermediary Master (IM): both binary logs and log-slave-updates are enabled.
  • Slave with Binary Logs (SB): binary logs are enabled but log-slave-updates is disabled.
  • Standard Slave (SS): both binary logs and log-slave-updates are disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also described/known as relaxed durability).

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the annex and the most interesting results are presented below (SB-HD and SB-ND). The times presented are in the format hours:minutes.seconds. Below the time taken to process 24-hours of transactions, the speedup achieved from the single-threaded run is presented in bold.

Execution Times and Speedups for Slave with Binary Logs

In the Graph during Tests section of the annex, you can find many details about the different test runs.

Discussion

There are lots of things to say about those results. Let’s start with observations that are not related to parallel replication (SPT=0):

  • Obs1: Standard Slave results (without binary logs) are very close to the results of Slave with Binary Logs (without log-slave-updates) (details in the annex).
  • Obs2: log-slave-updates has visible cost for E1 (time difference between IM-HD and SB-HD), a less obvious but still noticeable cost for E2 and E3, and it is a win for E4 (that last one is disturbing, the numbers are in the annex).
  • Obs3: relaxing durability is a huge win for E1 and E2, a more limited win for E3, and a much smaller one for E4.

With reference to Obs1 above, this shows that binary logs should probably not be disabled on slave: the cost is almost inexistent and the wins are big (tracing errant transactions and being a candidate for master promotion). However, slaves with log-slave-updates are slower than slaves with only binary logs enabled (Obs2 above), so log-slave-updates should be avoided when possible. Binlog Servers can be used to replace log-slave-updates for intermediate masters, see MySQL Slave Scaling for more details (see also Better Parallel Replication for MySQL for an explanation why log-slave-updates is bad on intermediate masters for parallel replication).

With reference to Obs3 above, this can be explained by the different workload of the four environments (more details about the workloads can be found in the annex):

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, so it needs a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

Relaxing durability on CPU-bound workloads achieves good throughput improvements, but this does not happen on IO-bound workloads.

Now, let’s focus on parallel replication. The Standard Slave results (SS-HD and SS-ND) are not worth discussing as they are very close to the Slave with Binary Logs results (Obs1 above). We will also not discuss Intermediate Master results (IM-HD and IM-ND) as they should be replaced by Binlog Servers. So all observations below are made on the results of Slave with Binary Logs (SB-HD and SB-ND):

  • Obs4: the best speedup (~2.10) is in E2 with high durability. E1 follows with a speedup of ~1.56 (always with high durability).
  • Obs5: the speedups for E4 are modest (~1.29) and the results are almost identical for both durability settings.
  • Obs6: for E1 and E2, the speedups with no durability are almost non-existent (less than 1.10).
  • Obs7: for both E1 and E2, relaxing durability with single-threaded replication leads to faster execution than enabling parallel replication.
  • Obs8: the results for E3 are halfway between E1/E2 and E4: both SB-HD and SB-ND get some modest speedups from parallel replication (like E4 and opposite to E1/E2) and relaxing durability makes things run a little faster (like E1/E2 and opposite to E4), but not to the point where single-threaded low durability is faster than multi-threaded high durability.

All those observations point to the importance of the workload in parallel replication speedups:

  • CPU-bound workloads seem to get modest speedups in high-durability configurations.
  • Relaxing durability for CPU-bound workloads looks like a better option than enabling parallel replication on a high-durability configuration.
  • IO-bound workloads get more limited speedups.

Our first reaction is disappointment: the speedups are not as high as expected. Don’t get us wrong: faster is always better, especially when the only thing to do is to upgrade the software, which we will do anyway eventually. However, having only 25% more writes on a master (or 110% depending on which environment we look at) will not help us in the long term. Parallel replication is not the solution (at least not the only solution) that will allow us to stop/avoid sharding.

Ideas and Future Work

We have a hypothesis explaining the modest speedups: long-running transactions. In the presence of long-running transactions, the parallel replication pipeline on the slave stalls. Let’s take the following six transactions committing on the master in two commit groups (B for begin and C for commit):

--------Time-------->
T1: B-----------C
T2: B--C
T3: B--C
T4: B--C
T5: B--C
T6: B--C

Running those transactions on a single-threaded slave takes 33 units of time (time scale is at the bottom):

----------------Time---------------->
T1: B-----------C
T2: B--C
T3: B--C
T4: B--C
T5: B--C
T6: B--C
1 2 3
123456789012345678901234567890123

Running those transactions on a multi-threaded slave with SPT=4 takes 17 units of time:

---------Time-------->
T1: B-----------C
T2: B-- . . . . C
T3: B-- . . . . C
T4: B-- . . . . C
T5: B--C
T6: B--C
1
12345678901234567

So we barely achieve a speedup of 2 (and the second commit group does not even contain a large transaction). The low speedup is explained by T1 being much bigger than the other transactions in the group. So our intuition is that to get better speedup with parallel replication, all transactions should be of similar size, and bigger transactions should be broken down into smaller ones (when possible).

We have many of those big transactions in our workload at Booking.com. Most of our design choices predate MySQL 5.6, where a commit was expensive. Reducing the number of commits was a good optimization at that time, so doing many changes in a single transaction was a good thing. Now, with binary log group commit, this optimization is less useful but does not harm. However, this optimization is very bad for parallel replication.

There are at least two other things to discuss from those results but this post is already too long, so you will have to go in the annex to read the Additional Discussions.

Conclusion

It is possible to test parallel replication with true production workload, even if the master is running an old version of MySQL. Thanks to slave group commit in MariaDB 10.0, we can identify parallelism on intermediate master and enable parallel replication on a slave. Even if this parallelism identification is not as good as it would be on a master, we get decent group sizes.

Our CPU-bound workloads are getting speedups of ~1.56 to ~2.10 with high-durability constraints. This is a little disappointing: we would like to have more than two cores busy applying writes on slaves. Our guess is that better speedup could be obtained by hunting down large transactions, but that still needs to be verified. At this point, and for this type of workload, our tests show that relaxing durability is a better optimization than enabling parallel replication. Finally, with relaxed durability, parallel replication shows almost no improvement (4% to 6% improvement), and it is still unknown if hunting down large transactions and splitting them would result in better speedups.

Our IO-bound workloads are getting speedups of ~1.23 to ~1.29, which is also disappointing but expected because it is hard to fight against seek time of magnetic disks. In this type of workload, relaxed durability setting benefits from parallel replication. However, at high enough parallelism on the slave and for this type of workload, relaxing durability is not very beneficial. It is hard to tell what types of improvement would come from hunting down large transactions for this type of workload.

The next step on parallel replication evaluation would be to try optimistic parallel replication. This will make a good fourth part in the series.


Would you like to be an Engineer at Booking.com? Work with us!