Evaluating MySQL Parallel Replication Part 3, Annex: Under the Hood

There is no introduction or conclusion to this post, only landing sections: reading this post without its context will probably be very hard. You should start with the main post and come back here for more details.


Environments

As in the two previous posts (Part 1 and Part 2), we are using the same four environments. Each environment is composed of four servers:

  • A is a true production master running MySQL 5.6,
  • B is an intermediate master running MariaDB 10.0.16 without parallel replication enabled (slave_parallel_threads = 0),
  • C is an intermediate master running MariaDB 10.0.16 with slave group commit enabled (slave_parallel_threads > 1, binlog_commit_wait_count > 1 and binlog_commit_wait_usec > 1),
  • D is the slave where parallel replication tests are run with different parameters.

The slave group commit parameters on C are the following:
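The exact option values are not reproduced here, but a sketch of what these settings look like on MariaDB 10.0 is shown below; binlog_commit_wait_count = 35 matches the value quoted later in this post, while the other two values are illustrative assumptions rather than the exact test settings.

```sql
-- Sketch of the slave group commit settings on C (MariaDB 10.0).
-- binlog_commit_wait_count = 35 matches the value quoted later in this post;
-- the other two values are illustrative assumptions, not the exact test settings.
STOP SLAVE;
SET GLOBAL slave_parallel_threads   = 20;      -- > 1: enables the in-order parallel appliers
SET GLOBAL binlog_commit_wait_count = 35;      -- wait for up to 35 transactions per group...
SET GLOBAL binlog_commit_wait_usec  = 100000;  -- ...or at most 0.1 second, whichever comes first
START SLAVE;
```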

To monitor parallelism identification (group committing) on C, the binlog_commits and binlog_group_commits global statuses are gathered regularly.
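For reference, the average group commit size over a sampling interval is the ratio of the deltas of those two counters, which can be sampled with a query like this:

```sql
-- The average group commit size over a sampling interval is
--   delta(Binlog_commits) / delta(Binlog_group_commits)
-- computed from two consecutive samples of these counters:
SHOW GLOBAL STATUS WHERE Variable_name IN ('Binlog_commits', 'Binlog_group_commits');
```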

The D slave was stopped on a Thursday between 17:20 and 17:30 and a backup of the database was taken. Every test starts by recreating the filesystem that hosts the database and restoring the backup on this empty filesystem.

When stopping the database (for backup), the buffer pool was dumped using innodb_buffer_pool_dump_at_shutdown. When starting MariaDB (after backup restoration), innodb_buffer_pool_load_at_startup was used to load the buffer pool. No test was started before the buffer pool was fully loaded.
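As a sketch (not the full test configuration), the relevant settings and the status used to check that the load has finished look like this:

```sql
-- Dump the buffer pool content at shutdown so it can be reloaded after the restore.
SET GLOBAL innodb_buffer_pool_dump_at_shutdown = ON;
-- innodb_buffer_pool_load_at_startup is not dynamic: it is set in my.cnf before starting MariaDB.
-- After startup, poll this status and wait for it to report completion before starting a test:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status';
```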

Every test then started by running the D slave until Friday at 06:00 (a little more than 12 hours). This allowed further warming up of InnoDB, including giving work to purge and write threads.

Then, the D slave ran 24 hours of transactions where parallelism was identified using slave group commit on the C intermediate master. The number of commits and the group commit sizes on C for that period are shown in the graphs below for the four test environments.

Graphs # 0: Commits and Group Commit Sizes in the four Environments

The D servers have the following properties:

To monitor the commit rate on D, the COM_COMMIT global status is gathered regularly.

To prevent D from filling up its disk, relay_log_space_limit is set to 4 GB (some D servers do not have enough disk space to store all the binary logs of C).
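As a sketch, relay_log_space_limit is not a dynamic variable, so it goes in the configuration file; the commit rate is derived from regular samples of the Com_commit counter:

```sql
-- relay_log_space_limit is not dynamic, so it is set in my.cnf:
--   relay_log_space_limit = 4G
-- The commit rate on D is derived from regular samples of this counter:
SHOW GLOBAL STATUS LIKE 'Com_commit';
```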

Group Commit: Slave vs. Master

It is briefly mentioned in the main post that slave group commit identifies less parallelism than a true master would. More details are given here.

Let’s consider this execution timeline which shows a master where T5 is conflicting with one of the four previous transactions:

Using slave group commit and from the binary logs of the timeline above, parallelism can be identified in two groups as shown below.

But if we run the master with delayed commit, we could get the execution timeline below, in which T6 passed T5 to join the first commit group.

So on a master, transactions that are blocked during execution do not prevent other non-conflicting transactions from joining a commit group, but such blocked transactions do limit parallelism identification with slave group commit. This is why slave group commit is not as efficient at maximizing group size as parallel execution on a true master would be.

Results

As outlined in the main post, our tests are run in the following binary log configurations:

  • Intermediary Master (IM): binary logs and log-slave-updates enabled.
  • Slave with Binary Logs (SB): binary logs enabled and log-slave-updates disabled.
  • Standard Slave (SS): binary logs and log-slave-updates disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also described/known as relaxed durability).

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the following four tables. The times presented are in the format hours:minutes.seconds. Below the time taken to process 24 hours of transactions, the speedup achieved from the single-threaded run is presented (in bold) and when applicable, the speedup from the previous run is also presented (in normal font).
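As an illustration (a sketch, not the exact option files used in the tests), the SB-HD combination corresponds to settings like the following; the SPT value shown is just one of the tested settings.

```sql
-- Example: the SB-HD combination (binary logs on, log-slave-updates off, full durability).
-- log_bin and log_slave_updates are not dynamic: they are set in my.cnf.
SHOW GLOBAL VARIABLES WHERE Variable_name IN ('log_bin', 'log_slave_updates');
SET GLOBAL sync_binlog = 1;                      -- HD; the ND runs use 0
SET GLOBAL innodb_flush_log_at_trx_commit = 1;   -- HD; the ND runs use 2
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 20;          -- SPT, varied between runs (up to 40 in these tests)
START SLAVE;
```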

Normally, the tests are run at least twice and the shorter time is kept. This is relaxed for E1, E3, and E4 in the SS-HD and SS-ND configurations where only one run is performed: the results from E2 show that SS-HD and SS-ND are very similar to SB-HD and SB-ND respectively, and the limited results with E1, E3, and E4 are consistent with this observation. Moreover, again for E1, E3, and E4 in SS-HD and SS-ND and for the same reasons, some SPT values are skipped (the skipped values are shown as N/A in the tables below).

Table # 1: E1 Execution Times and Speedups
Table # 2: E2 Execution Times and Speedups
Table # 3: E3 Execution Times and Speedups
Table # 4: E4 Execution Times and Speedups

Graphs during Tests

If you spot something we might have missed in the graphs below, please post a comment. Those graphs include the number of commits per second, CPU stats and Read IOPS for all environments, for the Slave with Binary Logs configuration (log-slave-updates disabled), in both durability settings (high and no/relaxed).

Graphs # 1a: E1 Stats — Slave with Binary Logs — High Durability
Graphs # 1b: E1 Stats — Slave with Binary Logs — Relaxed Durability
Graphs # 2a: E2 Stats — Slave with Binary Logs — High Durability
Graphs # 2b: E2 Stats — Slave with Binary Logs — Relaxed Durability
Graphs # 3a: E3 Stats — Slave with Binary Logs — High Durability
Graphs # 3b: E3 Stats — Slave with Binary Logs — Relaxed Durability
Graphs # 4a: E4 Stats — Slave with Binary Logs — High Durability
Graphs # 4b: E4 Stats — Slave with Binary Logs — Relaxed Durability

Workloads

It is mentioned briefly in the main post that the four test environments have different workloads:

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, needing a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

We can confirm that E2 does not do much IO by looking at IOWait and Read IOPS in the graphs below (commit throughput is also good).

Graphs # 5: E2 is CPU-bound (good commit throughput, low IOWait and few Read IOPS).

E1 shows a little more IOWait and Read IOPS, but still not much, as you can see in the graphs below. Its commit throughput is lower than E2's, but still good (the transaction sizes are bigger).

Graphs # 6: E1 is also mostly CPU-bound (good commit throughput, low IOWait and few Read IOPS).

E4 is clearly an IO-bound workload: commit throughput is low, IOWait is high, and Read IOPS are high as shown in the graphs below.

Graphs # 7: E4 is IO-bound (low commit throughput, high IOWait and high Read IOPS)

E3 has a much higher commit throughput than E4 but with lots of Read IOPS, as shown in the graphs below. This puts the workload in a mixed CPU- and IO-bound category (confirmed by a high IOWait, though still lower than E4's).

Graphs # 8: E3 is mixed CPU and IO-bound (respectable commit throughput, high IOWait and high Read IOPS)

Note: the Write IOPS are not as important because they are buffered in the RAID controller write cache. The Read IOPS are important because a write operation to the database that needs an InnoDB page fetch from disk is bounded by the disk seek time.

Additional Discussions

Another Problem with Long-Running Transactions

In the main post, we discussed the impact of long-running transactions on the parallel replication pipeline (idle threads while the long transaction is running). When the number of slave threads is not smaller than the group commit size, the time taken to execute the group is bound by the time taken to run the longest transaction, as shown below.

The transactions above on the master are executed as below on the slave with SPT=4.

But if the number of slave threads is smaller than the group commit size, we get a longer run, as shown below (SPT=2):

So we lost 4 units of time in the global run by decreasing SPT. What is interesting is that there is enough time for executing both T3 and T4 while T1 executes (and while the thread running T2 is idle), so maybe it is possible to avoid this situation.

Note that MySQL 5.7 solves this problem by allowing out-of-order commit. With slave-preserve-commit-order set to 0 (the default), MySQL will commit T2 without waiting for T1 to commit. This will allow T3 to start executing while T1 is still running. However, this changes the behavior of the slave: T2 (and eventually T3 and T4) becomes visible before T1, which is not possible in single-threaded replication. Some people (including the author of this post) might be nervous about this and would prefer to set slave-preserve-commit-order to 1.
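For reference, on a MySQL 5.7 slave this commit ordering is controlled as shown below (a sketch, not the configuration tested in this post); note that enforcing commit order requires binary logging with log_slave_updates and LOGICAL_CLOCK parallelism.

```sql
-- MySQL 5.7: enforce that the slave commits in the same order as the master.
-- slave_preserve_commit_order = 1 requires binary logging, log_slave_updates,
-- and slave_parallel_type = LOGICAL_CLOCK.
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 20;          -- example value
SET GLOBAL slave_preserve_commit_order = 1;      -- default is 0: out-of-order commit allowed
START SLAVE SQL_THREAD;
```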

A future solution for managing long-running transactions and group sizes larger than SPT could be to delegate committing transactions. If the thread running T2 delegates its commit to another thread, it can start running T3 even if T2 is not yet committed. This could be an interesting optimization to implement in MariaDB or in MySQL.

(Note that the problem described above — group size larger than SPT — cannot explain the modest speedup observed in our tests as we identify parallelism with binlog_commit_wait_count = 35 and we run tests with SPT up to 40.)

Relaxed Durability Burns CPU for E1 and E2

As presented in the main post, E1 and E2 have respective speedups of ~1.06 and ~1.04 for SB-ND. However, as shown in the graphs below, increasing SPT noticeably increases CPU consumption without significantly improving the commit throughput.

Graphs # 9: E1 Stats — Slave with Binary Logs — Relaxed Durability
Graphs # 10: E2 Stats — Slave with Binary Logs — Relaxed Durability

Either we hit a synchronization bottleneck in the parallel applier or there is room for optimization: there is probably interesting work to be done here.

Would you like to be an Engineer at Booking.com? Work with us!
