Better Crash-safe replication for MySQL
At Booking.com, we use MySQL replication extensively for read scaling. However, because we have so many slaves (we have thousands), sooner or later, one of them will fail. In MySQL 5.5 and earlier, replication was not crash-safe. MySQL 5.6 provides crash-safe replication when modifying the default configuration. But this safety comes with several downsides. Read on to find out more about crash-safe replication in MySQL 5.6 and learn how to mitigate its downsides.
MySQL replication involves two tasks on a slave: the IO thread and the SQL thread:
- The IO thread downloads events from the master binary logs and appends them to the relay logs.
- The SQL thread reads the events from the relay logs and executes them on the local database.
To keep track of its progress, each thread needs to store a state:
- The IO thread keeps track of the next event to download from the master binary logs using two variables: a filename and a position (Master_Log_File and Read_Master_Log_Pos from SHOW SLAVE STATUS).
- The SQL thread keeps track of the next event to execute using four variables: the filename and position in the slave relay logs and the two corresponding values in the master binary logs (respectively Relay_Log_File, Relay_Log_Pos, Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS).
In MySQL 5.5 and earlier, those states are persisted in files (see the master-info-file and relay-log-info-file options for more information about those files). However, the IO thread and the SQL thread cannot make progress and update their state files at the same time (atomically). This means that inconsistencies can be introduced by a crash:
- If the IO thread has appended data to the relay logs but has not updated its state file when MySQL crashes, the same event is re-appended to the relay logs after restart.
- If the SQL thread has committed a transaction but has not updated its state file when MySQL crashes, the same transaction is re-run after restart.
Those two inconsistencies are the cause of the infamous “Duplicate entry for key PRIMARY” errors when restarting a slave after a crash. This error happens when running an INSERT twice. Other errors can also happen on UPDATE and DELETE statements depending on many factors, the list of which is too long and complex to give here. In the worst cases silent data corruption will occur.
To solve inconsistency #2 above, MySQL 5.6 introduces the option to persist the state of the SQL thread in the database. By setting relay-log-info-repository to TABLE (this is not the default option), the SQL thread saves its state at the same time as it commits a transaction. If the storage engine is atomic (like InnoDB), inconsistency in the state of the SQL thread is prevented.
However, it is not possible to avoid inconsistency #1. The IO thread has no way to synchronize its progress (appending events to the relay logs) and its state (saving data in a FILE or a TABLE). The consequence is that it is impossible, once a crash has occurred, to trust both the content of the relay log and the state of the IO thread.
If we cannot trust those, how can MySQL 5.6 provide crash-safe replication? The implemented solution is to:
- Initialize the state of the IO thread to the position of the SQL thread (which can be trusted when relay-log-info-repository is set to TABLE and the storage engine is atomic), and
- Advance the position of the SQL thread to the tail of the relay logs.
In other words, all the relay logs are considered un-proper for SQL thread consumption. This behaviour is not enabled by default. To benefit from it, the option relay-log-recovery must be set to 1 (the default value is 0).
In summary, to obtain crash-safe replication in MySQL 5.6, one must:
- Set relay-log-info-repository to TABLE,
- Set relay-log-recovery to 1, and
- Use only atomic storage engines.
More information about MySQL 5.6 crash-safe replication can be found in the manual.
The problems with MySQL 5.6 Crash-safe replication
So far, everything looks good, so what are the downsides? We identified a number of problems:
- When setting relay-log-recovery to 1 in the configuration file, the relay logs are considered un-proper for SQL thread consumption every time MySQL is started.
- After a restart or a crash, if the IO thread had been ahead of the SQL thread, unnecessary traffic caused by downloading binary logs will be incurred on both the master and on the network.
- After a restart or a crash, if the master is not available, no progress can be made by the SQL thread, even if there are plenty of unexecuted relay logs on the slave.
- After a restart or a crash, if critical binary logs have been purged from the master, the slave will not be able to resume replication.
- If we want to keep a copy of the relay logs on the slave by setting relay-log-purge to 0, a restart or a crash could corrupt the stored relay logs.
Problem #1 means that all restarts must be done with extra care. If a lagging slave is restarted, it can trigger problems #2, #3, #4, and #5. Having those problems after a crash could be acceptable, but having those after a simple restart is harder to accept.
Problem #2 is the one that worries us most at Booking.com. As presented in a previous post, the network load on our master is a concern for us. The idea of a master network interface being overloaded as a result of a restart/crash of a slave is hard to accept.
A special case of problem #2 is delayed replication. In the case of a restart/crash of a delayed slave the amount of the binary log that needs to be re-downloaded is proportional to the delay of the slave. We have delayed slaves that store tens of GBs of unexecuted relay logs. A restart/crash of those would overload the network interface or a master for minutes. This is unbearable: we must find something better.
Again on problem #2, the restarted/crashed slave could be remote from the master. If the bandwidth of the WAN link is limited, it can be saturated by the slave downloading binary logs from the master. This is also an undesirable effect.
Problem #3 is less of a concern for us, but still not an enjoyable proposition (it might be more of a concern for others).
Problem #4 means that before purging binary logs, a DBA must make sure that they have been executed by all slaves. This is an additional task that DBAs could do without.
Problem #5 is more subtle since it is dependent upon the preference of our DBAs to keep a copy of the relay logs on the slave (setting relay-log-purge to 0). With relay-log-recovery=1, this copy of the relay logs cannot be trusted after a restart or a crash:
- If before a restart or a crash there are unexecuted events in the relay logs, which are appended again to the relay logs after a restart, resulting in duplicate events in the relay logs (not executed in double though).
- If just before an operating system crash some events are executed by the SQL thread but not synced to the relay logs by the IO thread, those events are missing from the relay logs after the restart.
The following bugs/feature requests have been logged to trace those problems:
- Bug # 74089: Report Relay_Log_File and Relay_Log_Pos on relay-log-recovery.
- Bug # 74321: Execute relay-log-recovery only when needed.
- Bug # 74323: Avoid overloading the master NIC on crash recovery of a lagging slave.
- Bug # 74324: Make keeping relay logs (relay_log_purge=0) crash safe.
Our wishes for MySQL Crash-safe replication
Do not get us wrong. Having a way to make replication crash-safe is a great step forward since MySQL 5.5. Still, some other small steps need to be carried out to have ideal crash-safe replication. For us, ideal crash-safe replication should:
- Not take destructive measures if no crash has happened
- Minimize the number of binary logs downloaded from the master on crash recovery
- Have consistent relay logs after crash recovery when relay_log_purge=0.
Our wish #2 above can be implemented by adding four variables to the IO thread state. Those variables would indicate the last trusted (synced) position in the relay logs and the corresponding position in the master binary logs. When restarting MySQL, the relay logs would be truncated at that last synced position and the IO thread would be initialized at the corresponding position in the master binary logs. (Note that if the IO thread was stopped correctly, nothing is truncated, which satisfies our wish #1 above.) After the truncation, if the position of the SQL thread is above (or at) the truncation point (IO thread ahead of or at the same position as the SQL thread), nothing special needs to be done. If the position of the SQL thread is after the truncation point, the behaviour will depend on relay_log_purge:
- If relay_log_purge is 1, all the previous relay logs are dropped and the IO thread position is set to the SQL position.
- If relay_log_purge is 0, more work must be done to honour the DBA’s wish to keep the relay logs. To do that, the IO thread must be started UNTIL the position of the SQL thread. When the IO thread reaches this position, the SQL thread can be started after updating its corresponding relay log position. This satisfies our wish #3 above.
Until ideal crash-safe replication is implemented in a future version of MySQL, we need to mitigate the problems caused by the current implementation (for our deployment at Booking.com, the biggest problem is unnecessarily downloading binary logs from the master). The solution we are currently investigating is to deploy a Binlog Server on our slaves:
- The Binlog Server downloads binary logs from the master
- The slave downloads the binary logs from the local Binlog Server
- If the slave is restarted (after a crash or a planned restart), it will not put undesired load on the master because the binary logs will be downloaded locally
This way, we avoid the downsides of MySQL relay log recovery by pushing the problem to the Binlog Server. And because making the Binlog Server crash-safe is trivial, we successfully work-around the downsides of MySQL relay log recovery.
But if we are not careful with this approach, we could end up consuming twice the disk space that is necessary by storing binary logs in the Binlog Server directory, and storing the relay logs in the MySQL directory.
The relay-log-space-limit option can be used to solve this problem. By setting this option to twice the max-relay-log-size (as recommended in the documentation), we put an upper limit on the extra space that can be used. But when using this option, care must be taken to set it right as changing the value will necessitate restarting MySQL . There is a feature request (bug #68236) to make the relay-log-space-limit option dynamic. Hopefully this will be implemented soon.
As promised in our last post, we presented another case for using the Binlog Server. Hopefully, this will be a temporary use that will become obsolete. When a better relay log recovery is implemented in MySQL, we will not need to deploy additional software on our MySQL servers.
We have yet another instance where using the Binlog Server is beneficial. I will present this at Percona Live London in my talk: High Availability, Disaster Recovery and Extreme Read Scaling using Binlog Servers.
: When restarting MySQL to change the option, care must be taken not to restart MySQL with relay-log-recovery=1 as this could cause the unnecessary re-downloading of binary logs. With a local Binlog Server, this becomes less of a concern.
Would you like to be an Engineer at Booking.com? Work with us!