MySQL Slave Scaling (and more)
At Booking.com, we have very wide replication topologies. It is not uncommon to have more than fifty (and sometimes more than a hundred) slaves replicating from the same master. When reaching this number of slaves, one must be careful not to saturate the network interface of the master. A solution exists but it has its weaknesses. We came up with an alternative approach that better fits our needs: the Binlog Server. We think that the Binlog Server can also be used to simplify disaster recovery and to ease promoting a slave as a new master after failure. Read on for more details.
When many slaves replicate from the same master, serving binary logs can saturate the network interface of the master, as every change is requested by every slave. It is not unusual to have changes that generate lots of binary logs. Two examples are:
- deleting lots of records in a single statement when using row-based replication
- executing an online schema change on a large table
In the replication topology described in Figure # 1, producing one megabyte of binary logs per second on M will generate replication traffic of a hundred megabytes per second if we deploy a hundred slaves. This is very close to the limit of a 1 Gbit/s network interface (about 125 MB/s before protocol overhead), and it is something we see happening in our replication chains.
The traditional solution to this problem is to place intermediate masters between M and its slaves. In the deployment of Figure # 2, instead of having slaves directly connected to M, we have intermediate masters replicating from M with slaves replicating from each intermediate master. When having 100 slaves and ten intermediate masters, this allows producing ten times more binary logs on M before saturating the network interface (ten MB/s instead of one MB/s).
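The arithmetic behind these numbers can be sketched as follows. This is a rough model, not a measurement: it assumes that replication is the only traffic on the interface, that 1 Gbit/s is about 125 MB/s, and that protocol overhead can be ignored. The function names are ours.

```python
# Rough model: every server connected directly to the master downloads
# every byte of binary log the master produces.
NIC_MB_PER_S = 1000 / 8  # 1 Gbit/s expressed in MB/s, overhead ignored


def master_egress(binlog_mb_per_s, direct_clients):
    """Replication traffic leaving the master, in MB/s."""
    return binlog_mb_per_s * direct_clients


def max_binlog_rate(direct_clients, nic_mb_per_s=NIC_MB_PER_S):
    """Highest binlog production rate (MB/s) before the interface saturates."""
    return nic_mb_per_s / direct_clients


print(master_egress(1, 100))  # 100 slaves connected directly to M
print(master_egress(1, 10))   # 10 intermediate masters connected to M
print(max_binlog_rate(100))   # ceiling with 100 direct slaves
print(max_binlog_rate(10))    # ceiling with 10 intermediate masters
```

With these assumptions the ceiling moves from roughly 1 MB/s to roughly 10 MB/s of binary logs, which is the ten-fold gain described above.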
However, using intermediate masters has its problems:
- replication lag on an intermediate master will generate delay on all of its slaves
- if an intermediate master fails, all of its slaves stop replicating and they need to be reinitialized 
Looking more closely at the second problem in the context of Figure # 2, one could think that, if M1 fails, its slaves could be repointed to the other intermediate masters, but this is not that simple:
- S1 replicating from M1 depends on the binary logs of M1,
- M1 and M2 have different binary logs (they are different databases),
- manually extrapolating the position of S1 in the binary log of M2 is hard.
GTIDs can help us repoint slaves, but they will not solve the first problem above (replication lag on an intermediate master).
Note that we do not need the databases in the intermediate layer at all: we only need to serve binary logs. And, if the binary logs served by M1 and M2 were the same, we could easily swap their respective slaves. From those two observations, we built the idea of the Binlog Server.
Binlog Servers replace intermediate masters in Figure # 2. Each Binlog Server:
- downloads the binary logs from the master
- saves them to disk using the same structure as the master (filename and content)
- serves them to slaves as if they were served from the master
And of course, if a Binlog Server fails, we can simply repoint its slaves to the other Binlog Servers. Even better, since these hosts do not apply changes to a local dataset before serving them downstream, latency is greatly improved in comparison to using actual database servers.
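Because every Binlog Server stores the same file names and content as the master, the coordinates a slave has reached remain valid on any other Binlog Server. A minimal sketch of the repointing step, assuming the slave has been stopped (STOP SLAVE) and its SHOW SLAVE STATUS output has been read into a dictionary; the host name and the helper function are hypothetical:

```python
def repoint_statements(slave_status, new_host, new_port=3306):
    """Build the SQL that moves a stopped slave to another Binlog Server.

    slave_status holds two fields of SHOW SLAVE STATUS, read after
    STOP SLAVE so that the executed position no longer moves.
    """
    log_file = slave_status["Relay_Master_Log_File"]
    log_pos = int(slave_status["Exec_Master_Log_Pos"])
    change_master = (
        "CHANGE MASTER TO MASTER_HOST='{}', MASTER_PORT={}, "
        "MASTER_LOG_FILE='{}', MASTER_LOG_POS={}"
    ).format(new_host, new_port, log_file, log_pos)
    return [change_master, "START SLAVE"]


# Hypothetical coordinates and host name, for illustration only.
status = {"Relay_Master_Log_File": "binlog.000042", "Exec_Master_Log_Pos": "1337"}
for statement in repoint_statements(status, "binlog-server-2"):
    print(statement + ";")
```

The file and position are passed explicitly because changing MASTER_HOST alone makes MySQL discard the previous coordinates.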
We are working in collaboration with SkySQL to implement the Binlog Server as a module for the MaxScale pluggable framework. You can read this blog post by SkySQL for an introduction to MySQL replication, MaxScale and the Binlog Server.
Other use-case: avoiding deep nested replication on remote sites
The Binlog Server can also be used to reduce the problems with deep nested replication on remote sites.
If someone needs four database servers per site on two sites, the topology from Figure # 3 is a typical deployment when WAN bandwidth is a concern (E, F, G and H are on the remote site).
But this topology suffers from the problems explained above: replication delay on E slows down F, G and H, and a failure of E means losing F, G and H. It would be better if we could set it up like Figure # 4, but that needs more WAN bandwidth and, in case of a disaster on the main site, the slaves of the remote site need to be reorganized in a new tree.
Using a Binlog Server on the remote site, we can combine the best of both solutions (low bandwidth usage and no delay introduced by an intermediary database). The resulting topology is shown in Figure # 5.
The Binlog Server (X) looks like a single point of failure in the topology from Figure # 5, but if it fails, it is trivial to start another one. It is also possible to run two Binlog Servers on the remote site, as illustrated in Figure # 6. In this deployment, if Y fails, G and H are repointed to X. If X fails, E and F are repointed to Y, and Y is repointed to A.
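The repointing rules for the two Binlog Servers can be written down as a small function. The replication sources below are inferred from the sentences above (X replicates from A, Y from X, E and F from X, G and H from Y), since the figure itself is not reproduced here; treat them as an assumption.

```python
# Replication source of each node in Figure 6, as inferred from the text.
SOURCE = {"X": "A", "Y": "X", "E": "X", "F": "X", "G": "Y", "H": "Y"}


def repoint_plan(source, failed, survivor):
    """Return {node: new_source} for every node that replicated from `failed`."""
    plan = {}
    for node, upstream in source.items():
        if upstream != failed:
            continue
        if node == survivor:
            # The surviving Binlog Server moves up to the failed one's source.
            plan[node] = source[failed]
        else:
            plan[node] = survivor
    return plan


print(repoint_plan(SOURCE, failed="Y", survivor="X"))  # G and H move to X
print(repoint_plan(SOURCE, failed="X", survivor="Y"))  # E and F move to Y, Y moves to A
```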
Note that running Binlog Servers does not necessarily mean more hardware. In Figure # 6, X can be installed on the same server as E and Y on the same server as G.
Finally, those deployments (with one or two Binlog Servers) have an interesting property: if a disaster happens on the main site, the slaves on the remote site will converge to a common state (from the binary logs available on X). This makes reorganizing the slaves in a new replication tree easy:
- any slave can be the new master
- the binary log position of the new master is noted before sending writes to it (SHOW MASTER STATUS)
- the other nodes are placed as slaves of this new master at the position noted in the previous step
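The steps above can be sketched as follows, assuming the outputs of SHOW SLAVE STATUS and SHOW MASTER STATUS have already been read into dictionaries and that binary logging is enabled on the node chosen as the new master. Host names and function names are hypothetical.

```python
def converged(slave_statuses):
    """True when every slave has executed up to the same upstream coordinates."""
    coordinates = {
        (s["Relay_Master_Log_File"], int(s["Exec_Master_Log_Pos"]))
        for s in slave_statuses
    }
    return len(coordinates) == 1


def reorganize(new_master_host, master_status, other_slaves):
    """Map each remaining slave to the statement attaching it to the new master.

    master_status holds "File" and "Position" from SHOW MASTER STATUS,
    taken on the new master before any write is sent to it.
    """
    statement = (
        "CHANGE MASTER TO MASTER_HOST='{}', MASTER_LOG_FILE='{}', MASTER_LOG_POS={}"
    ).format(new_master_host, master_status["File"], int(master_status["Position"]))
    return {slave: statement for slave in other_slaves}


# Hypothetical values, for illustration only.
statuses = [
    {"Relay_Master_Log_File": "binlog.000042", "Exec_Master_Log_Pos": "1337"},
    {"Relay_Master_Log_File": "binlog.000042", "Exec_Master_Log_Pos": "1337"},
]
if converged(statuses):
    plan = reorganize("E", {"File": "e-bin.000007", "Position": 120}, ["F", "G", "H"])
    for slave, statement in plan.items():
        print(slave, "->", statement)
```

The same statement works for every remaining slave because they all hold the same data once they have converged.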
Other use-case: easy high availability
The Binlog Server can also be used as a high availability building block. If we want to be able to elect a new master in case of the failure of A in Figure # 7, we can deploy GTIDs or use MHA, but both have downsides.
If we deploy a Binlog Server between the master and the slaves as illustrated in Figure # 8:
- if X fails, we can repoint all the slaves to A
- if A fails, all the slaves will converge to a common state which makes reorganizing the slaves in a new replication tree easy (explained above)
If we want to combine extreme slave scaling and high availability, we can use the topology of Figure # 9.
If a Binlog Server fails, its slaves are repointed to the other Binlog Servers. If A fails:
- we find the Binlog Server that has the most binary logs (we suppose in this example that it is I2)
- we repoint the other Binlog Servers to I2 as illustrated in Figure # 10
- and once all the slaves have converged to a common state, we reorganize the topology
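Finding the most advanced Binlog Server comes down to comparing binary log coordinates, which is possible because all Binlog Servers use the master's file names. A minimal sketch, assuming the last file name and position of each Binlog Server have already been collected (how they are collected is not covered here); the names I1 and I3 and the coordinates are made up for the example:

```python
def binlog_coordinates(file_name, position):
    """Turn ('binlog.000042', 1337) into a sortable (42, 1337) tuple."""
    sequence = int(file_name.rsplit(".", 1)[1])
    return (sequence, int(position))


def most_advanced(servers):
    """servers maps a name to (file_name, position); return the name furthest ahead."""
    return max(servers, key=lambda name: binlog_coordinates(*servers[name]))


def failover_plan(servers):
    """Repoint every other Binlog Server to the most advanced one."""
    leader = most_advanced(servers)
    return {name: leader for name in servers if name != leader}


servers = {
    "I1": ("binlog.000042", 900),
    "I2": ("binlog.000043", 120),
    "I3": ("binlog.000042", 1500),
}
print(most_advanced(servers))
print(failover_plan(servers))
```

The file sequence number is compared before the position, so a server that has started a newer file is ahead even if its position within that file is small.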
We have presented a new component that we are introducing in our replication topologies: the Binlog Server. It allows us to horizontally scale our slaves without fear of overloading the network interface of the master and without the downsides of the traditional intermediate master solution.
We think that the Binlog Server can also be used in at least two other use-cases: remote site replication and easy topology reorganization after a master failure. In a follow-up post, we will describe another use-case of the Binlog Server. Stay tuned for more details.
Footnote: the slave reinitialization mentioned above for a failed intermediate master can be avoided using GTIDs or by using highly available storage for intermediate masters (DRBD or filers), but each of those two solutions brings new problems.