Understanding MySQL DB Replication in Its Simplest Form

Om Vikram Thapa
The Startup
Published in
6 min readAug 27, 2020
The Famous Master-Slave DB Architecture

This is one of my favourite topics and as you know everyone love Database (as far as I know) as well as everyone hate databases (Sadly, when it comes to a single point of failure). We at Goibibo Tech use MySQL to its core and we do have Master Slave database cluster in our system. Oh, wait..

Note — There is an assumption made in the blog that readers are aware of MySQL and AWS Infrastructure in its basic form.

Before jumping into DB Replication lets understand what is Cluster, HA and HA Cluster in simple terms —

Cluster is a provision where you can replicate the same instance type (be it AWS EC2 machines, Elastic Cache or RDS) to support Horizontal Scaling

HA or High Availability is the provision where you have Master-Slave combination to segregate the READ & WRITE traffic to support high availability of the database system.

HA Cluster as the name suggest itself, its a provision to have Master-Slave combination with flexibility to horizontally scale for high availability.

Below is an example of Redis DB, Redis Cluster, Redis HA Cluster -

High Availability Redis DB System

Thus we also have Master-Slave HA Cluster to support the numerous transactional and non-transactional queries on our MySQL DB.

But is this cluster sufficient enough to handle the load coming from Application Layer? Of course NOT thus we should have a DB Load Balancer or in AWS term let’s say an ELB (Elastic Load Balancer)

GitHub uses “Ferno” as DB Throttling Layer before their Master-Slave Cluster.

DB Load balancer

Good, so our set up is done and we look good to handle the DB connections, horizontally scale the DB instances on real time basis and distribute the centralised load of course. But hold on isn’t this blog is about Master Slave Replication and how does it really works? Oh yeah about that..

How does DB Replication Work?

How Replication Works?

Replication relies on three threads per master/slave connection:

One is created on the master and Two are created on the slaves.

  • The Slave I/O Thread - When you issue START SLAVE on a slave server, the slave creates this thread which connects to the master and requests a copy of the master’s binary log.
  • The Binlog Dump Thread - When the slave connects to the master, the master uses this thread to send the slave the contents of its binary log.
  • The Slave SQL Thread - The slaves creates this SQL (or applier) thread to read the contents of the retrieved binary log and apply its contents.

What is DB Replication “Lag” then?

Replication lag is caused when either the I/O Thread or SQL Thread cannot cope with the demands placed upon it.

If the I/O Thread is suffering, this means that the network connection between the master and its slaves is slow. You might want to consider enabling the slave_compressed_protocol to compress network traffic or speaking to your network administrator.

If it’s the SQL thread then your problem is probably due to poorly-optimized queries that are taking the slave too long to apply. There may be long-running transactions or too much I/O activity. Having no primary key on the slave tables when using the ROW or MIXED replication format is also a common cause of lag on this thread: check that both your master and slave versions of tables have a primary key.

Let’s see the Replication flow again NOW you will be able to understand the microscopic view :) (I hope so)

How Replication Actually Works?

Curious about bin-log? Don’t worry MySQL provides you under the hood system level queries which you can fire for eg. MySql >SHOW VARIABLES

SHOW VARIABLES shows bin log

How to Troubleshoot DB Replication Lag Issues?

There are various ways you can approach this issue whenever there is a lag and if you have the admin access you can follow the steps mentioned here -

Simply run this SQL statement in your master as well as slave node that is suspected experiencing a replication lag.

master> SHOW MASTER STATUS;
+---------------+-----------+
| File | Position |...
+---------------+-----------+
| binlog.000002 | 121752008 |...
+---------------+-----------+

slave> SHOW SLAVE STATUS;
********************* 1. row *********************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: replication
Master_Port: 22808
Connect_Retry: 60
Master_Log_File: binlog.000002
Read_Master_Log_Pos: 121409852
Relay_Log_File: relaylog.000002
Relay_Log_Pos: 119819329
Relay_Master_Log_File: binlog.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Exec_Master_Log_Pos: 120003004
Relay_Log_Space: 121226377

The initial fields that are common to trace for problems are,

  • Slave_IO_State — It tells you what the thread is doing. This field will provide you good insights if the replication health is running normally, facing network problems such as reconnecting to a master, or taking too much time to commit data which can indicate disk problems when syncing data to disk.
  • Master_Log_File — Master’s binlog file name where the I/O thread is currently fetch.
  • Read_Master_Log_Pos — binlog file position from the master where the replication I/O thread has already read.
  • Relay_Log_File — the relay log filename for which the SQL thread is currently executing the events
  • Relay_Log_Pos — binlog position from the file specified in Relay_Log_File for which SQL thread has already executed.
  • Relay_Master_Log_File — The master’s binlog file that the SQL thread has already executed and is a congruent to Read_Master_Log_Pos value.
  • Seconds_Behind_Master — this field shows an approximation for difference between the current timestamp on the slave against the timestamp on the master for the event currently being processed on the slave. However, this field might not be able to tell you the exact lag if the network is slow because the difference in seconds are taken between the slave SQL thread and the slave I/O thread. So there can be cases that it can be caught up with slow-reading slave I/O thread.
  • Slave_SQL_Running_State — state of the SQL thread and the value is identical to the state value displayed in SHOW PROCESSLIST.

In some cases, SHOW SLAVE STATUS is not enough to tell us the culprit. It’s possible that the replicated statements are affected by internal processes running in the MySQL database slave. Running the statements

SHOW [FULL] PROCESSLIST and SHOW ENGINE INNODB STATUS also provides informative data that gives you insights about the source of the problem.

A simple POC will have much more value than theoretical knowledge so please check the same in Staging DB (Master Slave Replica) Get Dirty!!

Hope this article will be able to clarify some of the common doubts and myths about Master Slave Replication and of course the famous “Lag”. Cheers.

References :

--

--

Om Vikram Thapa
The Startup

Engineering@Goibibo - Technology, Football n Travel