Step by Step Story: How to Get MySQL Replication Working Without Stopping the Master at Planeta Huerto

Alberto Hita
Published in planetahuerto
4 min read · May 8, 2020
baby steps, carefully

This story starts two years ago, when we moved our database platform to Google Cloud and the application code ended up far away from the persistence layer.

In the first step we had to stop the MySQL service and configure it as a master. This is required to make your master-slave setup functional 😃. So for this part we only had to configure the service like this:

Added in /etc/mysql/my.cnf:

server-id               = 1
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size = 100M
replicate-do-db = my_database_not_yours
binlog_do_db = my_database_not_yours

On the mysql console:

CREATE USER 'repl'@'%' IDENTIFIED BY 'y0ur_p4ssw0rd_h3r3';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
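
A quick sanity check at this point (not strictly required, just a suggestion) is to confirm on the mysql console that the master is writing binlogs and to note which file and position it is at, since those are the coordinates a slave starts from:

mysql> SHOW MASTER STATUS\G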

Ok! Right! Cool! We had our master active, great! But the system was not complete yet, so we created a slave MySQL service. Where? Outside the cloud platform, near the running application code. The latency from the master was around ~80 ms. We set up the slave like this:

Added in /etc/mysql/my.cnf:

server-id               = 2
log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size = 100M
replicate-do-db = my_database_not_yours
binlog_do_db = my_database_not_yours

On the mysql console:

CHANGE MASTER TO
  MASTER_HOST='your.master.server.ip',
  MASTER_USER='repl',
  MASTER_PASSWORD='y0ur_p4ssw0rd_h3r3';
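
Assuming the data and the binlog coordinates are already in place on the slave (for example from a dump taken with --master-data, as described later in this story), replication is then started and checked like this:

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

Both Slave_IO_Running and Slave_SQL_Running should say Yes.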

Well, at this point the next task was to ensure that the application was not writing and reading the same data in the same process. Uff… people angry… IT team overloaded… you know…

After everything was working relatively fine, the next step was to move the application to the cloud, and the world took on another color for us, more beautiful, people happy, IT team relaxed… you know… The slave service was not needed anymore… bye bye slave!

After a few months a new BI department was created with, of course, intensive, heavy and slow queries. The first approach was to create another instance of the database where we restored backup data every hour. The BI system used that instance as its data source for a period of time. Which problems appeared? When the BI process overlapped the data restoration, the results were unexpected. Some tables disappeared during the load (the restore process was dropping tables and creating them again), so the selects from the BI system were affected.

We decided that the better solution was to start replication from the current master server, and we did it without stopping the master. How?? That's the question… The recipe is more or less like this (a minimal sketch of the commands follows the list):

  • First of all: prepare the environment for the replication. A new server, well configured. Do not start the replication yet.
  • Load the data from a backup. The newer, the better.
  • Start the replication (use the start slave; MySQL command).
  • Be prepared for a crash!!
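
Roughly, a sketch of that order on the new replication server, assuming the my.cnf configuration shown earlier and a dump taken with --master-data=1 (the exact mysqldump command is near the end of this story):

mysql> CHANGE MASTER TO MASTER_HOST='your.master.server.ip',
    ->   MASTER_USER='repl', MASTER_PASSWORD='y0ur_p4ssw0rd_h3r3';

(load the dump at this point; its embedded CHANGE MASTER statement sets MASTER_LOG_FILE and MASTER_LOG_POS to the master's coordinates)

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G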

What happened at this point? When replication tries to apply a change, sometimes the row is not present, or it is present and it shouldn't be. But, what can we do now? The slave is stopped! God! So, keep calm and use the following command:

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: your.master.server.ip
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.005214
Read_Master_Log_Pos: 74844750
Relay_Log_File: mysql-2-bi-vm-relay-bin.000002
Relay_Log_Pos: 2959353
Relay_Master_Log_File: mysql-bin.005214
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB: my_database_not_yours
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table my_database_not_yours.my_table; Can't find record in 'my_table', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.005214, end_log_pos 34432626
Skip_Counter: 0
Exec_Master_Log_Pos: 34432371
Relay_Log_Space: 43375998
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table my_database_not_yours.my_table; Can't find record in 'my_table', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.005214, end_log_pos 34432626
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID: 39756898-2678-11e7-b650-0242ac120002
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 191130 10:40:12
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)

What can I do if the entry is not present? Well, you can load a new backup and start again. Another solution is to copy the affected table from the master to the replication server and start again. Each solution could work, depending on the table's update rate.
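
For the single-table option, a sketch using the same dump-and-pipe approach shown at the end of this story (same placeholder user, hosts and defaults file):

$ mysqldump -uuser -p -h your.master.server.ip --single-transaction --quick my_database_not_yours my_table | mysql --defaults-file=defaults.cnf -D my_database_not_yours -h your.replication.server.ip

Once the table is consistent again, start slave; resumes replication from where it stopped.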

What can I do if the entry is present and it should not be there? Well, perhaps what happened is that the entry was removed from the master after the backup was taken and replication starts from a later point. What you can do is load your backup again, or skip the statement which is causing the problem:

mysql> stop slave;
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
mysql> start slave;
mysql> show slave status\G

These commands skip one statement from the binlog and start replication again.

One thing that helped us to start from a recent backup was to run the following command:

$ mysqldump -uuser -p -h your.master.server.ip --master-data=1 --add-drop-database --routines --single-transaction --quick --lock-tables=false my_database_not_yours <your complete list of tables if you need them> | mysql --defaults-file=defaults.cnf -D my_database_not_yours -h your.replication.server.ip

This command adds master binlog information to the dumped data so the replication system can use it to start from this point.
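
Concretely, with --master-data=1 the dump includes, near the top, an uncommented statement along these lines (the file name and position here are just illustrative), which runs on the replication server when the dump is piped in:

CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.005214', MASTER_LOG_POS=74844750;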

I don't really want to lie to you: it was painful for at least a month, but we managed to start the replication system without stopping the master database. Perhaps there are other solutions, better solutions I think, but we did it this way.

I hope this story helps someone out there!!!

Until the next story!! Bye!!
