MyDumper, MyLoader and My Experience of migrating to AWS RDS

Dilip Kola · Tensult Blogs · Jul 2, 2019

This Blog has moved from Medium to blogs.tensult.com. All the latest content will be available there. Subscribe to our newsletter to stay updated.

Ref: https://tinyurl.com/yymj43hn

How did we migrate large MySQL databases, totaling 1TB across 3 different DB servers, to a single AWS RDS instance using mydumper?

Migrating a database involves 3 parts:

  • Dumping the data from the source DB
  • Restoring the data to target DB
  • Replication between source and target DBs

Our customer had decided to migrate from Azure to AWS and, as part of that, needed to migrate about 35 databases running on 3 different DB servers to a single RDS instance. RDS currently doesn’t support multi-source replication, so we decided to set up replication only from the largest DB server and to use the dump-and-restore method for the other 2 DB servers during the cutover period.

Setting up RDS Instance

To test the application end to end we need to change the data on the DB during testing, and that might cause issues in the DB replication process, so we decided to set up a separate staging stack for testing purposes alone.

Initially, we used native MySQL tools like mysqldump, but found that these tools generate a single dump file for the whole database, and some of our databases are larger than 400GB. Some of our triggers and views use DEFINER=`root`@`localhost`, but RDS doesn’t provide the root user, so we need to either update the DEFINER or remove it according to this documentation. We found it really challenging to update such huge dump files, so upon a suggestion from my friend Bhuvanesh, we decided to try out the mydumper tool.

Setting up a server for mydumper

We could have run mydumper from the source DB server itself, but we decided to run it on a separate server to reduce the load on the source DB server during the dumping and restoration phases. Now let us see how to install mydumper.

# Installers: https://github.com/maxbube/mydumper/releases
# You may choose to take the latest available release here.
sudo yum install https://github.com/maxbube/mydumper/releases/download/v0.9.5/mydumper-0.9.5-2.el7.x86_64.rpm
# Now we should have both mydumper and myloader commands installed on the server
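To confirm the installation, a quick version check of both binaries is enough:

# Both commands should print the installed version (0.9.5 in this example)
mydumper --version
myloader --version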

Dumping data from the source

The mydumper tool extracts the DB data in parallel and creates separate files for schemas and table data, so it is easy to modify them before restoring.

You will need to grant at least SELECT and RELOAD privileges to the mydumper user.
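As a rough sketch, creating such a user could look like this (the user name 'mydumper' and the '%' host pattern are placeholders; adjust them to your environment):

# Create a dedicated dump user with the minimum required privileges
mysql --host=<db-server-address> --user=<admin-user> --password=<admin-password> <<'SQL'
CREATE USER 'mydumper'@'%' IDENTIFIED BY '<mydumper-password>';
GRANT SELECT, RELOAD ON *.* TO 'mydumper'@'%';
FLUSH PRIVILEGES;
SQL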

# Remember to run the following commands in screen as they are long-running processes.
# Example 1: The following will dump data only from DbName1 and DbName2
time \
mydumper \
--host=<db-server-address> \
--user=<mydumper-username> \
--password=<mydumper-password> \
--outputdir=/db-dump/mydumper-files/ \
--rows=50000 \
-G -E -R \
--compress \
--build-empty-files \
--threads=16 \
--compress-protocol \
--regex '^(DbName1\.|DbName2\.)' \
-L /<mydumper-logs-dir>/mydumper-logs.txt
# Example 2: The following will dump data from all databases except DbName1 and DbName2
time \
mydumper \
--host=<db-server-address> \
--user=<mydumper-username> \
--password=<mydumper-password> \
--outputdir=/db-dump/mydumper-files/ \
--rows=50000 \
-G -E -R \
--compress \
--build-empty-files \
--threads=16 \
--compress-protocol \
--regex '^(?!(mysql|test|performance_schema|information_schema|DbName1|DbName2))' \
-L /<mydumper-logs-dir>/mydumper-logs.txt

Please decide the number of threads based on the CPU cores of the DB server and the server load. For more information on various mydumper options, please read this. Also, in case you want to use negative filters (Example 2) to select the databases to be dumped, please exclude the default databases (mysql, information_schema, performance_schema and test).

It is important to measure the time the dump takes, as it can be used to plan the migration for the production setup, so here I have used the time command to measure it. Also, please check whether any errors are present in /<mydumper-logs-dir>/mydumper-logs.txt before restoring the data to the RDS instance.
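For example, a quick scan of the dump log before moving on could look like this (using the same log path as the -L option above):

# Any hit here should be investigated before starting the restore
grep -iE 'error|critical|fatal' /<mydumper-logs-dir>/mydumper-logs.txt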

Once the data is extracted from the source DB, we need to clean it up before loading into RDS: we need to remove the definers from the schema files.

cd <dump-directory>
# Check if any schema files are using DEFINER; as the files are compressed, we need to use zgrep to search
zgrep DEFINER *schema*
# Uncompress the schema files
find . -name "*schema*" | xargs gunzip
# Remove definers using sed
find . -name "*schema*" | xargs sed -i -e 's/DEFINER=`[A-Za-z0-9_]*`@`localhost`//g'
find . -name "*schema*" | xargs sed -i -e 's/SQL SECURITY DEFINER//g'
# Compress again
find . -name "*schema*" | xargs gzip
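At this point it is worth re-running the check to confirm that no DEFINER clauses remain in the compressed schema files:

# Should print nothing if the cleanup worked
zgrep DEFINER *schema*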

Restoring data to RDS instance

Now that the data is ready to restore, let us prepare the RDS MySQL instance for faster restoration. Create a new parameter group with the following parameters and attach it to the RDS instance.

transaction-isolation=READ-COMMITTED
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_buffer_pool_size = {DBInstanceClassMemory*4/5}
innodb_io_capacity = 2000
innodb_io_capacity_max = 3000
innodb_read_io_threads = 8
innodb_write_io_threads = 16
innodb_purge_threads = 2
innodb_buffer_pool_instances = 16
innodb_flush_log_at_trx_commit = 0
max_allowed_packet = 900MB
time_zone = <use-source-DB-time-zone>
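If you prefer to do this from the AWS CLI, a sketch could look like the following (the group name restore-tuning is just an example, the family must match your MySQL engine version, and static parameters such as innodb_log_file_size only take effect after a reboot):

# Create a custom parameter group for the restore phase
aws rds create-db-parameter-group \
--db-parameter-group-name restore-tuning \
--db-parameter-group-family mysql5.7 \
--description "Temporary settings for faster myloader restore"
# Set the parameters listed above (repeat --parameters for each one)
aws rds modify-db-parameter-group \
--db-parameter-group-name restore-tuning \
--parameters "ParameterName=innodb_flush_log_at_trx_commit,ParameterValue=0,ApplyMethod=immediate"
# Attach the parameter group to the RDS instance
aws rds modify-db-instance \
--db-instance-identifier <rds-instance-id> \
--db-parameter-group-name restore-tuning \
--apply-immediately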

Also, you can initially restore to a bigger instance to achieve faster restoration and later change to the desired instance type.

# Remember to run the following command in screen as it is a long running process.
time \
myloader \
--host=<rds-instance-endpoint> \
--user=<db-username> \
--password=<db-password> \
--directory=<mydumper-output-dir> \
--queries-per-transaction=50000 \
--threads=8 \
--compress-protocol \
--verbose=3 \
-e 2> <myloader-output-logs-path>

Choose the number of threads according to the number of CPU cores of the RDS instance. Don’t forget to redirect STDERR to a file (2> <myloader-output-logs-path>) as it will be useful for tracking the progress.

Monitoring the progress of the loader: since it is a very long-running process, it is very important to check the progress regularly. Schema files get loaded very quickly, so we check the progress of the data files only, using the following commands.

# Following gives approximate number of data files already restored
grep restoring <myloader-output-logs-path>|grep Thread|grep -v schema|wc -l
# Following gives total number of data files to be restored
ls -l <mydumper-output-dir>|grep -v schema|wc -l
# Following gives information about errors
grep -i error <myloader-output-logs-path>
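Putting the first two commands together gives a rough progress indicator (paths and variable names here are illustrative):

# Rough progress: restored data files vs. total data files
restored=$(grep restoring <myloader-output-logs-path> | grep Thread | grep -v schema | wc -l)
total=$(ls <mydumper-output-dir> | grep -v schema | wc -l)
echo "$restored of $total data files restored"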

Verification of data on RDS against the source DB

This is a very important step to make sure that the data is restored correctly to the target DB. We need to execute the following commands on both the source and target DB servers, and we should see the same results.

# Check the databases
SHOW DATABASES;
# Check the tables count in each database
SELECT table_schema, COUNT(*) AS tables_count FROM information_schema.tables GROUP BY table_schema;
# Check the triggers count in each database
SELECT trigger_schema, COUNT(*) AS triggers_count FROM information_schema.triggers GROUP BY trigger_schema;
# Check the routines count in each database
SELECT routine_schema, COUNT(*) AS routines_count FROM information_schema.routines GROUP BY routine_schema;
# Check the events count in each database
SELECT event_schema, COUNT(*) AS events_count FROM information_schema.events GROUP BY event_schema;
# Check the row counts of all the tables in a database. Create a stored procedure that
# returns the row counts of all the tables of a database (a shell-based alternative is
# sketched after this block).
# Run the following in both DB servers and compare for each database.
call COUNT_ROWS_COUNTS_BY_TABLE('DbName1');
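If you prefer not to maintain a stored procedure, a shell-based sketch like the one below produces the same per-table counts; run it against both the source server and the RDS endpoint and compare the output (host, user and password are placeholders, and COUNT(*) on very large tables can be slow):

# Per-table row counts for one database
DB=DbName1
for t in $(mysql -N -h <db-server-address> -u <db-username> --password=<db-password> \
-e "SELECT table_name FROM information_schema.tables WHERE table_schema='$DB' AND table_type='BASE TABLE';"); do
count=$(mysql -N -h <db-server-address> -u <db-username> --password=<db-password> \
-e "SELECT COUNT(*) FROM \`$DB\`.\`$t\`;")
echo "$t $count"
done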

Make sure that all the commands are executed on both the source and target DB servers and that you see the same results. Once everything looks good, take a snapshot before proceeding any further, and change the DB parameter group to a new parameter group that matches your current source configuration.
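Taking that snapshot can also be scripted from the AWS CLI (both identifiers below are placeholders):

# Snapshot the restored instance before any further changes
aws rds create-db-snapshot \
--db-instance-identifier <rds-instance-id> \
--db-snapshot-identifier <post-restore-snapshot-id>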

Replication

Now that the data is restored, let us set up replication. Before we begin the replication process, we need to make sure that bin-logs are enabled on the source DB and that time_zone is the same on both servers.
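For example, both prerequisites can be checked on the source server in one go:

# log_bin should be ON; note the time zone values for comparison with RDS
mysql -h <db-server-address> -u <db-username> --password=<db-password> \
-e "SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format', 'time_zone', 'system_time_zone');"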

We can use the current server as the staging DB for end-to-end application testing. We then need to create one more RDS instance from the snapshot to set up replication from the source DB; we shouldn’t make any data modifications on this new RDS instance, and it should be used as the production DB by the applications.

# Get bin-log info of source DB from mydumper metadata file
cat <mydumper-output-dir>/metadata
# It should show something like below:
SHOW MASTER STATUS:
Log: mysql-bin-changelog.000856 # This is bin log path
Pos: 154 # This is bin log position
# Set external master
CALL mysql.rds_set_external_master(
'<source-db-server>',
3306,
'<source-db-replication-user>',
'<source-db-replication-password>',
'<source-db-bin-log-path>',
<source-db-bin-log-position>,
0);
# Start the replication
CALL mysql.rds_start_replication;
# Check the replication status
show slave status \G
# Make sure that there are no replication errors and Seconds_Behind_Master should reduce to 0.

Once replication has caught up, please verify the data again and plan for the application migration.

Make sure that you don’t directly modify the data on the target DB server until writes have completely stopped on the source DB and the applications are pointing to the target DB server. Also, set innodb_flush_log_at_trx_commit = 1 before switching the applications, as it provides better ACID compliance.
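Once writes on the source have fully stopped and the applications point to the target, the external replication link can be stopped and cleared using the RDS-provided procedures (a sketch, run against the target instance):

# Stop replication and detach the external master after the cutover is complete
mysql -h <rds-instance-endpoint> -u <db-username> --password=<db-password> <<'SQL'
CALL mysql.rds_stop_replication;
CALL mysql.rds_reset_external_master;
SQL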

Conclusion

We have learned how to use the mydumper and myloader tools to migrate MySQL databases to an RDS instance. I hope this blog helps you handle your next DB migration smoothly and confidently. In case you have any questions, please feel free to get in touch with me.
