Efficiently synchronizing databases for a production-like environment | RDS Aurora

Sumita Mudgil
MiQ Tech and Analytics
Jul 16, 2019 · 5 min read

When do you need production data?

With the concept of seed data, there remains a false notion that you no longer need production data in your development and test environments. But there are cases where raw production data is vital:

  • It’s hard to reliably test for issues and bugs if you don’t have an accurate replica of production data.
  • It’s hard to reproduce production bugs in a testing environment.
  • Seed data often isn’t deep and rich enough to cover edge cases, so you can’t be confident that an application you’re developing won’t fall over the moment you deploy it into your production environment.
  • Performance testing isn’t complete unless you test against the same volume of data as production.

What are we trying to achieve?

We use Amazon RDS Aurora as our production database. The problem was to find a fast, easy and secure way to replicate raw production data in the testing environment. This is an automation and process problem as well as a technology problem.

What’s the solution?

After some research, we narrowed down the solutions that could be used to replicate the data. Here are the options we considered:

mysqldump

You can use the native mysqldump command to take a dump of the production database and restore it into the testing DB. This is the most basic approach, and it isn’t efficient for dumping complete production data because:

  • It involves heavy IO with frequent disk writes, which makes the whole process very slow. Dumping even a few GB of data can take hours.
  • The dump can be inconsistent unless you use options that guarantee the same point in time for all tables and ignore changes made after the dump starts (see the sketch after this list).
  • The restore is even slower than the dump, as indices have to be rebuilt from scratch, which requires sorting and reorganizing the data.
  • The process isn’t cost-effective, since you pay for both the IO and the network transfer.
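
For reference, a consistent dump and restore looks roughly like the sketch below. The host names, user names and database name are placeholders, not values from our setup.

```sh
# --single-transaction gives a consistent point-in-time dump for InnoDB tables
# --quick streams rows instead of buffering whole tables in memory
mysqldump --single-transaction --quick --routines --triggers \
  -h prod-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com \
  -u dump_user -p my_database > my_database.sql

# Restoring into the testing DB is even slower: indices are rebuilt from scratch
mysql -h test-cluster.cluster-yyyy.us-east-1.rds.amazonaws.com \
  -u test_user -p my_database < my_database.sql
```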

Binary logs

Another way to achieve replication is to restore the testing DB from the production DB once, and then periodically apply the production binary logs to it. You need the binary logs generated since the last restore; applying them brings the testing database up to the latest committed transactions. We couldn’t use this approach because:

  • With the binary logging approach, it’s not recommended to perform write operations on the testing DB, as they can make the slave inconsistent with the master and lead to replication errors.
  • Enabling binary logs consumes extra storage, since every change made to the data is logged.
  • A managed version of this replication isn’t available between two Aurora clusters in the same region.
  • There is a slight performance impact from the overhead of writing binary logs.
  • Enabling binary logging on Aurora directly affects the recovery time after a crash because it forces the DB instance to perform binary log recovery.
  • As RDS is a managed service, you can’t set up this kind of replication yourself the way you can on a standalone DB.
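
If you did want to experiment with this approach, binary logging on Aurora MySQL is controlled through the cluster parameter group rather than per instance. A minimal sketch, assuming a parameter group named testing-cluster-params is already attached to the cluster:

```sh
# binlog_format is a cluster-level parameter; a reboot is needed for it to apply
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name testing-cluster-params \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=pending-reboot"

# From a MySQL client, verify the format and the binlog retention window:
#   SHOW VARIABLES LIKE 'binlog_format';
#   CALL mysql.rds_show_configuration;
```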

Snapshot

This is a better approach than the previous two. In this process, you periodically take a snapshot of the production DB and restore it into the testing environment. The restore creates a new cluster with the exact configuration of the production DB, so you need to give it a different DB identifier to bring the cluster up. Once the cluster is created, you can change configuration such as users, security groups, parameter groups, IAM roles and tags. This is a fast approach. The only downside is the storage used: the production data is copied to another location, which is referenced by the newly created testing DB. The process has to be repeated periodically so that the testing environment stays close to production.
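
With the AWS CLI, the flow looks roughly like this. The cluster identifiers, engine name, instance class and security group ID below are placeholders for your own values.

```sh
# Take a snapshot of the production cluster and wait until it is available
SNAPSHOT_ID="prod-snapshot-$(date +%F)"
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier prod-cluster \
  --db-cluster-snapshot-identifier "$SNAPSHOT_ID"
aws rds wait db-cluster-snapshot-available \
  --db-cluster-snapshot-identifier "$SNAPSHOT_ID"

# Restore the snapshot into a brand-new cluster with a different identifier
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier testing-cluster \
  --snapshot-identifier "$SNAPSHOT_ID" \
  --engine aurora-mysql

# The restore only creates the cluster; add at least one instance to it
aws rds create-db-instance \
  --db-instance-identifier testing-cluster-1 \
  --db-cluster-identifier testing-cluster \
  --db-instance-class db.r5.large \
  --engine aurora-mysql

# Then adjust security groups, parameter groups, users, IAM roles and tags
aws rds modify-db-cluster \
  --db-cluster-identifier testing-cluster \
  --vpc-security-group-ids sg-0123456789abcdef0 \
  --db-cluster-parameter-group-name testing-cluster-params
```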

DB cloning

Using database cloning, you can quickly and cost-effectively create clones of your database. The cloned databases require only minimal additional space when first created. Database cloning uses a copy-on-write protocol, in which data is copied at the time that data changes, either on the source databases or the clone databases.

Both the source database and the clone database initially point to the same storage pages. When either database changes a page, the change isn’t written to the original page; instead, additional storage is used to create a new copy of that page (this is illustrated in the AWS documentation linked below).

The process takes very little time. As with the snapshot method, you have to make configuration changes such as setting a new DB identifier, adding testing users and so on. AWS provides no built-in mechanism to automate deleting old clones and creating new ones, but you can use the AWS CLI and write a custom shell script to perform the same task.
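
Cloning is exposed through the point-in-time-restore API with a copy-on-write restore type. A minimal sketch with placeholder identifiers:

```sh
# Clone the production cluster; storage is shared until pages diverge
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier prod-cluster \
  --db-cluster-identifier testing-clone \
  --restore-type copy-on-write \
  --use-latest-restorable-time

# As with a snapshot restore, the clone is only a cluster; add an instance to it
aws rds create-db-instance \
  --db-instance-identifier testing-clone-1 \
  --db-cluster-identifier testing-clone \
  --db-instance-class db.r5.large \
  --engine aurora-mysql
```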

We have achieved this automation using Jenkins. Use the links below to understand how DB cloning works internally and how it saves storage compared to the DB snapshot process.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html

https://aws.amazon.com/blogs/aws/amazon-aurora-fast-database-cloning/

This replication process is kick-started by a Jenkins job, which involves the steps below (a simplified script sketch follows the list):

  • The job takes the target database as an input parameter and can start replication on demand, in addition to a scheduled run every two days.
  • It deletes the target DB, creates a clone from the production DB, creates users, adds tags, attaches instance-level and cluster-level parameter groups, and adds IAM roles to the DB.
  • Cloning copies the whole cluster rather than a specific database within it, so after the replication completes the job truncates the specified database in the clone and refills it from a separate development database.
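
A simplified sketch of the kind of script such a job could run is shown below. The cluster name, parameter groups, security group, role ARN and instance class are placeholders, not our actual values.

```sh
#!/usr/bin/env bash
set -euo pipefail
TARGET_CLUSTER="$1"   # e.g. testing-clone

# 1. Delete the previous clone: instances first, then the cluster
aws rds delete-db-instance --db-instance-identifier "${TARGET_CLUSTER}-1" || true
aws rds wait db-instance-deleted --db-instance-identifier "${TARGET_CLUSTER}-1" || true
aws rds delete-db-cluster --db-cluster-identifier "$TARGET_CLUSTER" \
  --skip-final-snapshot || true

# 2. Create a fresh clone from production (same command as in the section above)
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier prod-cluster \
  --db-cluster-identifier "$TARGET_CLUSTER" \
  --restore-type copy-on-write \
  --use-latest-restorable-time

# 3. Attach parameter groups, security groups, tags and roles, then add an instance
#    (in practice you may need to wait for the new cluster to become available first)
aws rds modify-db-cluster --db-cluster-identifier "$TARGET_CLUSTER" \
  --db-cluster-parameter-group-name testing-cluster-params \
  --vpc-security-group-ids sg-0123456789abcdef0
aws rds add-role-to-db-cluster --db-cluster-identifier "$TARGET_CLUSTER" \
  --role-arn arn:aws:iam::123456789012:role/testing-s3-access
aws rds add-tags-to-resource \
  --resource-name "arn:aws:rds:us-east-1:123456789012:cluster:${TARGET_CLUSTER}" \
  --tags Key=environment,Value=testing
aws rds create-db-instance --db-instance-identifier "${TARGET_CLUSTER}-1" \
  --db-cluster-identifier "$TARGET_CLUSTER" \
  --db-instance-class db.r5.large --engine aurora-mysql \
  --db-parameter-group-name testing-instance-params
```

Creating the testing users and truncating and refilling the application database would then be done with a MySQL client once the instance is reachable.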

Impact

This solution has helped a lot in big releases that need schema changes in production. We can take a fresh copy of production data within 15 minutes and run the migration scripts against it, which helps us figure out the downtime needed for deployment.

But a word of warning…

Development and test environments often don’t have the same security measures as the production environment. It’s important to mask sensitive information when you bring production data into another environment like this.
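
For example, a masking step can be run against the clone before anyone is granted access. The host, table and column names below are hypothetical placeholders:

```sh
# Overwrite personally identifiable fields with synthetic values
mysql -h testing-clone.cluster-yyyy.us-east-1.rds.amazonaws.com -u admin -p -e "
  UPDATE app_db.users
     SET email        = CONCAT('user', id, '@example.com'),
         phone_number = NULL;
"
```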

Summary

We discussed different ways of getting production-like data into testing and development environments. Every approach has its pros and cons, so make the decision based on your use case. Although it’s good practice to have data similar to production, this is not the end goal. The next step is seed data that covers all the cases present in production data, plus many more that the approach described here doesn’t. Think of this as a midway solution on the path to rich seed data for testing.
