Table of contents
- Why a serverless approach for database migration and replication?
- Prerequisites — ETL + AWS DMS + AWS RDS
- A Step-by-Step process and demo
- BONUS — Pro Tips!
1: Why a serverless approach for database migration and replication?
In one of my previous blogs (linked below), I have already explained why the serverless approach is always a good choice in terms of Operational Excellence, Security, Performance, Reliability, and Cost.
Tip: These are also called the 5 pillars of the Well-Architected Framework. All five pillars work in synergy with each other, but in my experience, Performance & Cost are the most important and deserve the closest attention.
Amazon S3 Data Lake | Storing & Analyzing the Streaming Data on the go | A Serverless Approach
Making an Amazon S3 data lake by storing the streaming data & analyzing it on the go in near real-time…
A popular open-source Change Data Capture (CDC) platform, Debezium (https://debezium.io/), is available for CDC migration and replication. It is amazing, and I have used it personally. But in terms of the 5 pillars of the Well-Architected Framework, it has the following disadvantages:
- Operational Excellence: Requires a global read lock on the source database, operational maintenance, Debezium lag monitoring, maintenance of the server that runs Debezium, etc.
- Security: Encryption in transit and encryption at rest are not managed for you.
- Performance/Scalability: Scales easily via Docker, but auto-scaling is cumbersome at times.
- Reliability: Has several internal dependencies on ZooKeeper, Kafka, etc., which makes it less reliable.
- Cost: The software is free, but the maintenance cost is there!
Hence, I will use a managed, serverless process for migration and replication using AWS DMS, and I will show you a step-by-step demo for setting it up. It has its own disadvantages too, which I cover at the end in the BONUS section, along with ways to handle them!
2: Prerequisites — ETL + AWS DMS + AWS RDS
2.1: ETL: ETL stands for Extract, Transform, Load. Usually, ETL data pipelines are used to give structure to unstructured data.
But we can also use them to migrate a structured data source from one database to another, whether relational or non-relational. That is what we are going to do now, not from scratch but in a more or less serverless way.
2.2: AWS DMS: The Amazon Web Services (AWS) Database Migration Service (DMS) is a service we use to set up a full-fledged ETL process. Setting up an ETL process is usually easy, but maintaining one is non-trivial due to the several dependencies on Docker, Debezium, Kafka, etc.
If you want to learn ETL from scratch, you can check out my article below. The ETL process I discussed in that article is the same one our replication instance is going to run (explained later).
4 Easy steps to setting up an ETL Data pipeline from scratch
Setting up an ETL pipeline within a few commands
Now we know why we need AWS DMS. AWS DMS comprises 3 major components:
- AWS DMS Endpoints: These are the source and target endpoints. While setting these up, take care of VPC networking, especially for the replication instance, which needs network connectivity to both the source and the target. AWS also provides an option for testing this.
- AWS DMS Replication Instance: Simply an EC2 instance that runs our entire ETL process. The replication instance stores the CDC (Change Data Capture) checkpoints and table histories for a specified duration (so we can resume later if we stop the replication) and does all the heavy lifting for us under the hood. My reference GitHub repo is attached below for better understanding.
- AWS DMS Replication Task: The ETL task that runs inside the replication instance. For learning and understanding purposes, the code for setting up ETL using Debezium CDC is attached below; AWS uses the same methodology for data migration and replication.
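To make the three components concrete, here is a minimal sketch of the request payloads they map to. All identifiers, hosts, and credentials are placeholders of my own, not values from this demo; in practice each dict would be passed to the matching boto3 DMS call.

```python
# Placeholder payloads for the three DMS components. In practice:
#   boto3.client("dms").create_endpoint(**source_endpoint), etc.

source_endpoint = {
    "EndpointIdentifier": "mysql-source",      # placeholder name
    "EndpointType": "source",
    "EngineName": "mysql",
    "ServerName": "ec2-xx-xx-xx-xx.compute.amazonaws.com",  # placeholder host
    "Port": 3306,
    "Username": "dms_user",
    "Password": "********",
}

replication_instance = {
    "ReplicationInstanceIdentifier": "dms-demo-instance",
    "ReplicationInstanceClass": "dms.t2.medium",
    "AllocatedStorage": 50,          # GB; holds CDC checkpoints, table history
    "PubliclyAccessible": False,     # keep it inside the VPC
}

replication_task = {
    "ReplicationTaskIdentifier": "dms-demo-task",
    "MigrationType": "full-load-and-cdc",  # full load, then ongoing CDC
    # SourceEndpointArn / TargetEndpointArn / ReplicationInstanceArn and the
    # TableMappings JSON are filled in once the resources above exist.
}
```

The same payload shapes work for the console, the CLI, and boto3, which is why thinking of DMS as these three objects makes the later steps easier to follow.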
3: A Step-by-Step process and demo
In this demo, I will take a database hosted on an ec2-instance as a source endpoint and AWS RDS as a target endpoint.
Step 1 Setting up source and target endpoints: The configurations are attached below.
Note: Use extra connection attributes to set parallelLoadThreads for loading tables in parallel, or useBCPFullLoad for bulk load. You can change them as per your needs. They are important, and some of the connection attributes are as follows:
- safeguardpolicy — Changes the behavior of SQL Server by opening transactions to prevent the transaction log from being truncated while AWS DMS is reading the log. Valid values are EXCLUSIVE_AUTOMATIC_TRUNCATION or RELY_ON_SQL_SERVER_REPLICATION_AGENT (default).
- useBCPFullLoad — Directs AWS DMS to use BCP (bulk copy) for data loading. Valid values are Y or N. When the target table contains an identity column that does not exist in the source table, you must disable the use of BCP for loading the table by setting the parameter to N.
- BCPPacketSize — If BCP is enabled for data loads, then enter the maximum packet size used by BCP. Valid values are 1–100000 (default 16384).
- controlTablesFileGroup — Specifies the file group to use for the control tables that the AWS DMS process creates in the database.
- parallelLoadThreads — Discussed above.
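DMS expects these attributes as a single semicolon-separated string on the endpoint. As a small sketch (the helper function is hypothetical, not part of any AWS SDK), assembling them could look like:

```python
# Hypothetical helper: render extra connection attributes in the
# "key=value;key=value" form that the DMS endpoint field expects.

def build_connection_attributes(**attrs):
    """Join keyword arguments into a semicolon-separated attribute string."""
    return ";".join(f"{key}={value}" for key, value in attrs.items())

target_attrs = build_connection_attributes(
    useBCPFullLoad="Y",       # bulk copy for the full load
    BCPPacketSize=16384,      # the default packet size
    parallelLoadThreads=4,    # load tables in parallel
)
print(target_attrs)
# useBCPFullLoad=Y;BCPPacketSize=16384;parallelLoadThreads=4
```

The resulting string is what you paste into the endpoint's extra connection attributes field (or pass as ExtraConnectionAttributes via the API).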
Step 2 Setting up replication instance: The master configuration is attached in Step 3.
Note: If your source and target (say, staging and production) are in different VPCs, then the replication instance won't work unless the VPCs are connected by a VPC peering connection on both sides.
Before moving ahead, please manually run the connectivity test between the endpoints and the replication instance.
Step 3 Setting up the replication task: This is the most important part because, if the replication task is configured incorrectly, it will easily go into an Errored state, and you will then need to monitor the task using CloudWatch logs. The configuration template and a screenshot are attached below.
For all the configurations, refer to the master configuration file below:
Use the Table Mapping below under the Advanced settings:
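For reference, a minimal selection-rule mapping has the shape below (the classicmodels schema name is my assumption for this demo; substitute your own schema, and the "%" wildcard matches every table in it):

```python
import json

# Minimal DMS table mapping: include every table in one schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {
                "schema-name": "classicmodels",  # assumed demo schema
                "table-name": "%",               # wildcard: all tables
            },
            "rule-action": "include",
        }
    ]
}

print(json.dumps(table_mappings, indent=2))
```

The same JSON document is what the console's Table mappings editor, the CLI's --table-mappings flag, and the API's TableMappings parameter all accept.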
Tip: The AWS CLI can also be used to create the replication instance and task using the above config.
As you can see, all the classic models have been migrated, and CDC replication will start based on the configuration selected above!
4: BONUS — Pro Tips!
4.1: DMS Limitations (we encountered 😤):
The AUTO_INCREMENT attribute on a column was not migrated to a target database column.
Indexes created on a portion of a column value aren't migrated. Secondary indexes, procedures, triggers, etc. were not migrated as part of the data migration. The reason is that AWS focuses on a minimalistic approach (objects only) for data migration.
DMS didn't migrate the default values either.
So, a good solution for migrating the indexes is to dump the schema first and then start the replication task with the Truncate target-table preparation mode, which will preserve the dumped schema (with AUTO_INCREMENT, indexes, and default values) and migrate the data and replicate the CDC only.
Alternatively, the best solution is to migrate the indexes manually if you realize this only after the data migration is entirely done. This trick also works fairly well!
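As a sketch, the task setting that makes the truncate-based approach work is TargetTablePrepMode in the task's FullLoadSettings; TRUNCATE_BEFORE_LOAD empties the pre-created tables instead of dropping and recreating them, so the dumped schema survives:

```python
# Fragment of DMS replication task settings for the truncate-based
# approach: pre-create the schema yourself (e.g. from a mysqldump of
# the schema), then let DMS only truncate and fill the tables.
task_settings = {
    "FullLoadSettings": {
        # Valid values: DO_NOTHING, DROP_AND_CREATE, TRUNCATE_BEFORE_LOAD.
        # TRUNCATE_BEFORE_LOAD keeps indexes, AUTO_INCREMENT, and defaults.
        "TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD",
    },
}
```

This fragment is merged into the full task-settings JSON; the other sections keep their defaults.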
Another limitation: if the MySQL-compatible source is stopped during full load, the AWS DMS task doesn't stop with an error. The task ends successfully, but the target might be out of sync with the source. If this happens, either restart the task or reload the affected tables; that's the only solution.
4.2: Cost: Usually, for data migration of up to 1–2 TB, a t2.medium or t2.large works fairly well and costs around 45–50 USD/month. Increasing the instance type only helps with the initial data migration; the replication speed remains the same across all instance types. Try not to go for a much higher instance type, as it will only cost you more!
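A quick back-of-the-envelope check of that figure (the hourly rate below is an illustrative placeholder, not an official price; look up the current on-demand rate for your region):

```python
# Rough monthly cost of an always-on DMS replication instance.
hourly_rate_usd = 0.065    # assumed illustrative rate for a t2.medium-class instance
hours_per_month = 730      # average hours in a month (24 * 365 / 12)

monthly_cost = hourly_rate_usd * hours_per_month
print(f"~{monthly_cost:.2f} USD/month")  # lands in the 45-50 USD/month range
```

Storage for CDC checkpoints and data transfer add a little on top, but the instance-hours dominate the bill.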
AWS DMS is billed by the hour if you choose to replicate ongoing CDC changes after the migration. Prefer to enable validation only if you really need it, because enabling validation will substantially increase the migration time and hence the cost!
That’s pretty much it!
I hope you found this blog helpful. Stay tuned for more such blogs on state-of-the-art tech. Thanks for reading :)
Big Data & DevOps:
- Towards Data Science Publication — 4 Easy steps to setting up an ETL Data pipeline from scratch — https://bit.ly/37ljWVI
- Towards Data Science Publication — Amazon S3 Data Lake | Storing & Analyzing the Streaming Data on the go | A Serverless Approach — https://bit.ly/397bfyF
- DEV Community — 5 Best practices of using GIT, that every software developer should know — in 5 minutes — https://bit.ly/3boXIEH