One-Time Data Migration to the Cloud
A Big Step in the Cloud-First Approach
One of the most difficult and important steps when you move to the cloud and start doing well as a team with a Cloud-First implementation strategy is migrating the legacy data / databases to the cloud. Our team hit such a milestone after our initial success in the cloud (AWS). The difficulty of coming up with an optimal solution for the data migration depends on the team's maturity in cloud solutions and its data-engineering experience. We were moderate players on both counts, so we had to take suggestions from other teams and experts who had done a similar exercise. Our task: migrate a Mainframe IBM DB2 database to Aurora MySQL.
Too Many Options
There were way too many options when it came to data migration. Some of the interesting thoughts and tools I came across:
- AWS Data Migration Service: a very good option and a managed service from AWS itself that spins up EC2 instances to migrate data from a source DB to a target AWS DB. Unfortunately it didn't support Mainframe DB2 as a source at that time, and most importantly we would have had to open up the firewall for the DB2 database port at the on-prem data-center so the DMS service could connect to DB2 :( (which is too difficult in an enterprise).
- Debezium: a highly scalable CDC tool built for exactly these kinds of use cases. Unfortunately legacy systems aren't very flexible, and ours follows a shared-tenancy model where one IBM server hosts multiple business databases. We only owned a piece of the pie, and since the Debezium DB2 documentation recommends database-server-level configuration changes, we didn't move further on this option. An interesting thing to look into, if you can make database-server-level changes, is debezium-server (a standalone tool built for cloud migration that supports a multi-cloud pub-sub pattern).
- Kafka-Connect JDBC Source: we have an enterprise Kafka cluster that has Kafka Connect capabilities and, more importantly, cloud replication capabilities. All it means for us is setting up multiple JDBC source connectors that connect to the DB2 database and move the data into Kafka topics, which are then replicated to the cloud by means of mirroring. Once the data is in the cloud, it can be consumed in many ways. The only downside here is the cost factor: the cost of setting up 50 topics (equal to the table count) and the cost of replication to the cloud. This seemed too pricey for a one-time migration. (Performance is also a concern with kafka-connect, as it queries the tables by offset and moves data instead of reading the database bin logs, unlike the other two options. For us this was not a concern, as our data was around 30 GB and manageable with kafka-connect.)
Our Approach
To understand the actual workings of the options listed above, I started doing a POC on Debezium and Kafka-Connect. With Debezium, I couldn't go too far as I hit a snag with the DB2 server changes, so I ended up creating a dockerized DB2 instance and doing CDC on it, for learning's sake. It worked (no surprise :) )
Moving on to the kafka-connect solution. In an enterprise there is usually an established Kafka cluster, a dedicated administrator and scripts to get things done. But when I wanted to run it on my own, I initially chose the landoop image. It worked pretty well for initial runs, but confluent-local seemed like a much better option, as it was the latest release from Confluent and made life much easier. I would have preferred a Docker image, but a tar installation was not bad either, as it's well documented and stable too.
Below is the component flow diagram. Confluent-Connect is the center of it all, running on a Linux VM within the data-center. Make sure that kafka-connect has no firewall restrictions to the DB2 database, and if possible get read-only credentials dedicated to the migration job (you don't want to lock your service account like I did :))
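Before registering any connectors, it's worth a quick check that the VM can actually reach the DB2 listener through the firewall. A minimal sketch; the host name is a placeholder and 50000 is just the common DB2 default port:

# Quick reachability check from the Connect VM to the DB2 host/port
# (hostname, port and timeout are placeholders for your environment)
import socket

DB2_HOST = "db2.onprem.example.com"  # placeholder
DB2_PORT = 50000                     # common DB2 default; adjust as needed

try:
    with socket.create_connection((DB2_HOST, DB2_PORT), timeout=5):
        print(f"Reachable: {DB2_HOST}:{DB2_PORT}")
except OSError as err:
    print(f"Blocked or unreachable: {DB2_HOST}:{DB2_PORT} -> {err}")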
With regards to the sink connector, I could also have tried the JDBC sink connector, but the fact that the data could be persisted in S3 for me to review in case of transformation issues, data-type issues, etc. led me to choose the S3 connector over the JDBC connector. It also gave me the opportunity to fix some quirks of the DB2 data, like trailing spaces.
Setting up Kafka Connect
Follow the instructions on the Confluent website, install Kafka Connect and start the local server. It starts up several components that run on different ports; if you're behind a network firewall, make sure the ports used by Confluent Kafka do not clash with any firewalled ports.
Once installed, add the JDBC Source connector and S3 Sink connector from Confluent Hub. You can either install through the command line or do a manual install. Download the JDBC driver for DB2 and add it to the classpath as shown below.
confluent-hub install confluentinc/kafka-connect-jdbc:10.2.0
confluent-hub install confluentinc/kafka-connect-s3:10.0.0
# add the db2 jdbc driver JAR to classpath
# either add it to /share/kafka/plugins/confluentinc-kafka-connect-jdbc (or) /share/java/kafka-connect-jdbc
# restart Connect if needed for the classpath change to take effect
DB2 JDBC Source Connector
Time to deploy the JDBC source connector. I just followed the connector documentation, and this deep-dive blog helped me figure out the use of each parameter; I used them as needed. I used bulk mode with a big poll interval so that I could load an entire table into a topic in one shot. The topic name is the same as the table name; you can customize it if needed.
The JDBC source connector queries the table by doing a "select *"; you can customize the query if needed. I combined a bunch of smaller tables in one connector and dedicated a connector config to the bulky tables. Since we're executing in a local single-node setup, a tasks.max of 1 made sense per JDBC source connector. I did experiment with a higher task count, but it didn't really help with performance (I was happy with the speed, considering I ran on a single node). If the configuration is right, you should start seeing topics getting created with the appropriate schema: 40 topics for 40 tables.
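For reference, here is a hedged sketch of what registering such a bulk-mode source connector through the Connect REST API could look like. The connector name, connection URL, credentials, schema and table list are all placeholders; the property names come from the Confluent JDBC source connector documentation:

# Register a bulk-mode JDBC source connector via the Connect REST API.
# Connector name, connection URL, credentials and table list are placeholders.
import json
import requests

config = {
    "name": "db2-bulk-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:db2://db2.onprem.example.com:50000/MYDB",
        "connection.user": "readonly_user",
        "connection.password": "********",
        "mode": "bulk",                   # full table load on each poll
        "poll.interval.ms": "86400000",   # large interval: effectively a one-shot load
        "schema.pattern": "MYSCHEMA",     # placeholder schema
        "table.whitelist": "CUSTOMER,ORDERS",
        "topic.prefix": "",               # empty prefix so topic name == table name
        "tasks.max": "1",                 # single-node local setup
    },
}

resp = requests.post("http://localhost:8083/connectors", json=config)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))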
The JDBC source connector first queries the table for its row count, and that count shows up as the End column above. For every successful fetch from DB2, the offset catches up as new records are loaded from DB2.
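If you want to keep an eye on progress outside the UI and logs, the Connect REST API also exposes connector and task status. A small sketch, reusing the placeholder connector name from above:

# Poll the Connect REST API to confirm the source connector and its task are RUNNING
import requests

status = requests.get("http://localhost:8083/connectors/db2-bulk-source/status").json()
print(status["connector"]["state"])            # e.g. RUNNING
for task in status["tasks"]:
    print(task["id"], task["state"], task.get("trace", ""))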
S3 Sink Connector
The S3 sink connector can be started in parallel with the JDBC source connector. Again, the documentation is gold and walks you through all the required inputs. Pay attention to the Credentials Provider; I used the profile-based credential provider and set up credentials in the ~/.aws/credentials file (make sure you have the appropriate write permissions in IAM, and KMS access if the bucket is encrypted).
Key configs to note are the bucket, region, storage and format classes, and the flush size; a sketch of registering the sink connector follows.
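Again a hedged sketch rather than my exact config: the bucket, region and topic list are placeholders, and with the default credentials provider chain the connector picks up the profile in ~/.aws/credentials:

# Register the S3 sink connector; bucket, region and topic list are placeholders.
import requests

config = {
    "name": "s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "CUSTOMER,ORDERS",      # the topics created by the source connector
        "s3.bucket.name": "my-migration-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",            # records per S3 object; tune per table size
        "tasks.max": "1",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=config)
resp.raise_for_status()
print(resp.json()["name"], "created")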
So What Next? LAMBDA!!!
When it came to the S3 → Aurora data writing, the natural choice was a Lambda. With an S3 event trigger, the Lambda gets invoked for each file written to the S3 bucket. What runtime to choose? I didn't debate or think twice here; my natural inclination was towards Python 🐍. I had already used Python for many data-migration jobs using DynamoDB, pandas and numpy, so I figured I could come up with something similar to write to Aurora as well.
Pre-Requisites:
- RDS is deployed in a VPC, so make sure your Lambda is in the VPC too
- Don’t bother trying to package numpy and pandas with your code; use this public layer: https://github.com/keithrozario/Klayers (pandas ARN)
- Make sure the Aurora database tables are created and the Lambda has the right IAM role (or) credential/secret access to connect and write to the database.
- The JDBC source connector converts dates and timestamps into interesting number formats; I had to spend some time untangling that knot, but it worked out well. Also, pandas infers number-like columns (Zip-Code, for example) as numeric, so it strips off leading zeroes; see the clean-up sketch after this list.
- More trial and error depending on your data, so be patient 😌
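To make the date and zip-code handling above concrete, here is a rough sketch of the kind of clean-up I mean. Column names are placeholders, and the epoch-days / epoch-millis interpretation depends on your converter settings, so verify against a sample record first:

# Sketch of clean-up applied after loading one S3 JSON object into pandas.
# Column names are placeholders; verify the numeric date formats against your data.
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    # DATE columns often arrive as days since epoch, TIMESTAMP as milliseconds
    df["BIRTH_DATE"] = pd.to_datetime(df["BIRTH_DATE"], unit="D").dt.date
    df["CREATED_TS"] = pd.to_datetime(df["CREATED_TS"], unit="ms")
    # keep leading zeroes on code-like columns by forcing string dtype and padding
    df["ZIP_CODE"] = df["ZIP_CODE"].astype(str).str.zfill(5)
    # trim trailing spaces carried over from fixed-width CHAR columns in DB2
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.rstrip()
    return df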
My source code is not the most perfect Python you will ever see, but it does the job well. I can think of introducing schema validation, managing the input format as Avro instead of JSON, etc.; I haven't found time to make those changes yet, but it can easily be modified to do all that. Add the below to requirements.txt (for the Aurora MySQL connection), apart from the pandas layer, for python-3.8:
pymysql==1.0.2
SQLAlchemy==1.4.2
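For orientation, a minimal sketch of what such an S3-triggered Lambda can look like; this is not my exact source code, and the bucket layout, table naming and credential handling are assumptions:

# Sketch of the S3-triggered Lambda: read the JSON object the sink connector wrote,
# clean it up with pandas, and append it to the matching Aurora MySQL table.
import json
import urllib.parse

import boto3
import pandas as pd
from sqlalchemy import create_engine

s3 = boto3.client("s3")
# connection string is a placeholder; in practice pull credentials from Secrets Manager
engine = create_engine("mysql+pymysql://migrator:********@aurora-endpoint:3306/targetdb")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    # the S3 sink's JsonFormat writes one JSON record per line
    df = pd.DataFrame([json.loads(line) for line in body.splitlines() if line])

    # with the default partitioner the topic (== table) name is part of the key,
    # e.g. topics/CUSTOMER/partition=0/...
    table = key.split("/")[1].lower()
    df.to_sql(table, engine, if_exists="append", index=False, chunksize=1000)
    return {"rows": len(df), "table": table}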
Conclusion
For the migration of a 30 GB database, the above process took around 4 hours (source, sink and writing using Lambda). Considering the approaches I had analyzed, and the Big Bucks 💰 for setting up 50 topics plus paying for Kafka-Connect replication 💰, the above exercise was a saving grace. It's not perfect and may not suit everyone's needs, but it definitely helped solve my problem 😇
Potential drawbacks could be the limited Kafka partitions, nodes, failover, etc. So prepare your failure-handling mechanisms and logging in advance so that failures can be handled gracefully.