Why We Created an In-House Python Solution for Data Migration
Any modern organization, especially a business, creates a lot of data in its day-to-day functioning. This data is usually stored in digital databases called production databases. As you may imagine, these databases are constantly being updated as new data is created. In the interest of efficiency, only the data needed on a day-to-day basis is kept in the production database, and the rest is archived in a separate storage area.
This is where data lakes come in. They are storage spaces where all data is dumped and stored for any future use. This ensures that no data is lost, adding a layer of resilience in case of database failures, and it allows old data to be safely wiped from production databases to keep them efficient.
An organization also needs to keep track of its data, analyze it, and extract meaningful information from it. This is usually done by creating reports with important information presented in an accessible form, usually through graphs and other visualizations. This process is also automated in high-tech organizations and often involves combining data from various sources and performing the relevant computations.
As you can probably imagine, this is an intensive process involving many operations on the data. If it were done by querying the production databases directly, it would take much longer (since data insertion already consumes most of their processing time) and would put additional load on them, potentially causing failures. Any error in this process could bring the whole production database down, a situation no organization can afford.
This is where Data Warehouses come in. They are databases into which production data is migrated in bulk (either directly from production databases, or through a data lake) at regular intervals. All reporting and analytics queries are performed on this database, leaving the production database free to do what it does best — capture real-time data.
At Cashify, we use Amazon S3 as our data lake and AWS RDS Aurora (PostgreSQL) as our Data Warehouse, to store data for purposes of analytics and reporting.
Earlier, we used Amazon DMS (Data Migration Service) to migrate data from S3 into our Data Warehouse. It caused us recurrent problems:
- Some tables would be arbitrarily skipped, without raising any alert
- Special characters in the data would cause errors in migration
Every error would waste multiple work-hours, because the affected tables had to be erased and uploaded again from scratch. It would also hold up the reporting and analytics process in the meantime, as the data would not be available.
After some futile efforts to correct these issues, we decided to switch our ETL (extract, transform and load) method. We looked at various options, and decided upon AWS Glue as the tool we needed. It is advertised as a data migration solution, and since it belongs to the Amazon ecosystem, we felt it would fit right in.
As it turned out, we ended up abandoning the idea after struggling for a few days:
- Writing a custom Python script required manipulating the data with Spark DynamicFrames, which we were not familiar with.
- Glue did not allow the use of external Python libraries (like pandas, which most data scientists working in Python use).
- The documentation was too sparse, and not much help was available on StackOverflow either. This slowed down the development process significantly.
Additionally, we needed to create a (paid) development endpoint to try out Glue. All in all, we found Glue to be a bit restrictive and the documentation scanty, mainly because the tool is relatively new.
That said, AWS support helped us a lot when we were trying Glue. It would be fantastic if Glue also allowed the use of external Python libraries and improved its documentation so that businesses could solve their data migration issues end-to-end with a reliable tool that fits into the AWS ecosystem.
We next decided to create our own script using Python and deploy it on an AWS EC2 server. That would give us the flexibility we needed in terms of the libraries we could use and how we could schedule the migration.
The major libraries we used are:
- boto3 — to connect to S3
- psycopg2 and SQLAlchemy — to connect to the Data Warehouse and execute queries
- pandas — for data manipulation
- csv — to prepare the data for insertion into the Data Warehouse
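To give a feel for how these libraries fit together, here is a minimal sketch of the fetch-and-connect flow. The helper names, bucket, key, and connection string are placeholders for illustration, not our actual configuration.

```python
import io

import pandas as pd


def parse_csv_bytes(raw: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes (as returned by S3) into a DataFrame."""
    return pd.read_csv(io.BytesIO(raw))


def fetch_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    """Download a CSV object from the S3 data lake and parse it."""
    import boto3  # imported here so the parsing helper has no AWS dependency

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return parse_csv_bytes(obj["Body"].read())


def warehouse_engine(url: str):
    """Create a SQLAlchemy engine for the PostgreSQL Data Warehouse.

    The URL uses the psycopg2 driver, e.g.
    "postgresql+psycopg2://user:password@host:5432/warehouse".
    """
    from sqlalchemy import create_engine

    return create_engine(url)
```

With helpers like these, migrating one table reduces to fetching its CSV, cleaning it with pandas, and writing it out with `DataFrame.to_sql`.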
Our current pipeline works as follows:
- Business data is generated and stored in production databases in MySQL and MongoDB.
- Since S3 is our data lake, all data in production databases is migrated into S3 as CSV files. This is done by creating jobs on Amazon DMS for each database table, which execute hourly.
- Next, we need to migrate the data required for reporting and analytics into our Data Warehouse. For this, we now use our custom Python script, which fetches the CSV files from S3 and loads them into the Data Warehouse.
- MongoDB data is stored as JSON in the CSVs. To migrate it to the Data Warehouse, these JSON fields need to be opened up (flattened), i.e. converted into table columns so that they can be inserted into a SQL table. We do this using pandas and Python's json library.
- In the Data Warehouse, each table’s data is first stored in a temporary “staging” table. Later it is migrated into the main table. We do this because pandas loads the CSV from S3 as “object” type fields, not the data types required by the Data Warehouse tables.
- Finally, the data in the staging table is typecast into the required types and inserted into the main table.
- The temporary “staging” table is deleted. At this point the migration is complete.
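The flattening and staging steps above can be sketched as follows. `pd.json_normalize` is one way to flatten nested JSON (our script's exact approach may differ), and the `orders` table and its columns are hypothetical, not our real schema.

```python
import json

import pandas as pd


def flatten_json_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Expand a column of JSON strings (e.g. a MongoDB document dump) into
    flat columns, prefixed with the original column name.

    Assumes df has a default RangeIndex so the concat aligns row-by-row.
    """
    expanded = pd.json_normalize(df[column].apply(json.loads).tolist())
    expanded.columns = [f"{column}_{c}" for c in expanded.columns]
    return pd.concat([df.drop(columns=[column]), expanded], axis=1)


# Staging-to-main pattern: the CSV data sits in a staging table as generic
# text, then gets typecast into the main table's column types; the staging
# table is dropped once the promotion succeeds.
PROMOTE_SQL = """
INSERT INTO orders (id, amount, created_at)
SELECT id::bigint, amount::numeric, created_at::timestamp
FROM orders_staging;

DROP TABLE orders_staging;
"""
```

Keeping the typecasting in SQL rather than in pandas means a single bad row fails loudly inside the database transaction instead of silently coercing types in the DataFrame.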
I will cover the technical details and the issues we encountered in this project and how we solved them, along with code samples, in a subsequent article. Stay tuned!