Continuously Synchronizing Large Databases From On-Premises to the Cloud (part 1)
Introduction
Stakeholders and product owners often have good ideas for their new product. Fortunately, much of the time these can be achieved with existing technologies. However, sometimes their imagination goes beyond that, and they lay out a requirement that, at first glance, seems easy to implement:
I, as a Product Owner, want to continuously synchronize the data from my clients' databases to a cloud instance, so that I can sell a white-labeled web application that utilizes that cloud persistence as a data store.

This would be a piece of cake if the subject database were small and managed, and if there were only one client. The reality, however, was the following:
- The subject database is from a 3rd party application and not managed by the client;
- The database schema changes when the 3rd party application gets updated;
- There are multiple clients with the same 3rd party application, but on different versions, and therefore with different database schemas;
- The update cycles of the 3rd party applications are determined by each client individually;
- The database is not small, and it is growing in size each day.
Given all that, I had a bumpy road ahead to analyze the problem and find the best solution.
Data Analysis
To identify the gravity of the problem and to determine all the risks, I first performed an analysis of the data. The source database, for a known client, is 1 TB on disk. Continuously performing real-time synchronization at this scale simply would not work with any technology known to me. The good news was that I did not need all the data from this huge database; I needed only a data sample.

The sample is a total of 66.5 million records from 22 tables, spanning 521 columns, with a disk size of around 40 GB.
This sample could be filtered down to even fewer records based on certain business rules. However, business requirements change over time, and records that are not necessary now might become important later. If the technology permits, it is a good idea to synchronize unfiltered data, so that upgrades and maintenance are kept to a minimum at later stages.
So, for the research and development of the proof of concept, I will work with the worst case and use as much data as possible. This will also serve as a sort of benchmark for the system.
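As a quick sanity check on the sample figures quoted above, the record count and disk size imply an average row size of roughly 600 bytes:

```python
# Back-of-the-envelope check of the data sample figures:
# 66.5 million records occupying roughly 40 GB on disk.
records = 66_500_000
sample_bytes = 40 * 10**9  # using decimal gigabytes (1 GB = 10^9 bytes)

avg_row_bytes = sample_bytes / records
print(f"Average bytes per record: {avg_row_bytes:.0f}")  # ≈ 602 bytes
```

This average includes index and storage overhead, so the raw row payload is likely smaller, but it gives a feel for the granularity of the data being moved.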
The Problem
Data transfer from a source to a destination always carries its own challenges. To better understand the problem, a good start is to work through common, almost everyday use cases. If we wanted to move data between two pieces of hardware on-premises, certain parameters would have to be calculated to determine the amount of time required to do so.
For example, the analysis of moving data from a computer to a USB stick might look like the following:

USB 2.0
- USB 2.0 is specified as 480 Mbps = 60 MBps
- The effective maximum is about 53 MBps once hardware overhead is considered
- It takes about 18.87 seconds to transfer 1 GB of data
- For my 40 GB data sample it would take about 755 s, or 12.6 minutes
USB 3.0
- USB 3.0 is specified as 5 Gbps = 625 MBps
- The effective maximum is around 500 MBps once hardware overhead is considered
- It takes about 2 seconds to transfer 1 GB of data
- For my 40 GB data sample it would take 80 s, or about 1.3 minutes
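The arithmetic behind both estimates can be sketched in a few lines. The effective throughputs are the figures assumed above (~53 MBps for USB 2.0, ~500 MBps for USB 3.0), and decimal units are used (1 GB = 1000 MB):

```python
# Transfer-time arithmetic for moving the 40 GB data sample,
# using effective (post-overhead) throughput figures.

def transfer_seconds(size_gb: float, throughput_mbps: float) -> float:
    """Seconds needed to move size_gb gigabytes at throughput_mbps MB/s."""
    return size_gb * 1000 / throughput_mbps

for name, mbps in [("USB 2.0", 53), ("USB 3.0", 500)]:
    secs = transfer_seconds(40, mbps)
    print(f"{name}: {secs:.0f} s ({secs / 60:.1f} min) for 40 GB")
```

Running it reproduces the USB 2.0 (~755 s) and USB 3.0 (80 s) figures above.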
On-Premises to the Cloud
After seeing the results for two pieces of hardware on-premises (a computer and a USB stick), I started searching for options to transfer data from on-premises to the cloud.
By coincidence, my Apple iCloud storage contained around 60 GB of photos and videos backed up from my iPhone, which is reasonably close to the 40 GB size of my data sample. I therefore decided to test the transfer speed by uploading the photos from my phone to the Amazon Photos service. The process took several hours from start to finish.
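A multi-hour upload is exactly what the arithmetic predicts once the bottleneck shifts from local buses to the internet uplink. The uplink speeds below are illustrative assumptions, not measurements from my test:

```python
# Rough upload-time estimates for the 40 GB sample over typical
# uplinks. The Mbit/s values are assumed for illustration only.

def upload_hours(size_gb: float, uplink_mbit_s: float) -> float:
    """Hours to upload size_gb gigabytes at uplink_mbit_s megabits/s."""
    return size_gb * 8000 / uplink_mbit_s / 3600  # GB -> Mbit, then hours

for uplink in (20, 50, 100):  # assumed uplink speeds in Mbit/s
    print(f"{uplink:>3} Mbit/s uplink: {upload_hours(40, uplink):.1f} h")
```

At an assumed 20 Mbit/s uplink, 40 GB takes well over four hours even before accounting for protocol overhead and throttling, which is consistent with what I observed.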
I looked for other people's experiences with similar use cases, and I stumbled upon this not-so-encouraging response on an online forum:

The community clearly displays a dose of justified skepticism about the maturity, reliability, and performance of the network when working with a combination of on-premises devices and cloud resources.
Although the initial analysis did not look promising, we software engineers are problem solvers, and I pursued this challenge to come up with a solution.
I will lay out different technologies and solutions in the following parts of this series.
