Data Migration, Or: How I Learned to Stop Worrying and Love the ETL Process

Timothy Kachler
4 min read · Jan 10, 2019

Photo by Ethan Weil on Unsplash

If you’ve spent any amount of time working in the software or DevOps engineering world, chances are the phrase “data migration” does not conjure up scenes of happiness, tranquility, and nostalgia. (If it does, perhaps you are a wizard/warlock/witch, or you have a very selective memory.) Whether you’re moving data from a legacy on-prem server to the latest and greatest cloud-based solution, or your company just merged with another and you drew the obnoxiously short straw of integrating the various business systems, the process of getting data from point A to point B can be tedious and daunting.

We typically refer to this cumbersome process as ETL (“Extract, Transform, Load”). In case you’re new to ETL, here’s a brief explanation of how it works:

Step 1 — Extract:

Extract is the process through which data is plucked and gathered from different locations — commonly referred to as sources. Because very little is simple in the ETL space, these sources can vary in terms of type, size, and complexity. Are you dealing with flat files, such as CSV, JSON, or XML, or are you navigating SQL or NoSQL databases? Are you handling kilobytes or gigabytes of data? And how does the size of that data affect the amount of time needed to extract it? These are all crucial considerations when planning and executing the extraction phase of ETL.
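
To make that concrete, here is a minimal sketch of an extract step in TypeScript on Node.js: it streams rows out of a CSV file using the built-in fs and readline modules. The file path and the naive comma split are illustrative assumptions, not production-grade CSV parsing.

```typescript
// A minimal "extract" sketch: stream rows out of a CSV source file.
// The path and the column layout are hypothetical.
import { createReadStream } from "fs";
import { createInterface } from "readline";

async function extractCsv(path: string): Promise<string[][]> {
  const rows: string[][] = [];
  const reader = createInterface({ input: createReadStream(path) });
  for await (const line of reader) {
    // Naive split; a real pipeline would use a proper CSV parser
    // to handle quoted fields and embedded commas.
    rows.push(line.split(","));
  }
  return rows;
}
```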

Step 2 — Transform:

Transform is the next step. Once the data has been extracted, it is typically processed, or “transformed,” prior to being loaded to its target destination. Naturally, the types, requirements, and scopes of transformations are multitudinous and can range from data validation to cleaning, to deduplication, and so on.
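
As a rough illustration, the sketch below picks up the rows from the extract sketch above and applies three common transformations: validation, cleaning, and deduplication. The UserRecord shape and the email-based rules are hypothetical.

```typescript
// A minimal "transform" sketch: validate, clean, and deduplicate
// the extracted rows. Field names are hypothetical.
interface UserRecord {
  email: string;
  name: string;
}

function transform(rows: string[][]): UserRecord[] {
  const seen = new Set<string>();
  const records: UserRecord[] = [];
  for (const [email, name] of rows) {
    const cleanEmail = (email ?? "").trim().toLowerCase(); // cleaning
    if (!cleanEmail.includes("@")) continue;               // validation
    if (seen.has(cleanEmail)) continue;                     // deduplication
    seen.add(cleanEmail);
    records.push({ email: cleanEmail, name: (name ?? "").trim() });
  }
  return records;
}
```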

Step 3 — Load:

Now that the data has been transformed in some way, it’s time to load it to its final destination. Load may be the last step in the process, but (are you noticing a theme here?) it is not without its own complexities. Much like the extraction step, you have to account for where the data is being loaded to — for example, a flat file (CSV, JSON, or XML) or a database (SQL or NoSQL).
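
Continuing the sketches above, here is one hedged example of a load step whose destination is a JSON flat file; a database destination would instead batch the records into bulk inserts.

```typescript
// A minimal "load" sketch: write transformed records to a JSON flat file.
// UserRecord repeats the hypothetical shape from the transform sketch.
import { writeFile } from "fs/promises";

interface UserRecord {
  email: string;
  name: string;
}

async function load(records: UserRecord[], destination: string): Promise<void> {
  await writeFile(destination, JSON.stringify(records, null, 2), "utf8");
}

// Wiring the three sketches together:
// const records = transform(await extractCsv("users.csv"));
// await load(records, "users.json");
```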

Hmm, what do I need to consider when selecting an ETL tool…?

Considerations:

All of this leads us to the question: what should we be considering when selecting an ETL solution? There are, in fact, several things you should think about when evaluating different tools and libraries, some of which include:

  • Cost: How much will an ETL solution cost your team in dollars on a monthly, quarterly, and annual basis? Many tools can range from hundreds to thousands of dollars per month, depending on the volume of data you’re processing. Additionally, you should always be aware of potential hidden fees, such as annual support contracts.
  • Onboarding: How much will an ETL solution cost your team in terms of time and human resources? How quickly can your team get up to speed and working with any given ETL product? Unfortunately, many solutions and tools out there are behemoths with steep learning curves.
  • Flexibility: Can the solution be quickly and easily configured to satisfy your requirements? Is it resilient enough to buffer and process large volumes of data? Can repetitive tasks be easily scheduled? Can tasks be triggered upon completion of other tasks? (See the sketch after this list.)
  • Documentation & Support: Perhaps the most important consideration should be, how well does this solution react when we run into an issue? If it’s a paid service, how does support work? Is the documentation adequate? Can you talk to a person about that documentation? If it’s an open-source solution, how well is the GitHub documentation written? Is the code well-commented and easy to understand?

Open-source solutions:

If you’re not working at a Fortune 500 company with a budget to match, what are your options? The open-source community certainly has offerings, but unfortunately, many of them lack robust or comprehensible documentation, are not actively maintained, or aren’t flexible enough to accommodate the varied types and requirements of extraction, transformation, and loading. This gap compelled us to create our own set of tools: Kangaru-ETL and RxJS-ETL.

RxJS-ETL is a modular platform built on RxJS observables that lets developers create stream-based ETL pipelines. It offers additional capabilities such as buffering, bulk insertion, notifications upon task completion, job scheduling, and task dependencies.
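
To show the general shape of such a pipeline, here is a minimal sketch using plain RxJS operators. It illustrates the stream-based pattern (including buffering for bulk insertion) rather than the RxJS-ETL API itself, and the sample rows are made up.

```typescript
// A minimal stream-based ETL sketch with plain RxJS (not the RxJS-ETL API).
import { from } from "rxjs";
import { bufferCount, filter, map } from "rxjs/operators";

const sourceRows = [
  { email: " Ada@Example.com ", name: "Ada" },
  { email: "not-an-email", name: "???" },
];

from(sourceRows) // extract: emit each row as a stream event
  .pipe(
    map(row => ({ ...row, email: row.email.trim().toLowerCase() })), // transform
    filter(row => row.email.includes("@")),                          // validate
    bufferCount(100)                                                  // buffer for bulk insertion
  )
  .subscribe(batch => {
    // load: a real pipeline would bulk-insert the batch here
    console.log(`loading ${batch.length} rows`);
  });
```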

Kangaru-ETL is a user-friendly, cross-platform Electron application built on top of the RxJS-ETL library. Those who want or need a more visual, guided experience with RxJS-ETL can run Kangaru-ETL to import and export files, connect to databases, write transformation scripts, and set up a queue of scheduled ETL jobs.

Both of these tools have been labors of love, and we certainly hope they can be useful to the open-source community.

Resources:

https://medium.freecodecamp.org/sqlalchemy-makes-etl-magically-easy-ab2bd0df928
