DB Migrations through CSVs

Pedro Ismael Jara
Published in Wolox
9 min read · Dec 6, 2018


When faced with the problematic situation of migrating a DB, we all start panicking: we know we have a complex task ahead of us, and the biggest problems we usually face are:

  • Compatibility between origin and destination DB
  • Data validation
  • Information adaptation to the new schema
  • Files migration

Most of us get stuck at the very first step of the process. I was confronted with the task of migrating an old DB, and even though I could find tools capable of adapting the origin DB to the destination DB, they would not solve any of the other problems that migrations face.
After a lot of thought, it occurred to me: why migrate through SQL? Why not use the power of my application (App) for all the data validation and all the adaptation of the information to the new schema?

I was much closer now: if I used the DB conversion tools, I would only need to connect my App to both DBs and use it to move the data from one end to the other. But this would require SQL knowledge, of which I did not have plenty, and I wanted to use the power of Rails to its fullest. There had to be a nicer and cleaner solution that would exempt me from SQL altogether and only require the knowledge of the field I’m most adept with. After several hours of trial and error, I thought: “Why not skip the DB compatibility issue entirely and just use some universal medium?” That’s how I ended up exporting all the tables I needed into CSVs that could be easily parsed by my RoR application, running all my model validations while adapting the information to the new schema.

So now we have a universally adaptable method for migrating from any DB type to any DB type, through whatever App we utilize:

First Steps

Before we begin any migration, the first thing we should do is take the time to study the structure of both DBs, origin and destination, so we have a clear plan to follow when we start migrating each table. We should be able to answer questions like:

  • What’s the information this table contains?
  • Is this relevant/important?
  • How does this table translate into my own database?
  • What do the columns this table contains represent in my own schema?
  • How does this registry relate or interact with the rest of the DB?
  • Do I need to maintain those relationships?

With this information, the following steps will prove to be really simple. Otherwise, migrating a DB whose structure we do not comprehend into another DB we are not familiar with will be catastrophic.

Having finished a plan where we can see the data we wish to migrate and what it represents in our own structure, we can start exporting each individual table into its own respective .csv. In this case, we used CSV because Ruby’s CSV library allows simple and effective processing, and it’s a format that is very easy to export from any DB; note, however, that you could use any format that your App can interpret and your DB can export to.
There are several ways you can achieve this, through console commands or your favorite visual tool. Here you can find some examples you can follow:

https://support.spatialkey.com/export-data-from-database-to-csv-file/

But I have some tips you should keep in mind that will help with the whole process:

  1. All the filters you can easily perform through SQL, do in the same query that exports the table to CSV, so you don’t have to check those conditions while processing the file.
  • Things such as status = 'approved', type = 2, etc.
  • WHERE x_column IS NOT NULL (note that x_column != NULL is not valid SQL)

2. Don’t overload your CSV with unnecessary columns or information you do not require in your migration

3. Depending on the data you are going to migrate, consider using special characters such as ‘|’ and ‘¬’ for your column and row separators instead of the defaults ‘,’ and ‘;’, so they don’t conflict with the information present in the CSV

4. Save the queries you use to export each individual table, with the required fields and the WHERE clauses you need, somewhere safe. You never know when you might need to revisit or re-use that code!

  • For example, we could have a query that brings us all the active Pokemons as those might be the only ones that we should migrate:
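For instance, assuming the origin table is called pokemons and has a status column (the names are illustrative), the export query could look like this, with the WHERE clause doing the filtering up front:

```sql
-- Only the columns we need, only the active records.
SELECT id, name
FROM pokemons
WHERE status = 'active'
  AND name IS NOT NULL;
```

Tools like bcp on SQL Server or \copy on Postgres can then write this query’s result straight to a CSV file with the separator of your choice.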

Now if we need to revisit that table, we know the columns we migrated and the filters we applied.

Now that we have a beautiful CSV for our App (in my case RoR) to parse, we can begin the process of moving that information to our DB through the power of magic and computers.
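As a taste of what the parsing side looks like, here is a minimal, self-contained sketch in Ruby that reads a CSV exported with the ‘|’ separator from the tips above (the data is made up; in the real migration each row would be handed to a model for validation instead of being collected into hashes):

```ruby
require 'csv'

# A small exported table, inlined here for the example.
raw = <<~DATA
  id|name
  1|Bulbasaur
  2|Charmander
DATA

# col_sep tells Ruby's CSV library which separator the export used;
# headers: true lets us access fields by column name.
pokemons = CSV.parse(raw, col_sep: '|', headers: true).map do |row|
  { id: row['id'].to_i, name: row['name'] }
end

pokemons.each { |p| puts "#{p[:id]} - #{p[:name]}" }
```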

The Data Migration Process

Processing the CSV could be considered the hardest part of this process, but with a few tips and guidelines you’ll find it’s really simple and easy. I’ll also be providing some examples along the way that you can copy and adapt to your own needs. In my case, I’ll be using Ruby 2.3.4, Rails 5.1.3, and the pg gem 0.18 so Ruby can communicate with our PSQL DB.

During the development of this part, we will use a simple example: we have a DB in SQL Server that we want to transfer to Postgres. It contains some Pokemons, each with a name and an id, which will be carried over to a new schema with the same structure.

  1. First, we’ll create a rake task in lib/tasks/your_task_name.rake
  2. We do that by running the following command:

$ rails g task your_task_name my_task
# create lib/tasks/your_task_name.rake

And to run the code in our task, we will simply do:

$ rake your_task_name:my_task

  3. Now, we will add the logic to this task. We should decide whether we want the whole table migration to roll back when a single record fails the validations of our models, or whether we will take note of the failed records and migrate them individually afterward.
  4. Let’s use the example introduced above with a DB that contains some Pokemons. Our target schema could be this one:

Pokemon
id : Integer
name : String
If you choose the first option, rolling everything back on failure, we would have code similar to this:
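Here is a sketch of this first option, assuming a Rails app with a Pokemon model and the pokemons.csv we exported earlier (file name and separator follow the tips above):

```ruby
# lib/tasks/pokemon_rollback_migration.rake — sketch
require 'csv'

namespace :pokemon_migrate do
  desc 'Migrates the Pokemons CSV, rolling the whole table back on any failure'
  task import: :environment do
    # One transaction around the whole file: a single invalid record
    # (create! raises ActiveRecord::RecordInvalid) undoes every insert.
    ActiveRecord::Base.transaction do
      CSV.foreach('pokemons.csv', col_sep: '|', headers: true) do |row|
        Pokemon.create!(id: row['id'], name: row['name'])
      end
    end
  end
end
```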

Following this method, we make sure our DB remains clean and consistent, and the migration only completes when it’s guaranteed to finish correctly.

Otherwise, log each record that fails the validations so you can check that information when the task finishes:
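A sketch of this second option, with the same assumptions; failed records are logged instead of aborting the whole run:

```ruby
# lib/tasks/pokemon_log_migration.rake — sketch
require 'csv'

namespace :pokemon_migrate do
  desc 'Migrates the Pokemons CSV, logging each record that fails validations'
  task import_with_log: :environment do
    CSV.foreach('pokemons.csv', col_sep: '|', headers: true) do |row|
      pokemon = Pokemon.new(id: row['id'], name: row['name'])
      next if pokemon.save

      # Log enough detail to build the reduced CSV of failed records later
      Rails.logger.error("Row #{row.to_h} failed: #{pokemon.errors.full_messages.join(', ')}")
    end
  end
end
```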

Remember, once you finish running this task for the first time, you’ll have to check the logs for all the records that failed to migrate, then run the process again with a reduced dataset containing only the failed records, plus the fixes for the errors that occurred while migrating.
Depending on the consistency of your information and the number of records, this is a very valid alternative. Having 10 records fail out of 100,000, fixing those particular cases, and running the migration task again with the reduced dataset is very simple compared to waiting hours for a task to finish, only for it to fail at the last moment and having to run it all over again.
In the first scenario, it’s also highly recommended that you log the failed records with a complete trace, so you can perform the necessary fixes in your task, validations or information.

  • After you finish migrating a table, make sure the DB’s counter for IDs is set accordingly to start from where you left off:
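With PostgreSQL as the destination, Rails’ connection adapter exposes reset_pk_sequence!, which points the table’s ID sequence past the current maximum:

```ruby
# Re-sync the pokemons id sequence with the highest id we inserted,
# so future Pokemon.create calls don't collide with migrated records.
ActiveRecord::Base.connection.reset_pk_sequence!('pokemons')
```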

Leaving us with the following complete task:
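Putting the pieces together, the whole thing could look like this (a sketch assuming the Rails setup from our example):

```ruby
# lib/tasks/pokemon_migrate.rake — the complete task, sketched
require 'csv'

namespace :pokemon_migrate do
  desc 'Imports Pokemons from CSV in one transaction, then fixes the id sequence'
  task import: :environment do
    ActiveRecord::Base.transaction do
      CSV.foreach('pokemons.csv', col_sep: '|', headers: true) do |row|
        # create! raises on invalid records, rolling the whole import back
        Pokemon.create!(id: row['id'], name: row['name'])
      end
    end
    # Only reached when the transaction committed successfully
    ActiveRecord::Base.connection.reset_pk_sequence!('pokemons')
  end
end
```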

This will make sure that models you instantiate for that class later on will not try to step on the IDs you forcibly occupied during the migration.

Now we are ready to run our rake task by typing namespace:task. In our example, that translates to:

$ rake pokemon_migrate:import

Tips:

  • Maintain the IDs if possible! If for some reason the IDs of the records you want to migrate are already occupied, you can add a specific offset to all the IDs of the migration (like 100,000), so you know that every ID starting from that number belongs to a record produced by your migration. This will ensure all the relations are easily maintained when migrating future tables.
  • Remember to start from the top model that does not depend on any other to instantiate. For example, if we are migrating the towns in Pokemon, we would first start by migrating the regions, then the towns that belong to that region and then the Pokemons that belong to that region/town.
  • If you have enough resources in your server and the tables you are migrating are not codependent, you can run your migrations simultaneously, drastically reducing the time you need to finish the migration.
    Keep in mind that if you are running case a. pokemon_rollback_migration.rake, the size of the transaction will also impact the resources of your DB’s server!

But what about migrating files? Pictures? Buckets? Can we utilize this method for that too? Of course! Not only that, I would highly recommend that you do. In the next section, we will go in-depth into how to do it, with examples that you can follow and adapt to your own needs.

Migrating Files

When it comes to files, our applications may have a lot of validations, some of which are not present in the DB, and the tooling required for this kind of migration can be really hard to implement with SQL or other DB migration software. But if the medium is our own application, we can utilize all the tools our App provides.

Let’s see this more clearly with an example. Remember our Pokemons? It looks like they have profile photos too:

Here we are using CarrierWave as the uploader:
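The model side could look like this minimal sketch, assuming an uploader class named ProfilePhotoUploader (the class name is an assumption):

```ruby
# app/models/pokemon.rb — sketch, assuming a Rails app with CarrierWave installed
class Pokemon < ApplicationRecord
  # Adds the profile_photo, profile_photo= and remote_profile_photo_url=
  # accessors that we will rely on during the file migration.
  mount_uploader :profile_photo, ProfilePhotoUploader
end
```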

https://github.com/carrierwaveuploader/carrierwave

And we have the following table in our origin DB that links a Pokemon to its ProfilePhoto:

ProfilePhoto
id : Integer
pokemon_id : Integer
url : String

We need to find the Pokemon associated with the profile photo and then migrate its picture to the one linked by the URL.
Depending on your situation and requirements, you have two routes you can take, each with its own pros and cons:

  1. With the URL of the file’s location, you can download and re-upload the file to your own S3 bucket, running all the photo validations and versioning. By default, CarrierWave provides a method that downloads a picture located remotely at a URL (remote_profile_photo_url) and then uploads it to our own bucket, so we will take advantage of that to move the photos from the old bucket to our new one.

Pros: Run all validations on the pictures you are gonna upload to your application. Run the versioning for all the sizes you might need of each photo. Use the new storage or bucket if you have one.

Cons: Slow and very high resource consumption.
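A sketch of this first route, assuming the old ProfilePhoto table was exported to profile_photos.csv with id|pokemon_id|url columns (the file name and columns are assumptions):

```ruby
# lib/tasks/photo_migrate.rake — sketch
require 'csv'

namespace :photo_migrate do
  desc 'Downloads each photo from the old bucket and re-uploads it via CarrierWave'
  task import: :environment do
    CSV.foreach('profile_photos.csv', col_sep: '|', headers: true) do |row|
      pokemon = Pokemon.find_by(id: row['pokemon_id'])
      next if pokemon.nil?

      # CarrierWave fetches the file from the old location, runs validations
      # and versioning, and uploads the result to our own storage.
      pokemon.remote_profile_photo_url = row['url']
      Rails.logger.error("Photo failed for Pokemon #{pokemon.id}") unless pokemon.save
    end
  end
end
```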

2. With the URL, migrate the column to link to the original file location.

Pros: Quicker process and not very consuming when it comes to resources

Cons: The original location might have corrupted files or might not have all the versions you require for the photo. You must update the URLs for each version of the photo and ensure there is a URL for each version; you might have to recreate the versions of the photos anyway. It also maintains the original location of the photo, and you may not even have admin access to the bucket holding it.
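A sketch of this second route; here we write the URL straight into the column, bypassing CarrierWave’s download, validations and versioning entirely (the file and column names are assumptions):

```ruby
# lib/tasks/photo_url_migrate.rake — sketch
require 'csv'

namespace :photo_migrate do
  desc 'Points profile_photo at the original file location, without downloading'
  task link_urls: :environment do
    CSV.foreach('profile_photos.csv', col_sep: '|', headers: true) do |row|
      # update_all writes the column directly: no validations, no callbacks,
      # and no file is ever downloaded or re-uploaded.
      Pokemon.where(id: row['pokemon_id']).update_all(profile_photo: row['url'])
    end
  end
end
```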

By now, you’ll have learned that DB migrations don’t have to be a complex and hard process: you don’t need to learn any migration software, SQL, etc. We can manage to migrate an entire DB and its files, no matter the origin and destination, by using the things we already know and have at our disposal.

I urge you to go ahead and try it for yourself, to give me your opinion and feedback and to add your expertise and knowledge so we can improve upon this.
