Migrating data from one Amazon Redshift database to another using Amazon S3
There are many ways to move data from one database to another with Amazon Redshift, but one of the most efficient is the combination of the UNLOAD and COPY commands, which let you move data between databases almost seamlessly. This tutorial walks through the steps to move tables from one Amazon Redshift database to another.
1. Create an S3 folder to store the unloaded tables
To use the UNLOAD and COPY commands effectively, we need the Amazon S3 service: an S3 folder to hold the unloaded files and an IAM role with permissions to access Amazon S3. The S3 folder will act as a bridge between the two Amazon Redshift databases.
If you don’t have permissions to create an IAM role with access to Amazon S3, talk to your organization’s infrastructure or DevOps team so they can create one for you.
2. Selecting the tables to unload
Before unloading the tables you need to migrate, it is convenient to run a couple of queries in your source database to make sure you are unloading the right data. Also keep track of the fields your query returns so you can reference them later in the COPY command.
An interesting advantage of the UNLOAD command is that it accepts a query instead of just a specific table to be unloaded to S3. This opens up possibilities such as UNION statements and JOINs across different tables, which is why UNLOAD can be used much like an ETL tool and can be very powerful in automated environments.
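For example, a quick preview query like the following (the tables and columns are hypothetical) helps you confirm the rows and the column order before you embed the same SELECT in the UNLOAD command:

-- Hypothetical tables and columns: adjust to your own schema.
-- Running this first confirms the data and the column order you will reuse later.
SELECT o.order_id, o.amount, c.customer_name
FROM your_schema.orders o
JOIN your_schema.customers c ON c.customer_id = o.customer_id
LIMIT 100;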
3. Using the UNLOAD command
One of the best ways to move tables from Amazon Redshift to Amazon S3 is the UNLOAD command. It takes a SQL query, an S3 path and an IAM role with permissions to access Amazon S3, and writes the result of the query to an S3 folder. Here is an example of what the command looks like:
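In this sketch the query, the S3 bucket path and the IAM role ARN are placeholders; replace them with your own values.

-- Placeholders: use your own query, bucket path and IAM role ARN.
UNLOAD ('SELECT column_a, column_b, created_at FROM your_schema.your_table')
TO 's3://your-bucket/your-folder/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-s3-role'
ALLOWOVERWRITE
FORMAT AS CSV;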
The ‘allowoverwrite’ parameter lets us overwrite the files we create every time we run the command against the same S3 folder, which is useful for ETL processes where you need to clean and re-create your data. The ‘format as csv’ part forces the UNLOAD command to generate comma-separated files instead of the default format, which is pipe-delimited (|). Also make sure that the S3 path in the command ends with a slash (/); this avoids unloading the files into the parent folder.
Once the UNLOAD command has been executed in your source database, you can check the unloaded files in the folder you specified. The UNLOAD command usually splits your data into several partitions (files) and does not add a ‘.csv’ suffix to them.
4. Creating the destination table
To copy data from Amazon S3 to Amazon Redshift, we need a schema and a table created in our destination database. We have to make sure that the structure of this new table (data types and column names) matches the table we ‘unloaded’ the data from.
To create the new table in the destination database, we can use a simple ‘CREATE TABLE’ statement, like this:
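The column definitions below are only an illustration; use the same column names and data types as your source table.

-- Hypothetical columns: mirror the names and data types of the source table.
CREATE TABLE your_schema.your_table (
    column_a   INTEGER,
    column_b   VARCHAR(256),
    created_at TIMESTAMP
);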
If you are using a database administration tool like DBeaver, you can generate the ‘CREATE TABLE’ statement from the source table by right-clicking the table, selecting “Generate SQL” and then “DDL”. This shows a dialog with the ‘CREATE TABLE’ statement, which you can copy and execute on the destination database to create the table.
5. Using the COPY command
The COPY command lets you load data from many big data file formats into Amazon Redshift in a short period of time, which makes it a useful tool for any ETL process. We are going to use it to ‘copy’ the data we unloaded previously, moving the files from our Amazon S3 folder into our destination database.
Once your destination table is created, you can execute the COPY command. This command takes the schema followed by the name of your table, the fields you want to copy, the path to your S3 folder, the IAM role with access to Amazon S3 and the format of the files you are copying from (CSV in our case). The COPY command should look like this:
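As with the UNLOAD example, the table, column list, bucket path and IAM role ARN here are placeholders; replace them with your own values.

-- Placeholders: use your own table, columns, bucket path and IAM role ARN.
COPY your_schema.your_table (column_a, column_b, created_at)
FROM 's3://your-bucket/your-folder/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-s3-role'
FORMAT AS CSV;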
Once the COPY command has been executed, the data you are migrating from the source database should appear in the new table. Verify it with a simple query (‘select * from your_schema.your_table’) just to make sure all the data is there. If any of the commands fails or generates permission errors, it is very likely that the IAM role you are using doesn’t have permission to access the Amazon S3 files.
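As an extra sanity check, you can compare row counts between the two databases; a minimal sketch (the table name is a placeholder):

-- Run this in both the source and the destination database and compare the results.
SELECT COUNT(*) FROM your_schema.your_table;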
Conclusion
It is not always evident which tools we should use to migrate data from one database to another when working with Amazon Web Services. We can get entangled in the variety of tools and services that Amazon provides, making migrations more complicated than they need to be. That is why we should always strive for simplicity when looking for a good solution (Occam’s razor, the KISS principle).