How to migrate AWS RedShift dc2 to ds2 node cluster

Reference: https://bit.ly/2TXLkBu

One of our customers were using dc2.large cluster with 14 nodes in North Virginia (us-east-1) region and they had following two challenges:

  1. Cost: $3.75 per hour for 14 nodes.
  2. Space: ds2.large has only 160 GB SSD so it gets filled quickly

Keeping in mind these two challenges, we suggested trying ds2.xlarge 2 node cluster which is a dense storage cluster where each node has 2 TB HDD space.

The new cluster is cheaper, just 1.7 $ per hour and provides more storage space so sufficient for quite some time.

The queries on ds2.xlarge cluster started taking almost double time but our customer is okay so we started migrating to the new cluster.

Creating ds2.xlarge cluster from the existing cluster

  1. Take snapshot from the existing dc2.large cluster
Taking a snapshot of existing redshift cluster

2. Create a new dc2.large cluster from the snapshot

Select Restore from Snapshot to create a new cluster
Create a new dc2.large cluster from the snapshot by selecting a different cluster identifier
We can’t directly create a ds2.xlarge cluster from this snapshot as it was taken from dc2.large cluster so first, we need to create another dc2.large cluster and then resize it using classic resize option.

3. Resizing the newly created dc2.large to ds2.xlarge cluster

Resize the newly created cluster to convert to a ds2 cluster
Select classic resize to change Node type to ds2.xlarge
We can only use Classic resize to change the Node type and during this time the cluster becomes Read-only.

4. Once the cluster new ds2.xlarge is ready, we need to do the performance and functionality testing.

The new ds2.xlarge cluster is ready and healthy

Data migration

Now we have tested the functionality and performance on the newly created ds2.xlarge cluster and ready to do migrate to this cluster.

During the testing time, there were some data got ingested to the dc2.large cluster so we had to migrate that data to the dc2.xlarge cluster. We had all the data present in S3 so we wrote a script using RedShift COPY command to pull the data from S3 to the ds2.xlarge cluster.

Steps:

  1. Determine the reference point

We need to first identify the point at which both clusters differed, this was not a very easy task with large amounts of the data.

RedShift stores the history of copy commands executed in stl_load_commits and the following query helped us quite a lot.

SELECT query, trim(filename) AS file, curtime AS updated
FROM stl_load_commits ORDER BY file DESC LIMIT 5
History of copied files from S3 using the COPY command
This query only worked on the old cluster and on the new cluster it gave empty results so we compared the data present in Redshift tables and the data present in S3 using timestamps to determine the reference point for our data migration.

2. Once we identified the reference point we realized that there are many files to migrate from S3 so we created the manifest files first and then copied the data using the manifest files.

Datewise data organization in S3

We used the following code to create one manifest per hour of the data.

https://github.com/tensult/aws-automation/blob/master/create-redshift-manifest-file-from-s3.js

This will generate manifest files in the local directory which were later uploaded to S3 bucket using AWS CLI s3 sync command before executing the COPY command on Redshift cluster.

RedShift manifest files in S3
The first few manifest files generated might have few source files which are already copied to the new cluster so please make sure to remove them from manifest file to avoid data duplication using the reference point.

3. Now we need to execute RedShift COPY commands with these manifest files as input and we used followed code to do the same.

https://github.com/tensult/aws-automation/blob/master/redshift-bulk-copy-from-s3.js

You may need to make modifications to “COPY command” section in the above code based on your data.

Based on the amount of data this might take some time so be patient and keep monitoring RedShift COPY command executions from AWS console.

Monitor RedShift COPY command Executions

Once the initial data migration is completed, we can start using the new cluster by changing endpoints in all the places where the old dc2.large cluster used.

4. During this data migration, there might be some data which might have got ingested to the old cluster so after we update the endpoints we need a final data migration by comparing stl_load_commits table on the new cluster with S3 bucket.

5. Now the migration to ds2.xlarge from dc2.large cluster is completed but keep monitoring the cluster for some more time.

Conclusion

In this blog, we discussed various steps involving migrating/resizing RedShift cluster. I hope this is useful to you and your work. For more such blogs, kindly follow me here and also feel free to share your thoughts about the blogs in the comments section below.