How to migrate AWS RedShift dc2 to ds2 node cluster
This Blog has moved from Medium to blogs.tensult.com. All the latest content will be available there. Subscribe to our newsletter to stay updated.
One of our customers were using dc2.large cluster with 14 nodes in North Virginia (us-east-1) region and they had following two challenges:
- Cost: $3.75 per hour for 14 nodes.
- Space: ds2.large has only 160 GB SSD so it gets filled quickly
Keeping in mind these two challenges, we suggested trying ds2.xlarge 2 node cluster which is a dense storage cluster where each node has 2 TB HDD space.
The new cluster is cheaper, just 1.7 $ per hour and provides more storage space so sufficient for quite some time.
The queries on ds2.xlarge cluster started taking almost double time but our customer is okay so we started migrating to the new cluster.
Creating ds2.xlarge cluster from the existing cluster
- Take snapshot from the existing dc2.large cluster
2. Create a new dc2.large cluster from the snapshot
We can’t directly create a ds2.xlarge cluster from this snapshot as it was taken from dc2.large cluster so first, we need to create another dc2.large cluster and then resize it using classic resize option.
3. Resizing the newly created dc2.large to ds2.xlarge cluster
We can only use Classic resize to change the Node type and during this time the cluster becomes Read-only.
4. Once the cluster new ds2.xlarge is ready, we need to do the performance and functionality testing.
Data migration
Now we have tested the functionality and performance on the newly created ds2.xlarge cluster and ready to do migrate to this cluster.
During the testing time, there were some data got ingested to the dc2.large cluster so we had to migrate that data to the dc2.xlarge cluster. We had all the data present in S3 so we wrote a script using RedShift COPY command to pull the data from S3 to the ds2.xlarge cluster.
Steps:
- Determine the reference point
We need to first identify the point at which both clusters differed, this was not a very easy task with large amounts of the data.
RedShift stores the history of copy commands executed in stl_load_commits and the following query helped us quite a lot.
SELECT query, trim(filename) AS file, curtime AS updated
FROM stl_load_commits ORDER BY file DESC LIMIT 5
This query only worked on the old cluster and on the new cluster it gave empty results so we compared the data present in Redshift tables and the data present in S3 using timestamps to determine the reference point for our data migration.
2. Once we identified the reference point we realized that there are many files to migrate from S3 so we created the manifest files first and then copied the data using the manifest files.
We used the following code to create one manifest per hour of the data.
https://github.com/tensult/aws-automation/blob/master/create-redshift-manifest-file-from-s3.js
This will generate manifest files in the local directory which were later uploaded to S3 bucket using AWS CLI s3 sync command before executing the COPY command on Redshift cluster.
The first few manifest files generated might have few source files which are already copied to the new cluster so please make sure to remove them from manifest file to avoid data duplication using the reference point.
3. Now we need to execute RedShift COPY commands with these manifest files as input and we used followed code to do the same.
https://github.com/tensult/aws-automation/blob/master/redshift-bulk-copy-from-s3.js
You may need to make modifications to “COPY command” section in the above code based on your data.
Based on the amount of data this might take some time so be patient and keep monitoring RedShift COPY command executions from AWS console.
Once the initial data migration is completed, we can start using the new cluster by changing endpoints in all the places where the old dc2.large cluster used.
4. During this data migration, there might be some data which might have got ingested to the old cluster so after we update the endpoints we need a final data migration by comparing stl_load_commits table on the new cluster with S3 bucket.
5. Now the migration to ds2.xlarge from dc2.large cluster is completed but keep monitoring the cluster for some more time.
Conclusion
In this blog, we discussed various steps involving migrating/resizing RedShift cluster. I hope this is useful to you and your work. For more such blogs, kindly follow me here and also feel free to share your thoughts about the blogs in the comments section below.