Data Migration and Retrieval with AWS

Manush Parikh
Published in ScaleCapacity
Jul 9, 2020 · 9 min read

Migrations to AWS include moving any workload from an on-premises environment, hosting facility, or other public cloud. AWS is working with thousands of organizations to migrate workloads such as applications, websites, databases, storage, physical or virtual servers, or entire data centers.

AWS Architecture

Overview:

In this walkthrough, we migrate an on-premises MySQL database to Amazon S3 using AWS DMS, archive that data to S3 Glacier with a Lambda function, and then retrieve the data from the archive with another Lambda function and store it back in S3.

Data Migration (on-premises to AWS Cloud)

You can use AWS Database Migration Service (AWS DMS) to migrate data from various sources to the most widely used commercial and open-source databases. AWS DMS also supports Amazon S3 as a source and as a target for migrations. When you use Amazon S3 as a target, you can use AWS DMS to extract information from any database that is supported by AWS DMS. You can then write it to Amazon S3 in CSV format, which can be used by almost any application.

Services used:

The services that we will discuss in this article are —

  1. Amazon RDS
  2. Database Migration Service
  3. Amazon S3
  4. AWS Lambda
  5. Amazon CloudWatch Logs

Amazon RDS

Amazon Relational Database Service is a distributed relational database service by Amazon Web Services. It is a web service running "in the cloud" designed to simplify the setup, operation, and scaling of a relational database for use in applications.

Database Migration Service

AWS Database Migration Service helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. The AWS Database Migration Service can migrate your data to and from the most widely used commercial and open-source databases.

Amazon S3

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network.

AWS Lambda

AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code.

CloudWatch Logs

CloudWatch Logs enables you to centralize the logs from all of your systems, applications, and AWS services that you use, in a single, highly scalable service. You can then easily view them, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis.

BASIC STEPS:

Step 1: Create IAM roles as follows:

  • Lambda execute: should have permissions to run a Glacier Select query and to put the resulting objects in S3.
  • LambdaToGlacier: should have access to CloudWatch, S3, and S3 Glacier.
  • DMSToS3: should have permissions to put objects in S3.
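
If you prefer to script the role setup, a minimal boto3 sketch for the LambdaToGlacier role might look like the following. The AWS managed policies attached here are broad placeholders; in practice you would scope them down to the specific bucket and vault.

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets Lambda assume the role.
lambda_trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(RoleName='LambdaToGlacier',
                AssumeRolePolicyDocument=json.dumps(lambda_trust))

# Attach managed policies covering CloudWatch Logs, S3, and S3 Glacier.
for policy_arn in ['arn:aws:iam::aws:policy/CloudWatchLogsFullAccess',
                   'arn:aws:iam::aws:policy/AmazonS3FullAccess',
                   'arn:aws:iam::aws:policy/AmazonGlacierFullAccess']:
    iam.attach_role_policy(RoleName='LambdaToGlacier', PolicyArn=policy_arn)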

Step 2: IAM user policy

1. Create an IAM User with the following policies:

  • DMS VPC Role
  • DMS Hub Creation
  • Lambda Function Full access
  • RDS Full access
  • S3 Full access
  • S3 Glacier Full access
  • CloudWatch Logs Role
  • IAM Permission (GitHub)

Step 3: Create an S3 glacier vault:

  1. Sign in to the AWS Management Console, and open the Amazon Glacier console. Create a vault with the name rdsarchive in the us-west-2 region. Choose Do not enable notifications, and then choose Submit.
  2. Note the name and Amazon Resource Name (ARN) of the vault.
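
For reference, the same vault can also be created with boto3; the vault name and region match the step above:

import boto3

glacier = boto3.client('glacier', region_name='us-west-2')

# Create the vault, then look up its ARN for later use.
glacier.create_vault(vaultName='rdsarchive')
vault = glacier.describe_vault(vaultName='rdsarchive')
print(vault['VaultARN'])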

Step 4: Create an RDS Instance:

  1. Create a parameter group and attach the RDS instance to it. The group should set the following parameters:

  • binlog_format = ROW
  • binlog_checksum = NONE

2. Create an RDS instance running MySQL.

  • The instance should be publicly accessible so that you can connect to it with a MySQL client.
  • Choose the parameter group that you created.

3. Set the binary log retention period by running this command from the MySQL client:

call mysql.rds_set_configuration('binlog retention hours', 24);
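
If you prefer to script the parameter group instead of using the console, a boto3 sketch might look like this. The group name and the mysql5.7 family are assumptions; use the family that matches your RDS engine version.

import boto3

rds = boto3.client('rds')

# Create a custom parameter group for the MySQL engine family.
rds.create_db_parameter_group(
    DBParameterGroupName='dms-binlog-params',
    DBParameterGroupFamily='mysql5.7',
    Description='Binary logging settings for DMS',
)

# Set the two parameters required by this walkthrough.
rds.modify_db_parameter_group(
    DBParameterGroupName='dms-binlog-params',
    Parameters=[
        {'ParameterName': 'binlog_format', 'ParameterValue': 'ROW',
         'ApplyMethod': 'immediate'},
        {'ParameterName': 'binlog_checksum', 'ParameterValue': 'NONE',
         'ApplyMethod': 'immediate'},
    ],
)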

Step 5: Create a folder in the S3 bucket:

Create the following folder under the bucket:

  1. An output folder, which acts as the target where Amazon Glacier Select writes the data retrieved from the archive.
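
An S3 folder is just a key prefix, so the output folder can also be created by putting a zero-byte object with a trailing slash. The bucket name rdstoglacier is the one used later in this walkthrough, and output/ is an example prefix.

import boto3

s3 = boto3.client('s3')

# Creating an empty object with a trailing slash makes the "folder" visible in the console.
s3.put_object(Bucket='rdstoglacier', Key='output/')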

Step 6: Create a Lambda function to upload the data to S3 Glacier

  1. On the AWS Lambda console, create the Lambda function as follows:
  • Use the blueprint s3-get-object-python with the Python 2.7 runtime.
  • For the function name, type S3toglacier.
  • Choose the existing LambdaToGlacier role for this function.
  • Type the bucket name.
  • For the event type, choose Object Created (All).
  • For Prefix, specify the S3 folder whose new objects should trigger the function.
  • Select the Enable trigger checkbox.
  • Create the function.
  • After it is created, edit the function.
  • Replace all the code in the window with the Python code from the GitHub repository.

2. In the Environment variables section, add a variable for the vault name and set its value to the name of the vault that you created in Step 3.

3. Increase the Lambda timeout to 5 minutes because the upload of the archive might not complete within the default timeout settings.

Save the changes.

The preceding code fetches the S3 object, downloads it to a Lambda instance, and uploads it to the Amazon Glacier vault. This function takes the Amazon Glacier vault name through the environment variables.
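
The GitHub code is not reproduced here, but a minimal sketch of such a handler, assuming an environment variable named vaultname, might look like this:

import os
import boto3

s3 = boto3.client('s3')
glacier = boto3.client('glacier')

VAULT_NAME = os.environ['vaultname']  # assumed environment variable name

def lambda_handler(event, context):
    # The S3 trigger passes the bucket and key of the newly created object.
    # Keys containing special characters may need URL-decoding first.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Download the CSV that DMS wrote to S3 into the Lambda's temporary storage.
    local_path = '/tmp/' + key.split('/')[-1]
    s3.download_file(bucket, key, local_path)

    # Upload the file to the Glacier vault as an archive and log the archive ID.
    with open(local_path, 'rb') as f:
        response = glacier.upload_archive(vaultName=VAULT_NAME, body=f)
    print('archiveId: ' + response['archiveId'])
    return response['archiveId']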

Step 7: Create a DMS replication instance.

  • Create a DMS instance in the same VPC in which the RDS instance was created.
  • Use the default security group for the instance.
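
The replication instance can also be created from the SDK. A minimal sketch, with a placeholder identifier and instance class, might look like this (a replication subnet group in the RDS instance's VPC is assumed to exist):

import boto3

dms = boto3.client('dms')

# Launch a small replication instance; add VpcSecurityGroupIds and
# ReplicationSubnetGroupIdentifier if you do not want the defaults.
dms.create_replication_instance(
    ReplicationInstanceIdentifier='rds-to-s3-instance',
    ReplicationInstanceClass='dms.t2.medium',
    AllocatedStorage=50,
    PubliclyAccessible=False,
)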

Step 8: Create Endpoints for the DMS replication instance.

  • Open the AWS DMS console, and choose Endpoints in the navigation pane. Create a source endpoint for the MySQL database using the RDS instance that you created in Step 4.
  • Make sure that the database name, port, and user information are correct.
  • Choose Run test and test the endpoint. If the test succeeds, be sure to save the endpoint. If the test fails, verify the endpoint and security group configurations.
  • Choose Refresh schema after successful testing.
  • Create a target endpoint for the S3 connection.
  • Specify the role ARN of dms-vpc-role as arn:aws:iam::<AWS Account>:role/dms-vpc-role.
  • Specify the Bucket name and Bucket folder that you created.
  • Add the following extra connection attributes for full load and CDC:
includeOpForFullLoad=true;cdcInsertsAndUpdates=true;
  • Choose Run test and test the endpoint. If the test succeeds, be sure to save the endpoint. If the test fails, verify the role configuration and the S3 bucket name.
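
Both endpoints can also be created with boto3. The hostname, credentials, and bucket folder below are placeholders; the extra connection attributes are the same ones shown above.

import boto3

dms = boto3.client('dms')

# Source endpoint for the RDS MySQL instance.
dms.create_endpoint(
    EndpointIdentifier='mysql-source',
    EndpointType='source',
    EngineName='mysql',
    ServerName='mydb.xxxxxxxx.us-west-2.rds.amazonaws.com',
    Port=3306,
    Username='admin',
    Password='********',
)

# Target endpoint that writes CSV files to the S3 bucket folder.
dms.create_endpoint(
    EndpointIdentifier='s3-target',
    EndpointType='target',
    EngineName='s3',
    ExtraConnectionAttributes='includeOpForFullLoad=true;cdcInsertsAndUpdates=true;',
    S3Settings={
        'ServiceAccessRoleArn': 'arn:aws:iam::<AWS Account>:role/dms-vpc-role',
        'BucketName': 'rdstoglacier',
        'BucketFolder': 'database',
    },
)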

Step 9: Create a DMS task to extract data from RDS to S3.

  1. Open the AWS DMS console. Choose Tasks in the navigation pane, and then choose Create task.
  2. For the task name, use archivetos3. Use the replication instance from Step 7 and the endpoints that you created in Step 8.
  3. For Migration type, choose Migrate existing data.
  4. Accept the default values in the Task Settings.

5. In the Table mappings section, choose public for the Schema field. For Table name, choose the table that you want to import.

6. AWS DMS provides an option to add column filters, which enable you to extract the data that meets your archive criteria. Choose Add selection rule.

7. Choose Create task. The migration task then runs because the Start task on create box is selected. The task status changes from Creating to Starting, then to Running, and finally to Load Complete.

8. You can verify that AWS DMS extracts the subset of information from the table as a CSV file and stores it in the S3 path that you specified.
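
A scripted version of the same task, with placeholder ARNs and an example selection rule for a single table in the public schema, might look like this:

import json
import boto3

dms = boto3.client('dms')

# Selection rule equivalent to choosing the schema and table in the console.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "1",
        "object-locator": {"schema-name": "public", "table-name": "mytable"},
        "rule-action": "include",
    }]
}

# "Migrate existing data" corresponds to the full-load migration type.
dms.create_replication_task(
    ReplicationTaskIdentifier='archivetos3',
    SourceEndpointArn='arn:aws:dms:us-west-2:<AWS Account>:endpoint:SOURCE',
    TargetEndpointArn='arn:aws:dms:us-west-2:<AWS Account>:endpoint:TARGET',
    ReplicationInstanceArn='arn:aws:dms:us-west-2:<AWS Account>:rep:INSTANCE',
    MigrationType='full-load',
    TableMappings=json.dumps(table_mappings),
)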

Step 10: Check whether the file was uploaded to the S3 bucket.

  1. Go to the Amazon S3 console, and choose the rdstoglacier bucket. Choose the key (folder) named database that you selected. Choose the key public, which is the schema name, and then choose the key for the table that you migrated.
  2. You will see an object named LOAD00000001.csv. You can download this file and examine it with a text editor if you want.

Step 11: Check whether the Lambda function was invoked for the upload process.

  1. Go to the AWS Lambda console and choose the S3toglacier function. Choose the Monitoring tab, and check whether the function was invoked by the S3 trigger configured on the bucket.
  2. Choose Jump to Logs to go to CloudWatch Logs for the Lambda function. Note the archive ID (the value of u'archiveId') from the Amazon Glacier upload operation in the log. You need this for a later step.

Step 12: Create a Lambda function for retrieving data from Glacier vault.

Create a Lambda function as follows:

  1. In the AWS Lambda console, choose Create function, and then choose Author from scratch.
  2. Type glacierquery for the function name.
  3. Choose the Python 2.7 runtime.
  4. Choose the existing Lambda execute role for this function.
  5. Create the function.
  6. After it is created, edit the function.
  7. Replace all the code in the window with the Python code from the GitHub repository.

In the Environment variables section, add the following variables. You must provide the correct values, such as the bucket name, the prefix, and the archive ID that you noted in Step 11.

8. Set the Lambda timeout value to 5 minutes.

9. Save the function.
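
The GitHub code for this function is not reproduced here either; a minimal sketch of a Glacier Select retrieval handler, assuming environment variables named vaultname, archiveid, bucketname, and prefix, might look like this:

import os
import boto3

glacier = boto3.client('glacier')

def lambda_handler(event, context):
    # All inputs come from environment variables, as described above.
    vault_name = os.environ['vaultname']
    archive_id = os.environ['archiveid']
    bucket = os.environ['bucketname']
    prefix = os.environ['prefix']  # for example, output/

    # Start a Glacier Select job that runs a SQL expression against the archived CSV
    # and writes the matching rows to the S3 output prefix.
    job = glacier.initiate_job(
        vaultName=vault_name,
        jobParameters={
            'Type': 'select',
            'ArchiveId': archive_id,
            'Tier': 'Standard',
            'SelectParameters': {
                'InputSerialization': {'csv': {'FileHeaderInfo': 'NONE'}},
                'ExpressionType': 'SQL',
                'Expression': 'SELECT * FROM archive',
                'OutputSerialization': {'csv': {}},
            },
            'OutputLocation': {
                'S3': {'BucketName': bucket, 'Prefix': prefix, 'CannedACL': 'private'},
            },
        },
    )
    print('jobId: ' + job['jobId'])
    return job['jobId']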

Invoke the Lambda function glacierquery manually from the command line, or test it using a dummy "Hello World" test event.

Test the Lambda function glacierquery using that test event. You can see the output in the Lambda console, or alternatively in the Lambda logs.

Step 13: View the results in the S3 bucket.

The output folder contains sample data retrieved from Amazon Glacier using Amazon Glacier Select.

The S3 Glacier Select query takes 3–5 hours (the Standard retrieval tier) to complete and write its results to the destination folder.
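
Once the job completes, you can list the result objects under the output prefix, for example:

import boto3

s3 = boto3.client('s3')

# Glacier Select writes its results under the output prefix of the bucket.
response = s3.list_objects_v2(Bucket='rdstoglacier', Prefix='output/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])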

Conclusion

AWS provides a portfolio of data transfer services to provide the right solution for any data migration project. The level of connectivity is a major factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.
