AWS Data Pipeline: Copy from DynamoDB Table to S3 Bucket

Girish V P
Tensult Blogs
Aug 13, 2018

This blog has moved from Medium to blogs.tensult.com. All the latest content will be available there. Subscribe to our newsletter to stay updated.

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. It can copy data between S3 and DynamoDB, and to and from RDS MySQL, S3, and Redshift. AWS Data Pipeline can also copy this data from one AWS Region to another. Suppose you want a copy of the data stored in a regional service such as DynamoDB in an S3 bucket in a different AWS Region: yes, AWS Data Pipeline can do that. Let us go through an experiment in which a DynamoDB table's contents are copied to an S3 bucket in a different Region. At the time of writing, AWS Data Pipeline is available in the Northern Virginia, Oregon, Ireland, and Tokyo Regions.

How does it work?

We are going to copy the contents of a DynamoDB table to an S3 bucket. Behind the scenes, AWS Data Pipeline triggers an action that launches an EMR cluster with multiple EC2 instances; the administrator need not be aware of this EMR cluster. The EMR cluster reads the data from DynamoDB and writes it to the S3 bucket.
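
To make that flow concrete, here is a rough sketch of the kind of pipeline definition involved, modeled on the "Export DynamoDB table to S3" template from the AWS documentation. The table name, bucket paths, instance types, throughput percentage, and the EMR step JAR path are illustrative assumptions; the console generates the exact definition for you, so treat this only as a map of how the DynamoDB data node, the S3 data node, the EMR cluster, and the activity fit together.

```python
# Console-style pipeline definition, written to JSON so it could be used with
# "aws datapipeline put-pipeline-definition --pipeline-definition file://definition.json".
# All concrete values below are illustrative assumptions.
import json

PIPELINE_OBJECTS = [
    {
        "id": "Default",
        "name": "Default",
        "scheduleType": "ondemand",
        "failureAndRerunMode": "CASCADE",
        "pipelineLogUri": "s3://logdynamodbs3/logs/",
        "role": "DataPipelineDefaultRole",
        "resourceRole": "DataPipelineDefaultResourceRole",
    },
    {   # Source: the DynamoDB table to export.
        "id": "DDBSourceTable",
        "name": "DDBSourceTable",
        "type": "DynamoDBDataNode",
        "tableName": "emp",
        "readThroughputPercent": "0.25",
    },
    {   # Destination: a timestamped folder in the S3 bucket.
        "id": "S3BackupLocation",
        "name": "S3BackupLocation",
        "type": "S3DataNode",
        "directoryPath": "s3://myddbs3/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
    },
    {   # The EMR cluster that Data Pipeline launches on your behalf.
        "id": "EmrClusterForBackup",
        "name": "EmrClusterForBackup",
        "type": "EmrCluster",
        "masterInstanceType": "m3.xlarge",
        "coreInstanceType": "m3.xlarge",
        "coreInstanceCount": "1",
        "region": "us-east-1",
        "terminateAfter": "2 Hours",
    },
    {   # The activity wiring source, destination, and cluster together.
        "id": "TableBackupActivity",
        "name": "TableBackupActivity",
        "type": "EmrActivity",
        "input": {"ref": "DDBSourceTable"},
        "output": {"ref": "S3BackupLocation"},
        "runsOn": {"ref": "EmrClusterForBackup"},
        "step": "s3://dynamodb-emr-us-east-1/emr-ddb-storage-handler/2.1.0/"
                "emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,"
                "#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
    },
]

with open("definition.json", "w") as f:
    json.dump({"objects": PIPELINE_OBJECTS}, f, indent=2)
```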

About the Experiment

I created a DynamoDB table called "emp" in the North Virginia (us-east-1) Region, an S3 bucket "myddbs3" in the EU (Ireland) Region as the destination for the copied data, and a bucket "logdynamodbs3" in North Virginia for logging. You will see all of these in the following steps.

Configuration

1. Create a DynamoDB table with some test data. I created the "emp" table in the North Virginia Region.
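
If you want to script this step, here is a minimal boto3 sketch; the "empid" key schema and the sample items are my assumptions for test data.

```python
# Create a small test table in us-east-1 and load a couple of records.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

table = dynamodb.create_table(
    TableName="emp",
    KeySchema=[{"AttributeName": "empid", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "empid", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# Some test data; the attributes are illustrative.
with table.batch_writer() as batch:
    batch.put_item(Item={"empid": "1001", "name": "Alice", "dept": "Engineering"})
    batch.put_item(Item={"empid": "1002", "name": "Bob", "dept": "Finance"})
```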

2. Create an S3 bucket into which the DynamoDB table's data will be copied. I created one named "myddbs3" in the EU (Ireland) Region.

3. Create an S3 bucket for logging. I created one named "logdynamodbs3" in the North Virginia Region. Both bucket creations are sketched below.
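
Both buckets can be created with a few lines of boto3. One detail worth noting: a bucket outside us-east-1 needs a LocationConstraint, while us-east-1 does not take one.

```python
import boto3

# Destination bucket in EU (Ireland).
s3_eu = boto3.client("s3", region_name="eu-west-1")
s3_eu.create_bucket(
    Bucket="myddbs3",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Logging bucket in North Virginia (no LocationConstraint for us-east-1).
s3_us = boto3.client("s3", region_name="us-east-1")
s3_us.create_bucket(Bucket="logdynamodbs3")
```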

4. Open the Data Pipeline console from the AWS web console and create a pipeline. Fill in all the required information (you can customize it based on your needs). Click "Activate" and your configuration is ready. It will take around 10 minutes for you to get the final result.
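
If you prefer the API to the console, the following sketch creates, defines, and activates the pipeline with boto3. It assumes the PIPELINE_OBJECTS list from the earlier sketch and converts its console-style keys into the field format the Data Pipeline API expects; the pipeline name and uniqueId are illustrative.

```python
import boto3

datapipeline = boto3.client("datapipeline", region_name="us-east-1")

def to_api_objects(flat_objects):
    """Convert console-style JSON objects into the boto3 field format."""
    api_objects = []
    for obj in flat_objects:
        fields = []
        for key, value in obj.items():
            if key in ("id", "name"):
                continue
            if isinstance(value, dict) and "ref" in value:
                fields.append({"key": key, "refValue": value["ref"]})
            else:
                fields.append({"key": key, "stringValue": str(value)})
        api_objects.append({"id": obj["id"], "name": obj["name"], "fields": fields})
    return api_objects

resp = datapipeline.create_pipeline(name="DynamoDBToS3Copy", uniqueId="ddb-to-s3-demo")
pipeline_id = resp["pipelineId"]

datapipeline.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=to_api_objects(PIPELINE_OBJECTS),  # from the earlier sketch
)
datapipeline.activate_pipeline(pipelineId=pipeline_id)
```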

Monitoring and Testing

1. In the pipeline list, you will first see the status "WAITING_ON_DEPENDENCIES".

2. After a few seconds, the status changes to "WAITING_FOR_RUNNER". At this stage, the pipeline is waiting for the EMR cluster to be initialized.

3. After a few minutes, the status changes to "RUNNING".
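
You can watch the same transitions from the API as well. A hedged sketch, assuming the pipeline_id from the activation sketch and that each runtime instance reports its state in an "@status" field:

```python
import time
import boto3

datapipeline = boto3.client("datapipeline", region_name="us-east-1")

def print_instance_statuses(pipeline_id):
    # Ask for the runtime instances of the pipeline's objects.
    ids = datapipeline.query_objects(pipelineId=pipeline_id, sphere="INSTANCE")["ids"]
    if not ids:
        return
    for obj in datapipeline.describe_objects(
        pipelineId=pipeline_id, objectIds=ids
    )["pipelineObjects"]:
        # Pull the "@status" field out of the object's field list.
        status = next(
            (f["stringValue"] for f in obj["fields"] if f.get("key") == "@status"),
            "UNKNOWN",
        )
        print(obj["name"], status)

# Poll every 30 seconds while the export runs.
for _ in range(20):
    print_instance_statuses(pipeline_id)  # pipeline_id from the earlier sketch
    time.sleep(30)
```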

4. At this stage, if you open the EC2 console and select the North Virginia Region, you can see two new instances created automatically. These belong to the EMR cluster launched by the pipeline.
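
You can also confirm the cluster through the EMR API instead of eyeballing EC2; this small sketch simply lists clusters that are starting or running in the Region.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

clusters = emr.list_clusters(ClusterStates=["STARTING", "BOOTSTRAPPING", "RUNNING"])
for cluster in clusters["Clusters"]:
    print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])
```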

5. After the pipeline finishes, you can access the S3 bucket and find a file containing the DynamoDB table's contents. Download it and open it in a text editor.
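
If you would rather fetch the output programmatically, this sketch lists the destination bucket and downloads whatever the export produced. The exact key layout (a timestamped folder, in my runs) depends on the directoryPath in the pipeline definition.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Walk every object in the bucket and download it locally.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="myddbs3"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip folder markers
            continue
        local_name = key.replace("/", "_")
        s3.download_file("myddbs3", key, local_name)
        print("downloaded", key)
```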

Conclusion

Data redundancy is very important, and storing data in different Regions in a cost-effective manner is a good strategy. AWS Data Pipeline is one service with which data can be copied between databases and storage in different Regions. In the above experiment, we configured AWS Data Pipeline to copy data from a DynamoDB table to an S3 bucket across AWS Regions.
