Build a Data Warehouse and Pipelines on AWS

A solution for streaming data from multiple sources to a centralized Data Warehouse on AWS to aid simplified business intelligence reporting.

Olawale Olaleye
CloudAdventure
4 min read · Oct 6, 2019


Data pipeline to Redshift

Let's say you have multiple data sources on AWS: in this article, a DynamoDB table, a MySQL database on RDS, and an S3 bucket. You can aggregate all of these sources into one centralized location by implementing your own Data Warehouse using AWS services.

Worried about the complexity of designing and implementing a reliable, dependable Data Warehouse? And on top of that, your data are scattered everywhere and you don't know how to ship them into one place. Don't worry my friend 🤗, Amazon has taken care of that headache for us.

The AWS solutions:

  1. Amazon Redshift! You can start with the two-month free trial.
  2. AWS Data Pipeline

Launching a Redshift Cluster

  1. First, create an IAM role for Redshift. Go to https://console.aws.amazon.com/iam/.
  2. Choose Create role. In the AWS Service group, choose Redshift.
  3. Under Select your use case, choose Redshift — Customizable then choose Next: Permissions.
  4. On the Attach permissions policies page, choose AmazonS3ReadOnlyAccess. You can leave the default setting for Set permissions boundary. Then choose Next: Tags.
  5. The Add tags page appears. You can optionally add tags. Choose Next: Review.
  6. For Role name, type a name for your role, e.g. RedshiftRole, then choose Create role.
  7. Choose the name of the role you just created. Copy the Role ARN to your clipboard and save it somewhere.
  8. To start your free Redshift trial, sign in to https://console.aws.amazon.com/redshift/. Note: you are eligible for the free trial if your organization has not created an Amazon Redshift cluster since July 1st, 2014.
  9. Launch an Amazon Redshift cluster and select dc2.large as the node type, with the settings below:
Setup AWS Redshift
  • Node type: Choose dc2.large.
  • Number of compute nodes: Keep the default value of 2.
  • Cluster identifier: Enter the value demo.
  • Master user name: Keep the default value of demo.
  • Master user password and Confirm password: Enter a password for the master user account.
  • Database port: Accept the default value of 5439.
  • Available IAM roles: Choose RedshiftRole
  • Choose a VPC: Accept the default
  • Availability zone: Accept the default
  • Cluster subnet group: default
  • Publicly accessible: Yes
  • Follow the on-screen dialog to the end (a scripted version of these steps is sketched just below)
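
If you would rather script these steps than click through the console, the same flow can be expressed with boto3. This is a minimal sketch, assuming the RedshiftRole role name and the demo cluster settings from the list above; the password is a placeholder you should replace.

```python
import json
import boto3

iam = boto3.client("iam")
redshift = boto3.client("redshift")

# 1. Create an IAM role that Redshift can assume, with read-only access to S3
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
role = iam.create_role(
    RoleName="RedshiftRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="RedshiftRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
role_arn = role["Role"]["Arn"]  # save this ARN; the cluster and later COPY loads need it

# 2. Launch the dc2.large cluster with the settings listed above
redshift.create_cluster(
    ClusterIdentifier="demo",
    NodeType="dc2.large",
    NumberOfNodes=2,
    MasterUsername="demo",
    MasterUserPassword="ChangeMe1234",  # placeholder, use your own strong password
    Port=5439,
    PubliclyAccessible=True,
    IamRoles=[role_arn],
)
```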

Connect to your Redshift cluster using the Query Editor

Query Editor
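
The Query Editor in the console is the quickest way to run your first queries against the new cluster. If you later want to query it from code, any PostgreSQL driver works because Redshift speaks the Postgres wire protocol. Here is a minimal psycopg2 sketch, assuming the demo cluster created above; the endpoint host and password are placeholders.

```python
import psycopg2

# Connect to the cluster endpoint shown on the cluster's detail page
conn = psycopg2.connect(
    host="demo.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="dev",
    user="demo",
    password="ChangeMe1234",  # placeholder
)

with conn.cursor() as cur:
    cur.execute("SELECT current_database(), current_user;")
    print(cur.fetchone())

conn.close()
```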

AWS Data Pipeline

This service lets you move data from sources such as an S3 bucket, a MySQL table on Amazon RDS, and an Amazon DynamoDB table. For DynamoDB, you will need to export the data to an S3 bucket first.

AWS Data Pipeline also allows you to schedule data exports to your Redshift cluster.

To set up a data pipeline for these data sources, go to https://console.aws.amazon.com/datapipeline/ and follow the steps in the images below.
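
The console templates (exporting a DynamoDB table to S3, loading S3 data into Redshift, copying an RDS MySQL table to Redshift) generate the pipeline definition for you. If you want to inspect or activate a template-built pipeline from code instead, a minimal boto3 sketch might look like this; the pipeline id is a placeholder for whichever pipeline you created in the console.

```python
import boto3

dp = boto3.client("datapipeline")

# List pipelines created from the console templates and pick the one to work with
for summary in dp.list_pipelines()["pipelineIdList"]:
    print(summary["id"], summary["name"])

pipeline_id = "df-0123456789ABCDEFGHIJ"  # placeholder: one of the ids printed above

# Inspect the definition the template generated (data nodes, EMR resource,
# copy activities, schedule, ...) before activating it
definition = dp.get_pipeline_definition(pipelineId=pipeline_id)
for obj in definition["pipelineObjects"]:
    print(obj["id"], obj["name"])

# Activate the pipeline once the definition looks right
dp.activate_pipeline(pipelineId=pipeline_id)
```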

Export DynamoDB to S3 Bucket

Data Pipeline launches an AWS EMR cluster, which actually performs the export operation. Be sure to edit the resources used by EMR (click Edit Architecture) and specify the subnet and instance sizes EMR should launch before activating the pipeline.
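
If you prefer to make that change programmatically rather than through Edit Architecture, the EMR resource appears in the pipeline definition as an EmrCluster object. The sketch below is assumption-heavy: it presumes the template exposes the usual EmrCluster fields (subnetId, masterInstanceType, coreInstanceType) and simply rewrites them before saving the definition back; the id, subnet and instance type values are placeholders.

```python
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEFGHIJ"  # placeholder: your export pipeline's id

definition = dp.get_pipeline_definition(pipelineId=pipeline_id)

# Overrides for the EmrCluster resource the template launches; the field names
# assume the standard export template and may differ in yours.
overrides = {
    "subnetId": "subnet-0123456789abcdef0",   # placeholder subnet
    "masterInstanceType": "m5.xlarge",        # placeholder instance sizes
    "coreInstanceType": "m5.xlarge",
}

for obj in definition["pipelineObjects"]:
    for field in obj["fields"]:
        if field.get("key") in overrides:
            field["stringValue"] = overrides[field["key"]]

# Save the edited definition back before activating the pipeline
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=definition["pipelineObjects"],
    parameterObjects=definition.get("parameterObjects", []),
    parameterValues=definition.get("parameterValues", []),
)
```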

edit EMR architecture
EMR status 1
EMR status 2
Load S3 data to existing Redshift
Export RDS MySQL table to existing Redshift
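
Under the hood, loading data from S3 into Redshift comes down to a COPY command that uses the RedshiftRole created earlier. If you want to run such a load manually instead of (or before) scheduling it with Data Pipeline, a sketch like the following works; the endpoint, table name, bucket path, role ARN and format options are all placeholders that depend on your data.

```python
import psycopg2

# Placeholder endpoint and credentials for the demo cluster created earlier
conn = psycopg2.connect(
    host="demo.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="demo",
    password="ChangeMe1234",
)
conn.autocommit = True

# COPY reads the files straight from S3 using the RedshiftRole ARN saved earlier.
# The FORMAT/IGNOREHEADER options assume CSV files with a header row.
copy_sql = """
    COPY public.orders
    FROM 's3://my-export-bucket/exports/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)

conn.close()
```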

That concludes it. 🤝 Happy reading.
