Amazon Redshift: Data warehouse in the cloud

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is a relational database system built on PostgreSQL, designed and optimized to run online analytical processing (OLAP) queries efficiently over petabytes of data.

So what is OLAP?

OLAP (online analytical processing) is a computing approach that lets us extract and query data so it can be analyzed from different points of view, turning huge volumes of data into useful information. It is very important in Business Intelligence (BI).
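
For example, a typical OLAP-style query aggregates a large fact table along several dimensions at once. A minimal sketch; the sales table and its columns here are hypothetical, not part of this setup:

SELECT region, product_category, SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2020-01-01' AND '2020-12-31'
GROUP BY region, product_category;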

So the question arises: why "Redshift" when AWS already has the Relational Database Service (RDS)?

With RDS we can easily store huge amounts of relational data, but when we try to analyze that data with complex queries, RDS takes a long time and in the meantime also slows down other queries. So AWS came up with Redshift, where we can store huge amounts of data and also run the analysis at a much faster rate.

So how does it actually work?

The heart of each Redshift deployment is a cluster. Each cluster has one leader node and one or more compute nodes, all connected by high-speed links. There are two types of nodes, each with different CPU, RAM, and storage characteristics:
  1. Dense Compute Node
  2. Dense Storage Node
Dense Compute (DC) nodes speed up query execution with less storage and are best for high-performance workloads. They use SSD drives.
Dense Storage (DS) nodes are used for storing and querying big data, using conventional hard disk drives.

Now let's understand how to shift data from RDS to Redshift to perform analysis.

Here are the steps:

  1. Create a Redshift cluster
  2. Export a MySQL database and split it into multiple files
  3. Upload the files to Amazon S3
  4. Run a COPY command to load the table to Redshift
  5. Verify that the data was loaded correctly
Step 1: Creating a Redshift cluster can be done from the AWS console by following a few simple steps.
Fig: launching Redshift cluster
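
If you prefer the command line over the console, a cluster can also be launched with the AWS CLI. A minimal sketch; the cluster identifier, node type, and credentials below are placeholders, not values from this setup:

$ aws redshift create-cluster --cluster-identifier my-redshift-cluster --node-type dc2.large --number-of-nodes 2 --master-username awsuser --master-user-password 'YourPassword1'
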
Step 2: Export the MySQL database and split it into multiple files (if we try to dump all the data into a single CSV file it may exhaust system memory, so it is advisable to split it into multiple CSVs).

Here we export our relational data into multiple CSVs.

Example: if a table has 1 crore (10 million) rows in RDS, then we need to create 10 CSV files, each with approximately 10 lakh (1 million) rows.

Command to do this:

mysql -u username -p --database=database-name --host=hostname --batch -e "sql query" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > name_of_csvfile.csv

example:

$ mysql -u username -p --database=mydb --host=test-db.amazonaws.com --batch -e "select * from table_name limit 0, 1000000" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > csv_file_1.csv

Explanation:

select * from table_name limit 0, 1000000
This fetches rows 0 to 1,000,000 (10 lakh). In MySQL's "limit offset, count" form, the offset here is 0 and the row count is 1000000; the next file would use "limit 1000000, 1000000", and so on.
sed stands for stream editor; it performs operations on a text stream such as search, find and replace, insert, or delete.
Here it replaces every tab with "," and wraps each line in double quotes, turning MySQL's tab-separated batch output into a quoted CSV.
  • -u => username
  • -p => prompt for the MySQL password
  • --database => database name
  • --host => hostname
  • (>) this redirection operator stores the output of the command on its left into the file on its right, i.e. into a CSV file in our case.
Note: store these multiple CSVs in one folder, because that makes it easy to move the files to S3 in step 3.
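
Putting the export together, a small shell loop can produce all 10 files by shifting the offset for each chunk (database, table, and host names are the same placeholders used above; note that -p will prompt for the MySQL password on every iteration):

$ mkdir -p csv-folder
$ for i in $(seq 0 9); do
    offset=$((i * 1000000))
    mysql -u username -p --database=mydb --host=test-db.amazonaws.com --batch -e "select * from table_name limit $offset, 1000000" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > csv-folder/csv_file_$((i + 1)).csv
  done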

Step 3: Upload these CSV files to Amazon S3

Instead of creating a folder in S3 and then moving these CSV files there one by one, we can use a shortcut: sync the local folder to S3. This not only creates the folder inside S3 and copies the files into our bucket, but if we later update the local folder, running the sync again will also update the S3 bucket.

command to do this:

$ aws s3 sync csv-folder s3://bucket-name/folder-name/csv-folder

Explanation:

csv-folder => the local folder where the multiple CSVs are stored.
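
After the sync finishes, it is worth listing the prefix to confirm that all the files arrived (bucket and folder names are the placeholders used above):

$ aws s3 ls s3://bucket-name/folder-name/csv-folder/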

Step 4: Load data into Redshift from S3 using the COPY command

Prerequisite: install psql first to access Redshift from EC2
$ yum install -y postgresql

To access Redshift from an EC2 instance:

$ psql -h redshift-cluster.amazonaws.com -U redshiftuser -d dev -p 1010

Explanation:

psql -h <endpoint> -U <userid> -d <databasename> -p <port>

Commands to copy data from S3 to Redshift:

  1. Create a table in Redshift similar to the one we have in MySQL (RDS); a minimal sketch is given after the COPY command below.
  2. Run the COPY command:
#copy new-table-name from 's3://bucket-name/folder-name/csv-folder/' iam_role 'arn:your_arn' DELIMITER ',' IGNOREHEADER 1 emptyasnull blanksasnull removequotes escape;
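
For step 1 above, the target table only needs to mirror the column order and data types of the MySQL source. A minimal sketch, assuming a hypothetical three-column table; the column names and types are illustrative, and new-table-name is the same placeholder used in the COPY command (replace it with a valid table name):

#create table new-table-name (
  id bigint,
  name varchar(255),
  created_at timestamp
);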

Step 5: Vacuum the table to sort rows and reclaim space

#vacuum table_name;

Example:

#vacuum new-table-name to 100 percent;
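
To verify that the data was loaded correctly (the last item in the outline above), a quick check is to compare the row count with the MySQL source and look at Redshift's load-error log; stl_load_errors is a built-in Redshift system table:

#select count(*) from new-table-name;
#select * from stl_load_errors order by starttime desc limit 10;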

Now, after performing analysis with complex SQL queries, if we want to store the results back in S3, we can transfer data from Redshift to an S3 bucket with the UNLOAD command.
Command:
# UNLOAD ('query') TO 's3 path/file_name'
iam_role 'your_arn'
DELIMITER AS '|'
ADDQUOTES
NULL AS '';

Example:
UNLOAD
('SELECT *  FROM table-name  WHERE id=1')
TO 's3://bucket-name/folder/result.csv'
iam_role 'your-arn'
DELIMITER AS '|'
ADDQUOTES
NULL AS '';

Explanation:

DELIMITER AS '|' => it will use a pipe (|) as the separator in our result file.

NULL AS '' => it will write null values as empty strings.
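
Note: by default, UNLOAD splits the result into multiple file parts, using the given S3 path as a prefix. If a single output file is required, PARALLEL OFF can be added (Redshift enforces a maximum size per output file):

UNLOAD
('SELECT * FROM table-name WHERE id=1')
TO 's3://bucket-name/folder/result.csv'
iam_role 'your-arn'
DELIMITER AS '|'
ADDQUOTES
NULL AS ''
PARALLEL OFF;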


More insights about Redshift:

Amazon Redshift is based on an older version of PostgreSQL (8.0.2), and Redshift has made changes to that version. An initial preview beta was released in November 2012, and a full release was made available on February 15, 2013.

Note: the ARN and endpoint can be obtained from the AWS console.

Topics for the next blog:

Lambda service, SageMaker for ML