AWS Redshift — A Tiny Demo via Console and AWS CLI

Let us see how to create a Redshift cluster and load some data into it.

Amazon Redshift is a managed, petabyte-scale data warehouse service. It is a columnar database, well suited to high-speed analytical queries that scan a small number of columns (fields). We will not go into the broader advantages of columnar storage here.

Create a Redshift Cluster

Let us run the AWS CLI command to confirm that no Redshift cluster is running in our default AWS region.

aws redshift describe-clusters
{
    "Clusters": []
}

Before we create the cluster, we will create an IAM service role to attach to it. This allows Redshift to access the S3 bucket we will import data from. We will follow the steps from the AWS documentation.
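As a sketch, the trust policy that lets the Redshift service assume such a role looks like the following. The role name RedshiftDemoS3 is the one used later in the COPY command; attaching the AWS-managed AmazonS3ReadOnlyAccess policy is one way to grant the S3 read permission, though the exact permissions policy you attach may differ.

```python
import json

# Trust policy allowing the Redshift service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Saved to trust.json, this document would be passed to:
#   aws iam create-role --role-name RedshiftDemoS3 \
#       --assume-role-policy-document file://trust.json
#   aws iam attach-role-policy --role-name RedshiftDemoS3 \
#       --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
print(json.dumps(trust_policy, indent=2))
```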

Here are the screenshots of what I did.

If you already have the Redshift cluster created, you can attach this new service role by modifying the cluster.

Steps to Create the Redshift Cluster

Now we need to wait until the Redshift cluster’s endpoint is available.

OK, the endpoint is ready, but the console has detected that the port is not open.

Let us open port 5439 in the security group attached to this cluster. We will also need to allow our own IP address.
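In API terms, the ingress rule being added looks like the sketch below; the security-group ID and CIDR are placeholders, and the boto3 call is shown only as a comment since it needs live credentials.

```python
def redshift_ingress_rule(cidr: str, port: int = 5439) -> dict:
    """Build the IpPermissions entry that opens the Redshift port
    to a single address range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }

# With boto3 this rule would be applied as (placeholder group ID):
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-xxxxxxxx",
#       IpPermissions=[redshift_ingress_rule("203.0.113.10/32")])
print(redshift_ingress_rule("203.0.113.10/32"))
```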

OK, now that there are no errors, the endpoint should be accessible.

We can check the cluster status via the CLI as well. Note that in the output below the cluster is still in the "creating" state; wait until ClusterStatus shows "available".

aws redshift --region ap-south-1 describe-clusters
{
    "Clusters": [
        {
            "ClusterIdentifier": "sreenov4demo-ci",
            "NodeType": "dc2.large",
            "ClusterStatus": "creating",
            "MasterUsername": "root",
            "DBName": "sreenov4demodb",
            "AutomatedSnapshotRetentionPeriod": 1,
            "ClusterSecurityGroups": [],
            "VpcSecurityGroups": [
                {
                    "VpcSecurityGroupId": "sg-a84bd3c0",
                    "Status": "active"
                }
            ],
            "ClusterParameterGroups": [
                {
                    "ParameterGroupName": "default.redshift-1.0",
                    "ParameterApplyStatus": "in-sync"
                }
            ],
            "ClusterSubnetGroupName": "default",
            "VpcId": "vpc-10a96a79",
            "AvailabilityZone": "ap-south-1b",
            "PreferredMaintenanceWindow": "sun:05:00-sun:05:30",
            "PendingModifiedValues": {
                "MasterUserPassword": "****"
            },
            "ClusterVersion": "1.0",
            "AllowVersionUpgrade": true,
            "NumberOfNodes": 1,
            "PubliclyAccessible": true,
            "Encrypted": false,
            "ClusterPublicKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCjGA9EuKKqS/pUhvM093AzKk72XgC4zQMxM9eSisOE1chwP894IGd+/trCuCHmEz92sshSKrRXPhVF6/J5kAzS4cAgAs44jXVlO8Hbl8L6yBDXqaDNC344KJsn9aIpZx5K7gwjl/vWGIzHgR7jqMIuVr/e7R60KN88tJiKaTN7Qoo5JOli16Akd9iSVCyZi6oePXDStXlUiHERFoD5fvU90y80ryn47O0sNFe0q0bvrcUszRmVuqB7RuwI4Ega6ymyeold/skncyCHmqnZYQ2raiKHbuKiHB+RUJ499B9xxoQ5oBx4IZcfbH5 Amazon-Redshift\n",
            "ClusterNodes": [
                {
                    "NodeRole": "SHARED",
                    "PrivateIPAddress": "172.31.4.190",
                    "PublicIPAddress": "13.232.204.5"
                }
            ],
            "ClusterRevisionNumber": "4515",
            "Tags": [],
            "EnhancedVpcRouting": false,
            "IamRoles": []
        }
    ]
}
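Rather than re-running describe-clusters by hand, the status check is easy to script. A minimal sketch, assuming the JSON shape shown above (in a real loop you would re-invoke the CLI or boto3 and sleep between polls):

```python
import json

def cluster_status(describe_output: str, cluster_id: str) -> str:
    """Extract ClusterStatus for one cluster from the JSON printed
    by `aws redshift describe-clusters`."""
    data = json.loads(describe_output)
    for cluster in data["Clusters"]:
        if cluster["ClusterIdentifier"] == cluster_id:
            return cluster["ClusterStatus"]
    raise KeyError(f"no such cluster: {cluster_id}")

# Abbreviated output in the same shape as the transcript above:
sample = ('{"Clusters": [{"ClusterIdentifier": "sreenov4demo-ci",'
          ' "ClusterStatus": "creating"}]}')
print(cluster_status(sample, "sreenov4demo-ci"))  # creating
```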

We will now need a client to talk to our remote Redshift cluster. Since Amazon Redshift speaks the PostgreSQL protocol, the psql client utility can be used to connect to our cluster.

psql -h demo.cx01lraeixjw.ap-south-1.redshift.amazonaws.com -p 5439 --username=root demo
Password for user root:
psql (10.4, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

Create a Redshift Table

We will create a table that matches our input data dump. Note that the userid column serves as both the distribution key and the sort key.

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);

The newly created table structure can be seen below.

sreenov4demodb=# SELECT *    
sreenov4demodb-# FROM pg_table_def
sreenov4demodb-# WHERE tablename = 'users';
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+-----------+---------------+------------------------+----------+---------+---------+---------
public | users | userid | integer | none | t | 1 | t
public | users | username | character(8) | lzo | f | 0 | f
public | users | firstname | character varying(30) | lzo | f | 0 | f
public | users | lastname | character varying(30) | lzo | f | 0 | f
public | users | city | character varying(30) | lzo | f | 0 | f
public | users | state | character(2) | lzo | f | 0 | f
public | users | email | character varying(100) | lzo | f | 0 | f
public | users | phone | character(14) | lzo | f | 0 | f
public | users | likesports | boolean | none | f | 0 | f
public | users | liketheatre | boolean | none | f | 0 | f
public | users | likeconcerts | boolean | none | f | 0 | f
public | users | likejazz | boolean | none | f | 0 | f
public | users | likeclassical | boolean | none | f | 0 | f
public | users | likeopera | boolean | none | f | 0 | f
public | users | likerock | boolean | none | f | 0 | f
public | users | likevegas | boolean | none | f | 0 | f
public | users | likebroadway | boolean | none | f | 0 | f
public | users | likemusicals | boolean | none | f | 0 | f
(18 rows)

Some queries on the empty table.

sreenov4demodb=# SELECT username FROM users LIMIT 5;
username
----------
(0 rows)
sreenov4demodb=# SELECT username FROM users;
username
----------
(0 rows)
sreenov4demodb=# SELECT count(username) FROM users;
count
-------
0
(1 row)

Import Sample Data

We will use the AWS-provided TICKIT sample data dump.

aws s3 ls s3://awssampledbuswest2/tickit/
PRE spectrum/
2015-04-24 08:05:03 0
2015-04-24 08:05:06 445838 allevents_pipe.txt
2015-04-24 08:05:05 5893626 allusers_pipe.txt
2015-04-24 08:05:04 465 category_pipe.txt
2015-04-24 08:05:06 14534 date2008_pipe.txt
2015-04-24 08:05:06 11585036 listings_pipe.txt
2015-04-24 08:05:16 11260097 sales_tab.txt
2017-04-15 06:01:01 7830 svl_s3query.txt
2015-04-24 08:05:08 7988 venue_pipe.txt
aws s3 cp s3://awssampledbuswest2/tickit/allusers_pipe.txt .

The allusers_pipe.txt dump is a pipe-delimited text file, one user per line.

Now we import the data. You will get the error below if the role we created is misconfigured or not attached to the cluster.

sreenov4demodb=# copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
sreenov4demodb-# credentials 'aws_iam_role=arn:aws:iam::11111111111:role/RedshiftDemoS3'
sreenov4demodb-# delimiter '|' region 'us-west-2';
ERROR: User arn:aws:redshift:ap-south-1:1111111111:dbuser:sreenov4demo-ci/root is not authorized to assume IAM Role arn:aws:iam::11111111:role/RedshiftDemoS3
DETAIL:
-----------------------------------------------
error: User arn:aws:redshift:ap-south-1:504881750319:dbuser:sreenov4demo-ci/root is not authorized to assume IAM Role arn:aws:iam::11111111:role/RedshiftDemoS3
code: 8001
context: IAM Role=arn:aws:iam::11111111:role/RedshiftDemoS3
query: 91
location: xen_aws_credentials_mgr.cpp:238
process: padbmaster [pid=9061]
-----------------------------------------------

After attaching the role correctly, the import succeeds and 49,990 records are loaded.

sreenov4demodb=# copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
sreenov4demodb-# credentials 'aws_iam_role=arn:aws:iam::11111111:role/RedshiftDemoS3'
sreenov4demodb-# delimiter '|' region 'us-west-2';
INFO: Load into table 'users' completed, 49990 record(s) loaded successfully.
COPY
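If you load many dumps this way, assembling the COPY statement programmatically avoids copy-paste mistakes in the ARN and delimiter. A hypothetical helper (the bucket path and role ARN below are the ones from the example above):

```python
def build_copy(table: str, s3_path: str, role_arn: str,
               delimiter: str = "|", region: str = "us-west-2") -> str:
    """Assemble a Redshift COPY statement for a delimited S3 dump."""
    return (
        f"copy {table} from '{s3_path}' "
        f"credentials 'aws_iam_role={role_arn}' "
        f"delimiter '{delimiter}' region '{region}';"
    )

stmt = build_copy(
    "users",
    "s3://awssampledbuswest2/tickit/allusers_pipe.txt",
    "arn:aws:iam::11111111:role/RedshiftDemoS3",
)
print(stmt)
```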

Let us get a feel for Redshift's speed by running some single-column queries.

sreenov4demodb=# SELECT username FROM users LIMIT 5;
username
----------
PGL08LJI
XDZ38RDD
AEB55QTM
OWY35QYB
MSD36KVR
(5 rows)
sreenov4demodb=# SELECT count(likesports) FROM users WHERE likesports is TRUE;
count
-------
8682
(1 row)
sreenov4demodb=# SELECT firstname FROM users WHERE firstname like '%ben%' limit 5;
firstname
-----------
Reuben
Reuben
Reuben
Reuben
Reuben
(5 rows)
sreenov4demodb=# SELECT firstname FROM users WHERE firstname like 'Ben%' limit 5;
firstname
-----------
Benjamin
Benjamin
Benedict
Benjamin
Benedict
(5 rows)
sreenov4demodb=# SELECT firstname FROM users WHERE firstname like 'Sr%' limit 5;
firstname
-----------
(0 rows)
sreenov4demodb=# SELECT firstname FROM users WHERE firstname like 'S%' limit 5;
firstname
-----------
Scarlett
Shafira
Sean
Serina
Steel
(5 rows)

The AWS Console has these views for the cluster.

The “Loads” tab shows the data imports.

Performance view.

Queries we ran on the cluster and the speed.

Cleaning Up

Deleting the Redshift Cluster
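Besides the console, the teardown can be done with a single CLI command; sketched here by building the argument list (the --skip-final-cluster-snapshot flag skips the final snapshot, so drop it if you want to keep one):

```python
def delete_cluster_cmd(cluster_id: str, skip_snapshot: bool = True) -> list:
    """Build the `aws redshift delete-cluster` invocation."""
    cmd = ["aws", "redshift", "delete-cluster",
           "--cluster-identifier", cluster_id]
    if skip_snapshot:
        cmd.append("--skip-final-cluster-snapshot")
    return cmd

# subprocess.run(delete_cluster_cmd("sreenov4demo-ci")) would issue it.
print(" ".join(delete_cluster_cmd("sreenov4demo-ci")))
```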

Thanks for your time. For more such tiny demos do remember to follow.

Written by Sreeprakash Neelakantan

AWS Certified DevOps Engineer & Solutions Architect Professional — Docker | Kubernetes | DevOps — Trainer | Running | Swimming | Cycling
