Creating a DataWarehouse Using Amazon Redshift for StackOverflow Data

Steps to create a data warehouse and automate the process of loading pre-processed data using a PySpark script on EMR

Sneha Mehrin
The Startup
Aug 15, 2020


Image credit: https://scpolicycouncil.org/

This article is part of a series and a continuation of the previous post, where we processed the streamed data using Spark on EMR.

Why use Redshift?

Redshift is a fully managed data warehouse solution provided by Amazon. It is designed for analytic workloads and delivers fast query and I/O performance for any dataset by using two key techniques: columnar storage and massively parallel processing.

Why is columnar storage so important?

  • Transactional operations differ significantly from analytical queries.
  • Transactional operations are indexed and very fast; analytical queries, however, run over huge datasets and can take a long time to compute.
  • Relational databases store data in rows and are indexed by primary key for fast access. This is perfect for insert, append, or upsert operations.
  • Analytical queries, however, are interested in aggregates (for example, the sum of revenue for a particular region). Answering such a query in a row-based database requires scanning the entire table, including all the unused columns, which is inefficient and heavy on performance.
  • Columnar databases address this by storing data as columns instead of rows. This allows the data warehouse to store the values of the same column sequentially on disk, which enables much faster reads.

To summarise, here are the key advantages of Redshift:

  • Read less data
  • Better compression
  • Perfect for OLAP
  • Parallel processing, by distributing queries across multiple nodes.

How to Create a Redshift Cluster?

Amazon has made it extremely easy to create a Redshift cluster. Below are the key steps.

Step 1: Log in to the AWS console, choose Redshift as the service, and click on Create cluster.

Step 2: Configure the user name and password.

Step 3: Choose an IAM role for Redshift (this can be the default IAM role).

Step 4: Choose a security group and make the cluster publicly accessible.

The security group step is really important; without it you cannot connect to Redshift using DataGrip.

In the security group, make sure you have an inbound rule configured for the Redshift port.

You can set the source to either My IP or the Anywhere option.
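If you prefer to add the inbound rule programmatically, here is a minimal boto3 sketch; the security group ID and CIDR range are placeholders you would replace with your own values.

import boto3

# Open the default Redshift port (5439) on the cluster's security group.
# The security group ID and CIDR range below are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # your cluster's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,            # Redshift listens on 5439 by default
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "Redshift access; restrict to My IP in practice"}],
    }],
)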

Step 5: Configure DataGrip to connect to Redshift.

DataGrip is designed to query, create and manage databases. If you have a JDBC connection, you can easily connect to any database.

You can install DataGrip from this site https://www.jetbrains.com/help/datagrip/installation-guide.html

From the properties panel of the Redshift cluster, copy the JDBC URL.

Provide the URL, username, and password, and test the connection in DataGrip.
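For reference, the JDBC URL follows this pattern (the cluster endpoint and database name below are placeholders):

jdbc:redshift://<your-cluster>.abc123xyz.us-east-1.redshift.amazonaws.com:5439/dev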

Step 6: Create a table in Redshift with the query below.

The schema should match that of the source data (in this case, our Stack Overflow data).
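A sketch of what such a table definition might look like, assuming typical Stack Overflow question fields; the column names and types are assumptions and should be adjusted to match your pre-processed data.

-- Illustrative schema only; adjust columns and types to match the processed data.
CREATE TABLE IF NOT EXISTS stackoverflow (
    question_id   BIGINT,
    title         VARCHAR(500),
    tags          VARCHAR(500),
    is_answered   BOOLEAN,
    view_count    INTEGER,
    answer_count  INTEGER,
    score         INTEGER,
    creation_date TIMESTAMP
);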

Our basic Redshift cluster creation and configuration is done!!

Next Steps

  • We need to write our pre-processed dataframe into the Redshift cluster using PySpark. This will be the last step in the script we created in the previous post.
  • Below is a snippet of the code for writing our Spark dataframe into our Redshift cluster.

We use spark-redshift for this purpose, a library for loading data from Amazon Redshift into Spark SQL DataFrames and writing them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.
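A minimal sketch of the write step, assuming the processed DataFrame is called df and using placeholders for the cluster endpoint, credentials, and S3 temp bucket:

# Write the processed DataFrame into the stackoverflow table in Redshift.
# The JDBC URL, credentials, and S3 temp bucket are placeholders.
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev?user=<user>&password=<password>") \
    .option("dbtable", "stackoverflow") \
    .option("tempdir", "s3n://<your-bucket>/redshift-temp/") \
    .mode("append") \
    .save()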

However, this is easier said than done, and I think this was the most challenging part of this project.

Key dependencies for PySpark to be able to call the spark-redshift library are as follows:

  • In order to connect to Redshift using the Amazon Redshift JDBC driver, the JDBC application or Java code that you are using to connect to your data must be able to access the driver JAR files. The driver JAR files can be downloaded from http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver and need to be available on the EMR cluster before the PySpark script runs.
  • The PySpark script should be able to find the necessary packages at the moment it executes. org.apache.spark:spark-avro_2.11:2.4.3 and com.databricks:spark-redshift_2.11:2.0.1 are the two packages that need to be added as part of the spark-submit step.
  • Since spark-redshift uses S3 as temporary storage, you need to set up proper access keys in the script. To achieve this, we add our AWS access key and secret access key to our PySpark script.

After adding all these dependencies, our final script looks like the one below:
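A condensed sketch of how these dependencies come together in the script; the access keys, bucket, and endpoint are placeholders, and the preprocessing itself is the logic from the previous post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stackoverflow-redshift-load").getOrCreate()
sc = spark.sparkContext

# spark-redshift stages data in S3, so the S3 credentials must be available
# to Hadoop. The keys are placeholders; an IAM role is preferable in practice.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<AWS_ACCESS_KEY_ID>")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<AWS_SECRET_ACCESS_KEY>")

# ... preprocessing from the previous post produces the DataFrame processed_df ...

processed_df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev?user=<user>&password=<password>") \
    .option("dbtable", "stackoverflow") \
    .option("tempdir", "s3n://<your-bucket>/redshift-temp/") \
    .mode("append") \
    .save()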

We also have a bash script that calls this PySpark script with all the necessary packages and JARs.
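A minimal sketch of that bash script; the PySpark script and JDBC driver file names are placeholders, while the two package coordinates are the ones listed above.

#!/bin/bash
# Execute.sh - submit the PySpark job with the spark-avro and spark-redshift
# packages plus the Redshift JDBC driver JAR (file names are placeholders).
spark-submit \
    --jars /home/hadoop/RedshiftJDBC42-<version>.jar \
    --packages org.apache.spark:spark-avro_2.11:2.4.3,com.databricks:spark-redshift_2.11:2.0.1 \
    /home/hadoop/stackoverflow_processing.py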

What Next?

  • Now that we have our scripts ready, we need to test them on an EMR cluster before we automate everything with boto3.

Let’s see how to do that!!

Key Steps for testing in EMR Cluster

Step 1: Create a dummy EMR cluster with Spark added.

Step 2: Copy the JAR file and our two scripts into the hadoop folder of the EMR cluster.

You can run the below commands in the terminal to copy the files from your local machine to the EMR cluster.
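Something along these lines, using the same key pair and master node address as the SSH command below; the PySpark script and driver JAR names are placeholders.

# Copy the bash script, the PySpark script, and the JDBC driver JAR
# to the master node (script and JAR names are placeholders).
scp -i bigdata.pem Execute.sh hadoop@ec2-3-88-110-90.compute-1.amazonaws.com:/home/hadoop/
scp -i bigdata.pem stackoverflow_processing.py hadoop@ec2-3-88-110-90.compute-1.amazonaws.com:/home/hadoop/
scp -i bigdata.pem RedshiftJDBC42-<version>.jar hadoop@ec2-3-88-110-90.compute-1.amazonaws.com:/home/hadoop/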

If your copy is successful, your terminal will confirm the transfer of each file.

Step 3: SSH into the EMR cluster and check that the files exist.

ssh -i bigdata.pem hadoop@ec2-3-88-110-90.compute-1.amazonaws.com

You can use the ls command to list all the files in the /home/hadoop directory.

Step 4: Execute the bash script to run our PySpark script.

sudo sh Execute.sh

If your script is successful, you will see a completion message in your terminal.

Step 5: After this is done, you can check the records in your stackoverflow table in DataGrip.

If you get any errors while loading data into Redshift, they won't be very evident from the terminal.

You will need to run the below query to understand the error.

SELECT le.starttime, d.query, d.line_number, d.colname, d.value, le.raw_line, le.err_reason
FROM stl_loaderror_detail d
JOIN stl_load_errors le
ON d.query = le.query
ORDER BY le.starttime DESC;

Automating Above Steps using boto3

Now that we know our script works, we need to automate it. I want the EMR cluster to be a transient cluster, as it only runs for about an hour every day.

So I created two scripts: one to create the EMR cluster and copy all the scripts from my S3 location, and a second to execute my bash script, which runs the PySpark script.
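A hedged sketch of what the first script might look like with boto3; the release label, instance types, key pair, bucket, and paths are all placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient EMR cluster with Spark and copy the scripts from S3
# onto the master node as the first step. All names and paths are placeholders.
response = emr.run_job_flow(
    Name="stackoverflow-transient-cluster",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "bigdata",
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up until the second script submits the job
    },
    Steps=[{
        "Name": "Copy scripts from S3",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["aws", "s3", "cp", "s3://<your-bucket>/scripts/", "/home/hadoop/", "--recursive"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])

The second script can then call emr.add_job_flow_steps() against the returned JobFlowId, with a command-runner.jar step that runs the bash script (for example, ["bash", "/home/hadoop/Execute.sh"]).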

We would need to schedule these scripts on an EC2 instance using cron jobs.

  • I intend to schedule the first script, which launches the cluster, at 12:00 am every day; this ensures it picks up all the streams created the previous day.
  • The second script, which runs the PySpark script, should run one hour after that.

The reason I had to do it this way is that AWS runs the steps in a random order to optimize resources, so I was not able to run them sequentially.
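On the EC2 instance, the two crontab entries might look like this; the script names and paths are placeholders.

# Launch the EMR cluster at midnight, then run the Spark step an hour later.
0 0 * * * python3 /home/ec2-user/create_emr_cluster.py >> /home/ec2-user/emr_launch.log 2>&1
0 1 * * * python3 /home/ec2-user/submit_spark_step.py >> /home/ec2-user/emr_step.log 2>&1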

Now, let’s finally get to the data visualization part using Einstein Analytics.

This is covered in detail in this post.
