How To Connect AWS Redshift to Python Notebook

Using SQLAlchemy in Python to easily work with Redshift query results as pandas dataframes.

Abraham Setiawan
CodeX
4 min read · Oct 7, 2022

Photo by Tran Mau Tri Tam ✪ on Unsplash

Do you or your organization use AWS Redshift to store data, but you are more comfortable working in a Python notebook (e.g. Jupyter Lab or Google Colab)? Ever wondered how to connect the two so you can work smoothly and do cool visualizations in Python, or anywhere else for that matter?

This was the question I had in mind while working on a client project. As a first step, I checked AWS’ official documentation. I mean, it works, but it’s not exactly as smooth as I expected it to be. I also checked whether I could save the results of queries made in the Redshift console as a CSV file, but that documentation is a bit tricky to follow.

After conducting more research, I found that Redshift is built on top of PostgreSQL and hosted on AWS servers, so it speaks the same protocol and works with standard PostgreSQL tooling (and while we’re at it, SageMaker notebooks are really just Jupyter Lab with an AWS label slapped on them). When I figured that was the case, I knew I could use the classic Python library SQLAlchemy to help me with my predicament.

First of all, to make this solution work, you have to make the database publicly accessible in the Redshift configuration. On the cluster’s overview page, click Actions → Modify publicly accessible setting, make sure it’s enabled, and save the change.

Change Redshift configuration (Image by author)
Enable public access (Image by author)

Then we can go to Jupyter Lab and import the libraries we need: pandas, os, and the create_engine function from the sqlalchemy library. We also need the psycopg2 library installed to handle the PostgreSQL connection. Importing it isn’t strictly necessary, but it’s a quick way to check that it’s installed on our system.
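A minimal sketch of those imports could look like this (the original notebook is linked at the end of the article; this is just an illustration):

```python
import os

import pandas as pd
import psycopg2  # not used directly, but importing it confirms the driver is installed
from sqlalchemy import create_engine
```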

If you don’t have the libraries above installed yet, you can use pip to install them first.
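For example, in a notebook cell (psycopg2-binary is one common way to get the psycopg2 driver, though your environment may differ):

```python
# Run this in a notebook cell if the libraries are missing
!pip install pandas sqlalchemy psycopg2-binary
```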

Now, we need to get the credentials from AWS. What we need are: username, password, endpoint, port, and database name. The username and password are the ones used to log in to Redshift. The endpoint and port can be copied from the overview page under the Endpoint section; note that the format shown there is (endpoint):(port)/(extra info). The endpoint looks like a long URL and the port looks like a 4-digit number, and we just need those two. The database name is quite self-explanatory; you or your database admin should know it.

Endpoint and port (Image by author)

Once we have the credentials, we can put them into our notebook. I like to use os.environ for this step, because it keeps the credentials out of the notebook itself and makes the workflow smoother.
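Here is a sketch of reading the credentials from environment variables. The variable names are just examples I chose for illustration; set them in your shell or notebook environment beforehand:

```python
# Example environment variable names; adapt them to whatever you exported
REDSHIFT_USER = os.environ["REDSHIFT_USER"]
REDSHIFT_PASSWORD = os.environ["REDSHIFT_PASSWORD"]
REDSHIFT_ENDPOINT = os.environ["REDSHIFT_ENDPOINT"]       # the long URL-like hostname, without the port
REDSHIFT_PORT = os.environ.get("REDSHIFT_PORT", "5439")   # Redshift's default port
REDSHIFT_DATABASE = os.environ["REDSHIFT_DATABASE"]
```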

Now, we need to create the SQLAlchemy engine with the credentials above.
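Since Redshift accepts PostgreSQL-style connections, the engine can be built with a postgresql+psycopg2 connection string, roughly like this (a sketch using the example variables above):

```python
# Build a PostgreSQL-style connection string and create the SQLAlchemy engine
conn_str = (
    f"postgresql+psycopg2://{REDSHIFT_USER}:{REDSHIFT_PASSWORD}"
    f"@{REDSHIFT_ENDPOINT}:{REDSHIFT_PORT}/{REDSHIFT_DATABASE}"
)
engine = create_engine(conn_str)
```

If your password contains special characters, URL-encode it (for example with urllib.parse.quote_plus) before building the string.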

With the hard part done, we can now use pd.read_sql() to execute a SQL query through the engine created earlier and store the result as a pandas dataframe.

Example query (Image by author)
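For illustration, a query could look like the following sketch (the table name here is hypothetical; replace it with a table from your own database):

```python
# "sales" is a placeholder table name for this example
query = "SELECT * FROM sales LIMIT 10;"
df = pd.read_sql(query, engine)
df.head()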

And, that’s it! Since it’s a pandas dataframe, you can do whatever operation you would on any other dataframe. You can save it as a CSV file with df.to_csv(), plot it with df.plot(), or run whatever operation you need.
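For example (the file name here is just an illustration):

```python
# Save the result as a CSV file, or plot it straight from the dataframe
df.to_csv("redshift_query_results.csv", index=False)
df.plot()
```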

I hope this tutorial helps you pull data from Redshift into a pandas dataframe, something you’re familiar with and confident in using. You can find the full notebook on this GitHub page. Good luck querying and analyzing the data!


Abraham Setiawan
CodeX

Data Analyst student at Hyper Island with experience in product and innovation. I write about my journey in the data world. Website: abrahamsetiawan.com