How To Connect AWS Redshift to Python Notebook

Using SQLAlchemy in Python to easily work with Redshift query results as pandas dataframes.

Abraham Setiawan
CodeX
4 min read · Oct 7, 2022

Photo by Tran Mau Tri Tam ✪ on Unsplash

Do you or your organization use AWS Redshift to store data, but you are more comfortable working in a Python notebook (e.g. Jupyter Lab or Google Colab)? Ever wondered how to connect the two so you can work smoothly and do cool visualizations in Python, or anywhere else for that matter?

This was the question I had in mind while working on a client project. As a first step, I checked AWS’ official documentation. I mean, it works, but it’s not exactly as smooth as I expected it to be. I also checked whether I could save the results of queries made in the Redshift console as a CSV file, but that documentation is a bit tricky to follow.

After conducting more research, I found that Redshift is built on top of PostgreSQL and hosted on AWS servers, so it speaks the same protocol and works with standard PostgreSQL tooling (and while we’re at it, SageMaker notebooks are really just Jupyter Lab with an AWS label slapped on them). When I figured that was the case, I knew I could use the classic Python library SQLAlchemy to help me with my predicament.

First of all, to make this solution work, you have to make the database publicly accessible in the Redshift configuration. On the cluster’s overview page, click Actions → Modify publicly accessible setting, make sure it’s enabled, and save the change.

Change Redshift configuration (Image by author)
Enable public access (Image by author)

Then we can go to Jupyter Lab and import the libraries we need: pandas, os, and the create_engine function from the sqlalchemy library. We also need the psycopg2 library installed to handle the PostgreSQL connection. Importing it isn’t strictly necessary, but it’s a quick way to check that it’s installed on our system.
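A minimal sketch of those imports could look like this (the original notebook is linked at the end of the article; this is just an illustration):

```python
import os

import pandas as pd
import psycopg2  # not used directly, but importing it confirms the driver is installed
from sqlalchemy import create_engine
```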

If you don’t have the libraries above installed yet, you can use pip to install them first.
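For example, in a notebook cell (psycopg2-binary is one common way to get the psycopg2 driver, though your environment may differ):

```python
# Run this in a notebook cell if the libraries are missing
!pip install pandas sqlalchemy psycopg2-binary
```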

Now, we need to get the credentials from AWS. What we need are: username, password, endpoint, port, and database name. The username and password are the ones used to log in to Redshift. The endpoint and port can be copied from the overview page under the Endpoint section; note that the format shown there is (endpoint):(port)/(extra info). The endpoint looks like a long URL and the port looks like a 4-digit number, and we just need those two. The database name is quite self-explanatory; you or your database admin should know it.

Endpoint and port (Image by author)

Once we have the credentials, we can put them into our notebook. I like to use os.environ for this step, because it keeps the credentials out of the notebook itself and makes the workflow smoother.
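Here is a sketch of reading the credentials from environment variables. The variable names are just examples I chose for illustration; set them in your shell or notebook environment beforehand:

```python
# Example environment variable names; adapt them to whatever you exported
REDSHIFT_USER = os.environ["REDSHIFT_USER"]
REDSHIFT_PASSWORD = os.environ["REDSHIFT_PASSWORD"]
REDSHIFT_ENDPOINT = os.environ["REDSHIFT_ENDPOINT"]       # the long URL-like hostname, without the port
REDSHIFT_PORT = os.environ.get("REDSHIFT_PORT", "5439")   # Redshift's default port
REDSHIFT_DATABASE = os.environ["REDSHIFT_DATABASE"]
```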

Now, we need to create the SQLAlchemy engine with the credentials above.
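Since Redshift accepts PostgreSQL-style connections, the engine can be built with a postgresql+psycopg2 connection string, roughly like this (a sketch using the example variables above):

```python
# Build a PostgreSQL-style connection string and create the SQLAlchemy engine
conn_str = (
    f"postgresql+psycopg2://{REDSHIFT_USER}:{REDSHIFT_PASSWORD}"
    f"@{REDSHIFT_ENDPOINT}:{REDSHIFT_PORT}/{REDSHIFT_DATABASE}"
)
engine = create_engine(conn_str)
```

If your password contains special characters, URL-encode it (for example with urllib.parse.quote_plus) before building the string.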

With the hard part done, we can now use pd.read_sql() to execute a SQL query through the engine created earlier and store the result as a pandas dataframe.

Example query (Image by author)
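For illustration, a query could look like the following sketch (the table name here is hypothetical; replace it with a table from your own database):

```python
# "sales" is a placeholder table name for this example
query = "SELECT * FROM sales LIMIT 10;"
df = pd.read_sql(query, engine)
df.head()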

And, that’s it! Since it’s a pandas dataframe, you can do whatever operation you would on any other dataframe. You can save it as a CSV file with df.to_csv(), plot it with df.plot(), or run whatever operation you need.
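For example (the file name here is just an illustration):

```python
# Save the result as a CSV file, or plot it straight from the dataframe
df.to_csv("redshift_query_results.csv", index=False)
df.plot()
```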

I hope this tutorial helps you pull data from Redshift into a pandas dataframe, something you’re familiar with and confident in using. You can find the full notebook on this GitHub page. Good luck querying and analyzing the data!


Abraham Setiawan
CodeX

Data Analyst student at Hyper Island with experience in product and innovation. I write about my journey in the data world. Website: abrahamsetiawan.com