Importing data from a PostgreSQL database to a Pandas DataFrame

Alexandre Stamm
Jun 11, 2022 · 4 min read



In this article, we’ll go over how to create a pandas DataFrame using a simple connection and query to fetch data from a PostgreSQL database that requires authentication.

Most people who work with data know that SQL is great for working with tabular data.

Depending on the company you work for, you may not have access to no-code BI tools like Tableau or Power BI. That means that when doing your analysis, you will probably have to perform all transformations and aggregations inside your DBMS, export the data to a .csv file, and do the visualizations in Excel or Google Sheets. But if you know Python, you'll probably want to work with the dataset in a Jupyter notebook with Pandas, right?

To work with PostgreSQL databases in Python, I use psycopg2, the most popular PostgreSQL database adapter for Python.

The documentation for psycopg2 is found here: https://www.psycopg.org/docs/

Installing psycopg2 and connecting to the database

To use psycopg2, we first need to install it:

pip install psycopg2

Then, we need to connect to the database using the connect() function (docs):

conn = psycopg2.connect("dbname=test user=postgres password=secret")

The basic connection parameters are:

· dbname — the database name
· user — user name used to authenticate
· password — password used to authenticate
· host — database host address
· port — connection port number (defaults to 5432 if not provided)
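The same connection can also be made by passing these parameters as keyword arguments instead of a single connection string; the values below are just placeholders:

import psycopg2

# Equivalent connection using keyword arguments (placeholder values)
conn = psycopg2.connect(
    dbname="test",
    user="postgres",
    password="secret",
    host="localhost",
    port=5432,
)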

After creating the connection, we need to open a cursor to perform database operations (docs):

cur = conn.cursor()

Then we need to execute a command using the cursor. We can do all sorts of commands (CREATE TABLE, INSERT INTO, SELECT…). Here we are going to focus on using the cursor to fetch data:

cur.execute("SELECT * FROM test;"): Execute the query;
cur.fetchone(): Fetch the next row of a query result set, returning a single tuple, or None when no more data is available;
cur.fetchmany([size=cursor.arraysize]): Fetch the next set of rows of a query result, returning a list of tuples;
cur.fetchall(): Fetch all (remaining) rows of a query result, returning them as a list of tuples.
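As a quick illustration of how these calls fit together (assuming the same test table from the connection example above), a fetch session might look like this:

cur = conn.cursor()
cur.execute("SELECT * FROM test;")
first_row = cur.fetchone()    # a single tuple, or None if the result set is empty
next_rows = cur.fetchmany(5)  # up to 5 of the following rows, as a list of tuples
the_rest = cur.fetchall()     # all remaining rows, as a list of tuples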

If any changes were made to the database, we need to commit them:

conn.commit()

When we are done, we need to close the communication with the database:

cur.close()
conn.close()
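psycopg2 connections and cursors can also be used as context managers. Note that the with block commits (or rolls back) the transaction and closes the cursor, but it does not close the connection itself:

with conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM test;")
        rows = cur.fetchall()
# The transaction was committed and the cursor closed,
# but the connection still needs to be closed explicitly:
conn.close()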

Wrapping all this up in a function for convenience:

First, let's say we want to connect to our work database and it requires authentication. To avoid writing our credentials directly in our notebook and exposing them if we upload our code to GitHub or share the notebook with a colleague after our analysis is done, we can create a dotenv (.env) file with our credentials and load it in our connect function. We can then add the .env file to .gitignore and share our notebook freely without leaking our credentials.

Here’s how to do it:

Create a .env file and write our credentials in it:

Since I'm using Linux, I use a simple "touch .env" command to create the file. On other operating systems, just create the file whichever way is easiest for you.

Then, open the file and write our credentials:

DB_USERNAME=name
DB_PASSWORD=password
DB_PATH=path_to_database
DB_NAME=db_name

(If the database is hosted on AWS, for example, DB_PATH can be an address like yourcompany-database-address.us-east.rds.amazonaws.com.)

Save and close the file.

Then, we are going to use the python-dotenv library to load our credentials:

pip install python-dotenv

from dotenv import load_dotenv
load_dotenv()  # take environment variables from .env
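After load_dotenv() runs, the values from the .env file are available through os.environ or os.getenv(); a quick sanity check (printing a non-secret value) might look like this:

import os

# If this prints None, the .env file was not found or not loaded
print(os.getenv("DB_USERNAME"))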

Now, here is the function I use to connect to the database:

import psycopg2
import os
import sys

def connect():
    """ Connect to the database """
    conn = None
    try:
        print('Connecting…')
        conn = psycopg2.connect(
            host=os.environ['DB_PATH'],
            database=os.environ['DB_NAME'],
            user=os.environ['DB_USERNAME'],
            password=os.environ['DB_PASSWORD'])
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        sys.exit(1)
    print("All good, connection successful!")
    return conn

And now we write a function to create a pandas DataFrame using a SELECT query:

import pandas as pd

def sql_to_dataframe(conn, query, column_names):
    """
    Import data from a PostgreSQL database using a SELECT query
    """
    cursor = conn.cursor()
    try:
        cursor.execute(query)
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        cursor.close()
        return 1
    # The execute returns a list of tuples:
    tuples_list = cursor.fetchall()
    cursor.close()
    # Now we need to transform the list into a pandas DataFrame:
    df = pd.DataFrame(tuples_list, columns=column_names)
    return df

For convenience, I also recommend writing these functions into a .py file so we can reuse them whenever we want, as in the sketch below.
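For example, a functions.py module (the file name matches the import used in the next section; the function bodies are the ones defined above) might be laid out like this:

# functions.py: reusable database helpers (layout sketch)
import os
import sys

import pandas as pd
import psycopg2


def connect():
    """ Connect to the database """
    ...  # body as defined above


def sql_to_dataframe(conn, query, column_names):
    """ Import data from a PostgreSQL database using a SELECT query """
    ...  # body as defined above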

Loading information from the database as a pandas DataFrame

Now that we have everything we need, let's finally put it all together and load our DataFrame from the database in our Jupyter notebook. We're going to do that by importing Pandas, psycopg2, and our custom-made functions for connecting and loading the data. Here's the whole code:

#imports
import pandas as pd
import numpy as np
import os
import psycopg2
from dotenv import load_dotenv
from functions import sql_to_dataframe, connect
load_dotenv()
# creating a query variable to store our query to pass into the function
query = """ SELECT column1,
                   column2,
                   column3
            FROM database """

# creating a list with column names to pass into the function
column_names = ['column1', 'column2', 'column3']

# opening the connection
conn = connect()

# loading our dataframe
df = sql_to_dataframe(conn, query, column_names)

# closing the connection
conn.close()

# Let's see if we loaded the df successfully
df.head()

Conclusion

Finally, we’re done!

In this tutorial, you learned how to import data from a PostgreSQL database to a pandas DataFrame, while also keeping your credentials safe and writing a useful function that you can reuse in the future!

Thanks for reading!

Sources:

https://www.psycopg.org/docs/connection.html

https://www.postgresqltutorial.com/postgresql-python/connect/
