PostgreSQL Integration with Jupyter Notebook
For a more seamless data extraction experience.
Jupyter Notebook is a great tool for cataloging the steps taken during data visualization and analysis. It supports Python code, Markdown, HTML and, thanks to a few libraries, PostgreSQL! This lets you write SQL right in the notebook to query or modify the database, and you can load any query's results into a pandas DataFrame to go straight to analysis. It has made me enjoy using SQL and Jupyter Notebook even more!
Prerequisites:
- Jupyter Notebook
- PostgreSQL and a database (sample database & how to load)
Libraries to install:
- ipython-sql
- sqlalchemy
- A Python database API (DBAPI) driver
The first two libraries can be installed with pip:
pip install ipython-sql
pip install sqlalchemy
The third library depends on which SQL database you use. For PostgreSQL, the driver is psycopg2:
pip install psycopg2
(If the build fails due to missing PostgreSQL headers, pip install psycopg2-binary installs a precompiled version.)
Different DBAPIs are documented on sqlalchemy’s website, like MySQL, Oracle, and SQLite (thanks Dave Campbell).
ipython-sql
Created by catherinedevlin on GitHub, this library enables the %sql and %%sql magic functions, allowing you to write SQL-style code right in Jupyter Notebook.
sqlalchemy
Originally authored by Michael Bayer, sqlalchemy is touted as a “SQL toolkit and object-relational-mapper” for Python. For the purposes of this tutorial, it will mainly be used to store SQL queries into a pandas dataframe.
Now that we have our libraries installed, let’s move on to Jupyter Notebook!
Getting Started
To load ipython-sql, use the following magic command:
%load_ext sql
Next, we will only need the create_engine() function from sqlalchemy, so let's import it with the following line:
from sqlalchemy import create_engine
Once we’ve laid the groundwork, we can now connect to a PostgreSQL database!
Connecting to a PostgreSQL database
To connect ipython-sql to your database, use the following format:
# Format
%sql dialect+driver://username:password@host:port/database

# Example
%sql postgresql://postgres:password123@localhost/dvdrental
- dialect+driver in this case would just be postgresql, but feel free to use a different database software here.
- username:password is where you substitute your username and password.
- host is usually just localhost.
- In my experience, port does not need to be specified.
- database is the name of the database to connect to.
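If you do want to specify the port, PostgreSQL listens on 5432 by default, so the connection string from the example above would become:

```
%sql postgresql://postgres:password123@localhost:5432/dvdrental
```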
To see if it successfully connected, Jupyter Notebook will print the following:
'Connected: username@database'
To connect sqlalchemy to the database, the format is largely the same, but we will create a new object called engine using the create_engine() function:
# Format
engine = create_engine('dialect+driver://username:password@host:port/database')

# Example
engine = create_engine('postgresql://postgres:password123@localhost/dvdrental')
For the most part, you can just copy and paste what you wrote after the %sql magic command, wrap it in quotes, and pass it as the argument to create_engine().
As long as you see no errors in the output, you should be connected to the database!
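One quick way to sanity-check the engine is to run a trivial query through it. The sketch below uses an in-memory SQLite database purely so it runs anywhere; with PostgreSQL you would reuse the engine created above instead of making a new one:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the PostgreSQL URL so this sketch is
# self-contained; with PostgreSQL, reuse the engine you created above.
engine = create_engine("sqlite:///:memory:")

# If the connection works, this trivial query returns 1.
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()

print(result)
```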
Writing SQL Commands in Jupyter Notebook
To query the database or run other commands, start a cell with the %%sql magic command and add your SQL code after it. The SQL code should be in its own cell, separate from Python code (thanks Daniel Upton for pointing that out!). Here's sample code that queries the first three rows and all of the columns in the inventory table:
If you'd like to store your query results in a pandas DataFrame, this is where sqlalchemy comes in. Create a DataFrame using pd.read_sql() (this assumes you've already run import pandas as pd). It takes two arguments:
- Your SQL query, encapsulated in quotes.
- The engine object you created earlier using the create_engine() function.
The following is an example of how to build a DataFrame, inventory_dataframe, from the inventory table query described above.
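Here is a runnable sketch of that pattern. It substitutes an in-memory SQLite engine and a tiny made-up inventory table so the example is self-contained; with your PostgreSQL engine from earlier, the pd.read_sql() call is identical:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Stand-in engine: SQLite in memory instead of the PostgreSQL engine from earlier.
engine = create_engine("sqlite:///:memory:")

# A tiny made-up inventory table so the query has something to read.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE inventory (inventory_id INTEGER, film_id INTEGER, store_id INTEGER)"
    ))
    conn.execute(text(
        "INSERT INTO inventory VALUES (1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 2)"
    ))

# Same two arguments as described above: the query string and the engine.
inventory_dataframe = pd.read_sql("SELECT * FROM inventory LIMIT 3;", engine)
print(inventory_dataframe)
```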
Once you have your query in a DataFrame, feel free to continue with your data analysis!