Create Your First PostgreSQL Database in Python With Psycopg2
Today I am going to show you how to create and modify a PostgreSQL database in Python, with the help of the psycopg2 library.
Unlike SQLAlchemy, which generates SQL queries while mapping the database schema to Python objects, psycopg2 takes your hand-crafted SQL queries and executes them against the database. In other words, SQLAlchemy is an ORM (Object-Relational Mapper) and psycopg2 is a database driver for PostgreSQL.
Create the database and its tables
We’ll be creating a dead simple housing database that consists of two tables: “person” and “house”.
Each house has an address, and multiple people can live in the same house (a one-to-many relationship: many people can live in a single house). This relationship is realized by using id as the primary key of “house” and house_id as the foreign key of “person”.
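Since the script itself is not embedded in this excerpt, here is a sketch of what the two CREATE TABLE statements might look like. Only id, address and house_id are described above; the other columns (a person’s name) are assumptions and the gist may use different ones.

```python
# Sketch of the two CREATE TABLE statements for the housing database.
# Non-key columns are assumptions -- the gist may differ.
CREATE_HOUSE_SQL = """
    CREATE TABLE house (
        id SERIAL PRIMARY KEY,
        address TEXT NOT NULL
    );
"""

CREATE_PERSON_SQL = """
    CREATE TABLE person (
        id SERIAL PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        house_id INTEGER REFERENCES house (id)  -- many people, one house
    );
"""
```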
Now let’s see the complete script for the creation of the database and its tables, followed by an explanation of the code.
(in case you are not familiar with the
if __name__ == "__main__" condition, it checks whether the script is being run directly, rather than imported by another module)
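As a quick illustration of that idiom (the file name demo.py is just for the example):

```python
# demo.py -- a minimal illustration of the __main__ guard
def main():
    print("running as a script")

if __name__ == "__main__":
    # This branch runs only when the file is executed directly
    # (e.g. `python demo.py`), not when it is imported as a module.
    main()
```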
In essence, this script does the following:
- Loads database connection information from a .ini file;
- Connects to PostgreSQL;
- Creates the “houses” database;
- Connects to the newly-created database;
- Creates the “house” table; and
- Creates the “person” table.
The contents of the .ini file are just the variables needed for the database connection.
Think of it as storing API keys and other sensitive information in environment variables instead of hard-coding them in the script. Though, since this is a local database, it’s fine to show you my credentials.
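Because the gist is not reproduced here, this is a minimal sketch of such a file and of a load_connection_info helper built on the standard-library configparser. The [postgresql] section name and the exact keys are assumptions:

```python
import configparser

# database.ini (hypothetical contents):
# [postgresql]
# host = localhost
# database = houses
# user = postgres
# password = secret

def load_connection_info(ini_path: str) -> dict:
    """Parse the .ini file and return the connection settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read(ini_path)
    # dict(...) turns the section proxy into a plain dictionary, ready to
    # be unpacked into psycopg2.connect(**connection_info).
    return dict(parser["postgresql"])
```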
After loading this information with the
load_connection_info function as a dictionary (line 58 of the code gist), we connect to PostgreSQL. Because the database does not exist yet, we connect to the engine itself. The creation is handled by the
create_db function (lines 16 to 36).
psycopg2.connect returns a connection between Python and PostgreSQL, from which we create a cursor. Cursors are what actually execute SQL statements against the PostgreSQL database.
After that, still inside
create_db, we execute the database creation query by passing it a string with the proper SQL code. This is wrapped in a try/except block in case something goes wrong. Usually we first execute the query and afterwards commit it to the database, but “CREATE DATABASE” statements cannot run inside a transaction block, hence setting the connection to autocommit mode so the commit is automatic.
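The gist itself is not embedded in this excerpt, so here is a hedged sketch of what create_db might look like. The signature (taking an already-open connection and a database name) is an assumption, but the autocommit detail is real: PostgreSQL refuses to run CREATE DATABASE inside a transaction block.

```python
def create_db(connection, db_name: str) -> None:
    """Create a database using an already-open psycopg2 connection.

    CREATE DATABASE cannot run inside a transaction block, so the
    connection is switched to autocommit mode first.
    """
    connection.autocommit = True  # every statement is committed immediately
    cursor = connection.cursor()
    try:
        cursor.execute(f"CREATE DATABASE {db_name};")
    except Exception as error:
        print(f"Could not create database {db_name}: {error}")
    finally:
        cursor.close()
```

Because the connection is passed in, the function can be exercised without a live server (for example with a mock connection in tests).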
Okay, the “houses” database is created, the next step is to create the “house” and “person” tables. First connect to the newly-created database on line 64 and create a new cursor on line 65. Instead of passing each argument separately (host, database, user and password), we use the
** operator to unpack each key-value pair into its own keyword argument.
Then, we create the “house” and “person” tables. We write the necessary SQL queries and call the
create_table function on lines 63 to 74 and 77 to 85, respectively.
create_table simply executes and commits the queries, wrapped in a try/except/else block. If nothing bad happens, the changes are committed to the database inside the else block; otherwise, the exception and the query are output in the except block. In case an exception is raised, we also roll back any changes that were not committed.
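Sketched under the same assumption (an already-open connection passed in), create_table might look like this; it showcases the execute/commit/rollback pattern described above:

```python
def create_table(connection, query: str) -> None:
    """Execute a DDL query, committing on success, rolling back on failure."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
    except Exception as error:
        connection.rollback()          # undo any uncommitted changes
        print(f"Query failed: {error}\n{query}")
    else:
        connection.commit()            # persist the changes
    finally:
        cursor.close()
```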
At the very end of the script we close all active connections and cursors.
Insert data into the database
At this point all the infrastructure is set up and we can move on to inserting data.
Our goal is to insert pandas DataFrames in the database. However, even if we are only inserting a handful of rows in each table, the way the
insert_data function is written allows for inserting DataFrames with hundreds of rows in the database. This is achieved by using
execute_values instead of the basic
execute function, allowing for insertion in batches instead of a single gigantic query. No matter the number of rows in the DataFrame,
execute_values will only insert 100 rows at a time, the
page_size value specified.
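A sketch of how insert_data might use execute_values. The rows_to_tuples helper is hypothetical, but the %s placeholder and the page_size argument are how psycopg2.extras.execute_values actually works:

```python
def rows_to_tuples(rows):
    """Normalize rows (e.g. df.itertuples(index=False)) into the list of
    plain tuples that execute_values expects."""
    return [tuple(row) for row in rows]

def insert_data(connection, query: str, rows, page_size: int = 100) -> None:
    """Batch-insert rows; `query` must contain a single %s placeholder,
    e.g. "INSERT INTO house (address) VALUES %s"."""
    from psycopg2.extras import execute_values
    cursor = connection.cursor()
    try:
        # execute_values substitutes batches of tuples into %s,
        # page_size rows per round trip
        execute_values(cursor, query, rows_to_tuples(rows), page_size=page_size)
    except Exception as error:
        connection.rollback()
        print(f"Insert failed: {error}\n{query}")
    else:
        connection.commit()
    finally:
        cursor.close()
```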
Okay, but let me give you a bullet point summary of the steps taken:
- Connect to the database;
- Create a DataFrame for the “house” data;
- Insert the “house” data in its table;
- Create a DataFrame for the “person” data; and
- Insert the “person” data in its table.
And there’s not much more to it. The basic logic when working with psycopg2 lies in the execute/commit/rollback trio of actions. You first execute a SQL query and if everything goes well you commit the changes, otherwise you rollback the changes. Whatever you’re trying to do, your go-to actions are execute, commit and rollback.
Just two notes to take into consideration for insertions:
- execute_values only accepts data as tuples, hence the transformation on line 15; and
- The query string must contain a single %s placeholder, so that
execute_values can substitute the tuples of data into that query string (lines 54 and 63). If you are not familiar with this syntax for string replacement, check out this article to learn more.
Finally, we can write some SELECT queries to extract data!
I mentioned execute, commit and rollback, but there is another common function used when working with psycopg2: fetch. When the query executed returns some data, fetch is how we get that data. And just like execute has variations, fetch has some too. We can use
fetchone to get the next row of data,
fetchall to get all rows at once or
fetchmany to get a batch of rows at a time.
In this case we are using
fetchmany. Again, each table has only a handful of rows, but I wrote the code in a way that is easy for you to refer back to in the future if needed. Just like with
execute_values, by using
fetchmany you can balance the memory costs of working with large amounts of data and performance of the code.
And just as we inserted data from DataFrames, now we want to extract data into DataFrames. For that, we start by creating an empty DataFrame with only its column names specified. This is important because these names must match the names of the columns in the query results.
Inside get_data_from_db (lines 28 to 58), we execute the SQL query and keep fetching the next 100 rows for as long as there are rows left. Of course, in this case we only fetch once per query. Each row of a fetched batch is saved as a dictionary mapping column names to values, and the resulting list of dictionaries is appended to the DataFrame, so the data returned is always accumulated there. Once fetch has gone past the last row of data, it returns an empty list; at that point we stop fetching by breaking out of the loop.
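The batch-fetching core of that logic can be sketched without pandas or a live database; the function and column names here are hypothetical stand-ins (the gist then builds the final DataFrame from dicts like these):

```python
def fetch_all_in_batches(cursor, columns, batch_size: int = 100):
    """Drain an executed query batch_size rows at a time, returning a list
    of dicts that map column names to values (one dict per row)."""
    records = []
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # fetchmany returns [] once the rows run out
            break
        records.extend(dict(zip(columns, row)) for row in batch)
    return records
```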
At this point the complexity depends on your SQL code rather than on Python and psycopg2. For instance, the first two SELECT queries simply return all data from the two tables available, but the third query joins both tables to return the names of the people and the address of their house. Still an easy query, but slightly more complicated than a plain SELECT over a single table.
Lastly, here are the DataFrames of extracted data:
Overall, I think the code in this demo shows a good degree of separation between the SQL and the Python code, while staying flexible enough. Plus, this code should be more than enough to get you started in integrating PostgreSQL operations in your Python code.
To recap, this demo went through the following database operations in Python (using the psycopg2 driver):
- Creating a database;
- Creating tables;
- Inserting data (from pandas DataFrames);
- Extracting data (into pandas DataFrames); and
- Batch insertion and extraction.
If you also want to look into using an ORM as an alternative to a database driver, I recommend starting with this introductory article about SQLAlchemy. You can also read the official documentation for more about SQLAlchemy and PostgreSQL here.
Lastly, all code for this demo is available on GitHub here.