Understanding PostgreSQL Cursors with Python

Naren Yellavula
Published in Dev bits · 9 min read · Jan 10, 2021


Have you ever wondered why database cursors exist? There are plenty of instances where a developer needs to load a massive SQL query response into their application. Ex: fetching a million records for processing.

What if the application’s memory (an ECS task, an EC2 instance, or a VM) cannot hold that million-record SQL response, and the whole application crashes? By the end of this article, you’ll know the solution. As a developer, you should know cursors well to optimize your application.

Before you immerse yourself in the article: if you are new to programming or want to learn Python in person, please check out:

https://happy-pythonist.com

Why do we need cursors?

There are generally two ways an application can handle vast amounts of SQL query results:

  1. Set the application’s memory high (thereby driving up costs)
  2. Use database cursors to throttle the data to manageable limits

The first way is not very attractive, for two reasons:

  1. There is no guarantee that increasing memory will future-proof your solution
  2. If you pre-allocate huge memory for your application anticipating vast amounts of data in the future, you may end up under-utilizing resources

It is therefore discouraged to keep increasing memory whenever your application crashes into its memory limits.

Now coming to the second way: database cursors. People who have worked with PostgreSQL and Python may have already used an object called `cursor` in their code. There is much more a cursor can do apart from merely holding the data of a SQL query result.

A cursor can also paginate the results of a SQL query, move forward and backward across rows, and so on. Simply put, it enables a developer to design a solution that can work with vast amounts of data without any problem.
In this article, we use PostgreSQL and the psycopg2 driver to illustrate the different capabilities of cursors.

Preparing the environment

For brevity, we use Docker and docker-compose to set up the app and the PostgreSQL database quickly. You can also use a manually installed database and a virtual environment instead.

We use a Dockerfile, a docker-compose file, and an app folder, all in the project root. We also need a sample database backup called `dvdrental.tar`, which we can download from here: https://www.postgresqltutorial.com/postgresql-sample-database/

We plan to restore that DVD rental sample data into PostgreSQL once docker-compose brings the database service up. We need to make the backup dvdrental.tar available to the database container, so let us put it in a folder called `backup`.

The structure of the project looks something like this:

.
├── Dockerfile
├── app
│ ├── main.py
│ └── requirements.txt
├── backup
│ └── dvdrental.tar
└── docker-compose.yaml

Now, let us fill in the docker-compose file, starting from the official template: https://docs.docker.com/compose/

We are building two services: one for the app (even though it is just an executable container for our illustration) and one for the PostgreSQL database. I am choosing dvdrental as the database name. By specifying volumes, we mount the backup directory into the database container at /tmp.
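A minimal sketch of what that file could look like; the postgres:13 image tag, credentials, and port mapping are illustrative choices, while the service names, database name, and volume mount match what the rest of the article relies on:

# ./docker-compose.yaml
version: "3.8"
services:
  app:
    build: .
    depends_on:
      - db
  db:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD: postgres   # illustrative credentials
      POSTGRES_DB: dvdrental        # creates the dvdrental database
    ports:
      - "5432:5432"
    volumes:
      - ./backup:/tmp               # make dvdrental.tar visible at /tmp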

Our core logic lies in the `main.py` file. Let us write a Dockerfile that uses the python:3.8-slim image and runs main.py on start.
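A minimal Dockerfile along those lines, assuming the project layout shown above:

# ./Dockerfile
FROM python:3.8-slim

WORKDIR /usr/src/app
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ .

CMD ["python", "main.py"]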

Now, let us start the database and restore the DVD rental data. For that, run the docker-compose up command from your favourite terminal or shell.

$ docker-compose up -d db

The -d option starts the container in detached mode (as a background daemon).

It starts the PostgreSQL database container and creates a database called `dvdrental`. Now we can restore data from the backup file by entering the database container like this:

$ docker-compose exec db /bin/bash
# You are in the container now
root@4db182497566:/# pg_restore -U postgres -d dvdrental /tmp/dvdrental.tar

Now the database and data are ready. We need psycopg2-binary, so let us add it as a requirement to our project.

# ./app/requirements.txt
psycopg2-binary==2.8.6

PostgreSQL supports two types of database cursors:

  1. Client-side cursors
  2. Server-side cursors

Client-side Cursor

Whenever someone connects to PostgreSQL using the psycopg2 Python driver, they create a connection object followed by a cursor object. That cursor object acts as a client-side cursor. It is a simple iterator that pre-fetches all the rows of a given SELECT SQL query. Because it fetches everything at once, with millions of rows this object gets clunky and can sometimes crash the application.

We want to query actors from the actor table whose first name starts with “John”.
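A minimal version of that program, assuming the docker-compose setup above (hostname db, user postgres, and an illustrative password):

# ./app/main.py
import psycopg2

# connection details match the db service from docker-compose
conn = psycopg2.connect(
    host="db",
    dbname="dvdrental",
    user="postgres",
    password="postgres",
)

cursor = conn.cursor()  # no name passed => client-side cursor
cursor.execute("SELECT * FROM actor WHERE first_name LIKE 'John%'")
records = cursor.fetchall()  # pulls all matching rows into memory at once
print("Fetched records:", records)

cursor.close()
conn.close()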

Here, we connect to the PostgreSQL database container using the hostname db and create a connection object. Next, we make a standard cursor from the successful connection. We can execute as many queries as we like on this cursor object. This cursor is a client-side cursor for PostgreSQL. As we already mentioned, this type of cursor pre-fetches all available rows for a SQL query.

A cursor’s execute method saves the query result in the cursor. We use the fetchall method to retrieve all rows into a list. To run the program, type these commands from another terminal:

$ docker-compose build app
$ docker-compose run app

You should see several actor rows printed to standard output.

Fetched records: [(5, 'Johnny', 'Lollobrigida', datetime.datetime(2013, 5, 26, 14, 47, 57, 620000)), (40, 'Johnny', 'Cage', datetime.datetime(2013, 5, 26, 14, 47, 57, 620000)), (192, 'John', 'Suvari', datetime.datetime(2013, 5, 26, 14, 47, 57, 620000))]

There are three essential methods for plucking out data from a Psycopg2 cursor:

  1. fetchone
  2. fetchmany
  3. fetchall

fetchone

This method gets one row from the query result. Each subsequent call returns the next row, until no rows are left, at which point it returns None.
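A rough sketch, assuming the conn object from the earlier example and Python 3.8+ for the assignment expression (PEP 572):

cursor = conn.cursor()
cursor.execute("SELECT * FROM actor WHERE first_name LIKE 'John%'")
# fetchone returns None once the result set is exhausted
while (row := cursor.fetchone()) is not None:
    print(row)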

fetchmany

This method fetches a given number of rows from the query result, giving applications client-side pagination. It returns an empty list once the rows are exhausted, so one loops over calls to this method until the end of the result set.
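A sketch with an illustrative batch size of 50; note that on a client-side cursor this only paginates rows the driver has already fetched:

cursor = conn.cursor()
cursor.execute("SELECT * FROM actor")
while True:
    rows = cursor.fetchmany(size=50)  # at most 50 rows per call
    if not rows:                      # empty list => no rows left
        break
    for row in rows:
        print(row)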

fetchall

This method fetches all available rows as a list, or an empty list if no rows are found. We showed fetchall in the earlier main.py, fetching actors whose first_name starts with “John”.

There are two essential properties one should know while working with cursors:

query

After executing a SQL query, you can read back the exact query that was sent to the database, with parameter values already escaped and interpolated. This can be handy for debugging applications.
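For example (psycopg2 exposes the last executed query as bytes):

cursor.execute("SELECT * FROM actor WHERE first_name LIKE %s", ("John%",))
print(cursor.query)
# b"SELECT * FROM actor WHERE first_name LIKE 'John%'"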

name

The name property tells you whether a cursor object is client-side or server-side. In main.py, we didn’t pass any name while instantiating the cursor object, which makes it a client-side cursor.
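A quick check of both kinds:

print(conn.cursor().name)         # None => client-side cursor
print(conn.cursor('actor').name)  # 'actor' => server-side cursor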

An overview of the methods and properties covered so far (M — Method, P — Property):

  execute — M
  fetchone — M
  fetchmany — M
  fetchall — M
  query — P
  name — P

Server-side Cursor

The server-side cursor is a cursor used to throttle data. For example, an application could manually issue multiple queries to the database, fetching data for a paginated list of ids. Server-side cursors provide a flexible way to do exactly that, and PostgreSQL supports them out of the box. It is like a supply chain where the client requests a certain number of rows up front and makes further requests to fetch more on demand. All of it happens through one cursor, in an abstracted way.

What should one do to make a cursor server-side? Just pass a name into the cursor constructor, and PostgreSQL treats it as a server-side cursor.
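A sketch, reusing the connection object from before:

cursor = conn.cursor('actor')  # the name makes it a server-side cursor
cursor.execute("SELECT * FROM actor WHERE first_name LIKE 'John%'")
print(cursor.fetchall())       # issues FETCH FORWARD ALL on the server
cursor.close()                 # issues CLOSE "actor"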

In the above code snippet, we instantiated the cursor with the name actor.

What exactly happens on the database side with the name passed in the cursor? Let us see the database logs.

When I do docker logs <db_container>, I can see exactly how the database reacts to the cursor from Psycopg2.

db_1 LOG:  statement: BEGIN
db_1 LOG: statement: DECLARE "actor" CURSOR WITHOUT HOLD FOR SELECT * FROM actor WHERE first_name LIKE 'John%'
db_1 LOG: statement: FETCH FORWARD ALL FROM "actor"
db_1 LOG: statement: CLOSE "actor"
db_1 LOG: statement: COMMIT

PostgreSQL declares a cursor for the given SQL query. Think of it like creating an index on a table. After the cursor is declared in the database, the server uses FETCH queries to forward data to the client. You can even test this yourself in psql.
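For example, in a psql session on the dvdrental database (cursor name and batch size are illustrative):

BEGIN;
DECLARE actor_cur CURSOR FOR
    SELECT * FROM actor WHERE first_name LIKE 'John%';
FETCH FORWARD 2 FROM actor_cur;  -- first two matching rows
FETCH FORWARD 2 FROM actor_cur;  -- the next batch
CLOSE actor_cur;
COMMIT;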

In the above examples, there were only three rows. To show the real power of a server-side cursor, fetching a given number of rows as a batch, let us pick the film table, which has 1000 film records. Let us say we have a use case:

Ex: Forward only 200 records per batch from the server-side cursor, per request.

There are two ways of achieving the use case:

Way 1: Use `itersize` with a server-side cursor

Let us say we are only interested in fetching 200 movies at a time into the client and want to make multiple batch queries; we can do it like this:
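A sketch, again reusing the connection from before:

cursor = conn.cursor('film')  # server-side cursor named "film"
cursor.itersize = 200         # rows fetched per network round trip

cursor.execute("SELECT * FROM film")
for row in cursor:            # issues FETCH FORWARD 200 as needed
    print(row)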

cursor.itersize controls the maximum number of rows fetched per batch. If you add the above changes to main.py, build the app and run it like this:

$ docker-compose build app
$ docker-compose run app

You will now see 1000 rows printed to standard output. At the same time, you can also see the database logs forwarding 200 rows at a time.

db_1 LOG:  statement: BEGIN
db_1 LOG: statement: DECLARE "film" CURSOR WITHOUT HOLD FOR SELECT * FROM film
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: CLOSE "film"
db_1 LOG: statement: COMMIT

Between each “FETCH FORWARD 200” call there is a network round trip to the database. The client here made six requests: five fetches of 200 rows each, plus a final fetch that returns no rows and signals the end. So, if the data is small enough to flush in a single request, you can increase the cursor.itersize value; batches that are too small add round trips and can make the whole process very slow.

Way 2: Use `fetchmany` with a server-side cursor

We can achieve the same result as with the itersize property by using fetchmany on a server-side cursor, keeping the number of requests from client to database low.

We can implement the solution like this:
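A sketch mirroring Way 1, with fetchmany doing the batching:

cursor = conn.cursor('film')  # server-side cursor
cursor.execute("SELECT * FROM film")

while True:
    rows = cursor.fetchmany(size=200)  # one FETCH FORWARD 200 per call
    if not rows:
        break
    for row in rows:
        print(row)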

In this case too, the database logs show there are only six forward requests from client to server.

db_1 LOG:  statement: BEGIN
db_1 LOG: statement: DECLARE "film" CURSOR WITHOUT HOLD FOR SELECT * FROM film
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: FETCH FORWARD 200 FROM "film"
db_1 LOG: statement: CLOSE "film"

Both ways forward batches from the server in precisely the same manner, so what is the difference? The second approach (Way 2) hands us a whole batch of rows at once instead of yielding one row at a time, and we can process that big slice concurrently rather than working row by row.

Is there a silver bullet for picking the right cursor batch size & settings?

No, honestly not. But system usage metrics, observed over time, can hint at a decent batch size. Software development itself is iterative and incremental, and that should drive our next steps.

I follow this cycle of steps to come up with a solution that serves the purpose:

  1. Prepare an initial hypothesis for data retrieval (pick a batch size based on memory and data payload size)
  2. Set up cursors according to that hypothesis
  3. Measure the system (Is the application about to crash? Is the process slow due to too-extreme batching?)
  4. Adjust the batch size or memory
  5. Repeat steps 3 and 4

The code for this article can be found at:

References:

https://www.psycopg.org/docs/index.html
https://docs.docker.com/compose/
https://hub.docker.com/_/postgres
https://www.postgresql.org/docs/13/plpgsql-cursors.html
https://www.python.org/dev/peps/pep-0572/
https://www.docker.com/blog/containerized-python-development-part-1/
