Postgres — Parallel jobs with Python

Virender Singla · Published in Nerd For Tech · Apr 12, 2021

Recently our team received multiple notifications from AWS about the deprecation of Postgres version 9.6.x on our Aurora Postgres clusters. We must upgrade these databases to a higher major version (Aurora Postgres v10.x.x).

One important caveat: Postgres does not carry over optimizer statistics during a major version upgrade. That means a critical query may suffer a plan flip, causing application latency to rise and hampering overall database performance (CPU spikes, etc.).
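After the upgrade, it is easy to confirm that the statistics are missing, for example by checking when tables were last analyzed (the query below is illustrative; connection flags depend on your setup):

psql -h pgdb-pg.db -d pgdb -U useradmin -c "SELECT schemaname, relname, last_analyze FROM pg_stat_user_tables ORDER BY last_analyze NULLS FIRST LIMIT 10;"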

The obvious remedy is to analyze the tables in parallel right after the upgrade (within the downtime window). To make this step smoother, I created a small Python utility that takes a file containing all the queries we want to run and executes them in parallel.
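The analyze.sql input file itself can be generated straight from the catalog; one possible way (not part of the original utility):

psql -h pgdb-pg.db -d pgdb -U useradmin -At -c "SELECT format('ANALYZE %I.%I;', schemaname, relname) FROM pg_stat_user_tables" > analyze.sql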

Usage:

python3 pg_parallel_sql.py -e pgdb-pg.db -d pgdb -po 5200 -u useradmin -pa 48 -f analyze.sql -eoe

-h, --help    : show this help message and exit
-e, --endpoint: endpoint identifier of the Postgres cluster [ Mandatory ]
-d, --database: Postgres database name [ Mandatory ]
-po, --port   : Postgres port number [ Optional, Default: 8192 ]
-u, --user    : username for authentication [ Mandatory ]
-pa, --parallel: number of parallel processes to run the commands
                [ Optional, Range: 1-96 ]
-f, --file    : file containing all the queries [ Mandatory ]
                Queries should be terminated by ";"
-eoe, --exitonerror: exit on the first SQL error, otherwise continue
                [ Optional, Default: False ]

Output:

2021-04-12 16:56:25,969 INFO:Schemas by Table count
+--------------------+------------+
| SchemaName         | TableCount |
+--------------------+------------+
| aws_oracle_ext     |          7 |
| information_schema |          7 |
| public             |         23 |
| pg_catalog         |         55 |
| appschema          |        151 |
| admin              |        259 |
+--------------------+------------+
2021-04-12 16:56:29,538 INFO:Processing Queries in file analyze.sql
100%|█████████████████████████████| 145/145 [00:54<00:00, 2.68it/s]
2021-04-12 16:57:24,201 INFO: 0 error(s) in log file /tmp/pg_parallel_sql.log
2021-04-12 16:57:24,202 INFO:Total execution time: 0:00:58.232558
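In essence, the script splits the input file on ";" and hands the statements to a pool of worker processes. A minimal sketch of that idea, assuming psycopg2 for connections and concurrent.futures for the process pool (argument names are simplified; this is a reconstruction, not the full script):

import argparse
from concurrent.futures import ProcessPoolExecutor, as_completed

import psycopg2


def run_query(dsn, query):
    """Run one statement on its own connection; return an error string or None."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # ANALYZE and friends should not share one transaction
    try:
        with conn.cursor() as cur:
            cur.execute(query)
        return None
    except Exception as exc:
        return f"{query[:50]}... -> {exc}"
    finally:
        conn.close()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True)
    parser.add_argument("-pa", "--parallel", type=int, default=8)
    parser.add_argument("--dsn", required=True)  # e.g. "host=... dbname=... user=..."
    args = parser.parse_args()

    # Queries are terminated by ";" — split the whole file on it.
    with open(args.file) as fh:
        queries = [q.strip() for q in fh.read().split(";") if q.strip()]

    errors = 0
    with ProcessPoolExecutor(max_workers=args.parallel) as pool:
        futures = [pool.submit(run_query, args.dsn, q) for q in queries]
        for fut in as_completed(futures):
            err = fut.result()
            if err is not None:
                errors += 1
                print("ERROR:", err)
    print(f"{errors} error(s)")


if __name__ == "__main__":
    main()

A production version would add the progress bar (tqdm), proper logging to a file, and the exit-on-error behavior shown in the usage above.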

Update:

The above Python script is handy for running a task in parallel and showing progress, though there are other ways to do the same job.

  1. vacuumdb utility: This is a nice client-side utility to run vacuum/analyze commands cluster-wide. It supports the “--analyze-in-stages” and “--jobs” parameters to run analyze commands incrementally and in parallel (example below).
  2. Another interesting and much easier way to run commands in parallel is to use the “xargs” command:
xargs -d "\n" -n 1 -P 20 psql database_name username -c < list_of_commands.txt
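For reference, a vacuumdb run equivalent to the analyze use case above might look like this (host, database, user, and job count are placeholders):

vacuumdb -h pgdb-pg.db -d pgdb -U useradmin --analyze-in-stages --jobs 8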

Ref: https://markandruth.co.uk/2016/05/26/running-lots-of-postgres-commands-in-parallel

Closing Notes: This is a basic, quick script to do the job, not a very efficient one. It can be improved further, for example by using connection pooling for database connections, reading the file in chunks to avoid memory issues, and perhaps adding a better logger.
