Parallelizing queries in PostgreSQL with Python
At Geoblink we run more than 20,000 queries to generate just one of our several ~100 GB PostgreSQL databases from scratch from our raw data files. Running them in sequential order would take far too long, so we parallelize many of the queries. Doing that by hand would be impossible, so we use a script written in Python to generate and run them.
In this post I will show an example of how to do it in Python.
As an example, we will create a table and fill it with 20 million random numbers.
First we create the table:
DROP TABLE IF EXISTS public.test;
CREATE TABLE public.test( value numeric );
And then we fill it:
INSERT INTO test (value)
SELECT random()
FROM generate_series(1, 20000000);
On my machine (Intel(R) Core(TM) i7–4790 CPU @ 3.60GHz) it takes about 64 seconds.
Now let’s try to parallelize the query. We will use the Python multiprocessing package to parallelize the processes and psycopg2 to connect to the database.
First we define the function to run a query:
def runQuery(query):
    connect_text = "dbname='%s' user='%s' host=%s port=%s password='%s'" % (dbname, user, host, port, password)
    con = psycopg2.connect(connect_text)
    cur = con.cursor()
    cur.execute(query)
    con.commit()
With this function we can run any query we want that produces no output, such as creating tables, updating them, etc.
The next step is to create an array of queries.
queries = []
for i in range(N_job):
    query = """
    INSERT INTO test (value)
    SELECT random()
    FROM generate_series(1, %d);
    """ % (N_random // N_job)
    queries.append(query)
Where N_job is the number of queries that we want to execute. With this, we have created an array of independent queries that we can execute in parallel.
Now we need to create a pool with the number of CPUs that we want to use, 4 in our case.
CPUS = 4
pool = multiprocessing.Pool(CPUS)
Finally we execute the queries using the imap_unordered method. It takes a function and an iterable and applies the function to each element, yielding results in whatever order they complete.
for _ in pool.imap_unordered(runQuery, queries):
    pass
With this method the calculation took only 30 seconds using 4 cores, an improvement of more than 50 % with respect to running a single query. We probably didn’t get a bigger improvement because of the disk read/write speed.
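Putting the pieces above together, here is a minimal end-to-end sketch. The connection settings, workload sizes and the buildQueries helper are hypothetical placeholders, not the exact script we use at Geoblink; adjust them to your own setup.

```python
import multiprocessing

# Hypothetical connection string and sizes; adjust to your setup.
DSN = "dbname='test' user='postgres' host=localhost port=5432 password='secret'"
N_RANDOM = 20000000  # total rows to insert
N_JOB = 20           # number of independent queries
CPUS = 4             # worker processes

def buildQueries(n_random, n_job):
    # One INSERT per job, each generating an equal share of the rows.
    return [
        "INSERT INTO test (value) SELECT random() "
        "FROM generate_series(1, %d);" % (n_random // n_job)
        for _ in range(n_job)
    ]

def runQuery(query):
    import psycopg2  # imported here so each worker opens its own connection
    con = psycopg2.connect(DSN)
    cur = con.cursor()
    cur.execute(query)
    con.commit()
    cur.close()
    con.close()

if __name__ == "__main__":
    queries = buildQueries(N_RANDOM, N_JOB)
    pool = multiprocessing.Pool(CPUS)
    # imap_unordered yields as each worker finishes; we only need the side effects.
    for _ in pool.imap_unordered(runQuery, queries):
        pass
    pool.close()
    pool.join()
```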
The more cores available, the more advantage we can take of parallelizing queries. A good idea is to partition the work using existing indexed columns, such as provinces, municipalities or distances, as we do when we create our databases at Geoblink.
But there is a point where we have to play with the rows-per-job ratio: we want neither millions of rows per query nor thousands of queries.
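As an illustration of partitioning by an indexed column, the sketch below builds one query per province. The table and column names ("stores", "province", "score") are hypothetical examples, not part of our actual schema.

```python
# A sketch of partitioning work by an indexed column. The table and
# column names ("stores", "province", "score") are hypothetical.
provinces = ["Madrid", "Barcelona", "Valencia", "Sevilla"]

def buildPartitionedQueries(provinces):
    # One UPDATE per province; an index on "province" lets each query
    # touch only its own slice of the table. The province list is
    # trusted here, so plain string formatting is acceptable.
    return [
        "UPDATE stores SET score = score * 2 WHERE province = '%s';" % p
        for p in provinces
    ]

queries = buildPartitionedQueries(provinces)
# Each query can now be handed to pool.imap_unordered(runQuery, queries).
```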
The final conclusion is that, although parallelizing queries can yield a great performance improvement, we always need to keep a balance between the number of queries we run and the number of rows per query.
By Vicente "Tito" Lacuesta