Selecting from an Oracle Database table in parallel using Python

A spate of questions came up on the interweb about fetching all the data from a large Oracle Database table by parallelizing the read across multiple Python threads.

Christopher Jones
Oracle Developers
3 min readFeb 14, 2023

--

Photo by Martin Sanchez on Unsplash

The typical queries in the questions were like:

select data
from mytable
order by id
offset :rowoffset rows fetch next :maxrows rows only

Where each thread was tasked to select and process a different range.

One common problem was that users were trying to execute these statements using a single connection. This isn’t going to work or help. Each Oracle connection can only do one database operation at a time, so there would be no parallelization.

Here’s a demo I threw together that actually does use multiple connections:

You can see I used a connection pool so each thread has its own connection. [Update: you might now be able to code this more elegantly with the new python-oracledb asyncio support.]

The demo has two alternative functions for processing data. The do_query() function just prints to the terminal, while do_write_csv() writes a CSV file per thread. You can modify the code depending which you want to play with. Both functions use fetchmany() to get batches of rows. I did this instead of callingfetchall() because it’s quite possible that a ‘real’ application would want each thread to be streaming batches of fetched rows somewhere instead of holding everything in memory.

The key tuning parameter for python-oracledb itself is the arraysize setting. The value of this can greatly affect how fast data is transferred from the database to Python. I didn’t change prefetchrows because for large fetches I tend to keep memory usage low; also the time benefit may be small — if there is any. See Tuning Fetch Performance in the documentation. Note that the default number of rows fetched by fetchmany() is the value of arraysize. However if you explicitly pass a size like fetchmany(1000) then the arraysize and related buffer are not changed so this won’t help tuning the internal network transfer. All it does is change the number of rows returned to the application by each call.

The demo is pretty rudimentary but suffices to show the concept of parallelization. And it shows there are a lot of factors: the number of threads, how many rows each thread should handle (i.e. how the data should be partitioned), how big to set arraysize, how many rows to fetch with fetchmany() in each iteration.

Overall the implementation raises more questions than it answers. And the performance answers are ‘it depends’. Is the data in the database spread across multiple spindles or is the one disk having to seek? Is the table partitioned? Are zone maps being used? Is Exadata with storage indexes being used? How does Python’s GIL impact the parallelization? (Maybe multiple Python processes should be used instead of threads?) What is the app doing with the data — can the receiving end efficiently process it? How heavily loaded is the DB? — the parallel solution may appear to be fast but it could be inefficient, thereby impacting, or eventually being limited by, everyone else. The OFFSET FETCH syntax will still cause table blocks to be scanned even though not all data is returned. Maybe the real performance bottleneck can’t be solved by parallelism: perhaps you have function based indexes that are being invoked for every row.

And so just how well does my demo work compared with a single query in a single thread? Well that is an exercise for you, with your environment and table and spindles and partitioning and memory and Python version and ….

Let me know what you find.

Thanks to my colleague Nigel Bayliss for raising some of these points with me (Update: the question was raised again. Nigel posted a direct reply which you can see here).

If you’re curious about the goings-on of Oracle Developers in their natural habitat, come join us on our public Slack channel!

If you haven’t already done so, you can sign up for an Oracle Cloud Free Tier account today.

--

--

Christopher Jones
Oracle Developers

Oracle Database Product Manager for language drivers including Python python-oracledb, Node.js node-oracledb, PHP OCI8 and more! On Mastodon: @cjbj@phpc.social