Parallelisation In Python — An Alternative Approach

Published in

idealo Tech Blog

6 min readJul 4, 2018

One of the biggest, ever-present banes of a data scientist’s life is the constant wait for the data processing code to finish executing. The problem with slow code affects almost every step of a typical data science pipeline: data collection, data pre-processing/parsing, feature engineering, etc. Many times, the gigantic execution times even end up making the project infeasible and more often than desired, forces a data scientist to work with only a subset of data, depriving him/her of the insights and performance improvements that could be obtained with a larger dataset.

One of the tools that could mitigate this problem and speed up data science pipelines is parallelisation through multiprocessing or multithreading.
In this blogpost, I am sharing my experience of solving a data collection problem by using an approach that utilizes multiprocessing with python. The main achievement of this exercise was the development of a general approach for parallelisation wherein a code template was created and used. Using the approach, I was able to reduce the execution time from several days to under an hour for a practical problem discussed in this blog. The same code template has been tested for other use cases and has proven to be effective.

The Vanilla approach to parallelisation

This is the usual way parallelisation is applied in python.

Get a list of all items for which the same task needs to be performed independent of other items.
Workers execute in parallel and each worker applies the task logic to a single item in this list and obtains the result for this specific item.
After a worker finishes the task on a single item, it picks up another item from the global list of items and applies the task logic to it.
The process goes on until the list of items gets exhausted.

The Splitting approach to parallelisation (New)

This is the new parallelisation approach.

Get a list of all items for which the same task needs to be performed independent of other items.
Divide the list of items into sublists.
Pass one sublist to each worker/thread so that each worker/thread iterates over its own sublist in a serial manner and performs the required task.
The workers themselves execute in parallel. Thus, even though each worker is performing the task in a serial manner, the workers are working simultaneously on their own sublist of tasks and hence, parallelisation is achieved as the net effect. Therefore, the approach is actually a mix of parallel and serial processing.

In contrast, the Serial approach works as follows:

Get a list of all items for which the same task needs to be performed.
Pick up an item in the list and execute the task for this item.
Pick up the next item in the list. The process goes on until the list gets exhausted.

Problem Description

For one of my recent projects, I was faced with the problem of downloading a huge number of emails from a mail server (~3 million). An attempt to download the mails serially resulted in 100k emails being downloaded after 5 days of run time. The email download time clearly needed to be reduced for which parallelisation seemed like an obvious candidate.

For the purpose of code demonstration, I am downloading emails from a public imap (created a fake email id on yandex and filled it with 1200 emails) using Multiprocessing with the splitting approach. Later on, the same task is going to be performed with all the variants described above as well (code not shown for these in the post). My machine is a MacBook Pro with 2.8 GHz i7 quad core processor.

Multiprocessed splitting

Get a list of all mail ids present in the mail box after establishing connection with the imap server.
Select the number of multiprocessing workers (I generally set this to twice the number of logical cores on my machine).
Divide the list of mail-ids into sublists.
Launch multiple processes where each process corresponds to a worker and has a sublist of items it must operate upon. Each process establishes an independent connection with the imap server and calls a function that performs the download.
Each worker downloads the mails serially using the sublist of mail-ids it possesses.

Let us develop the logic for each of these steps:

1. Get a list of all mail ids present in the mail box after establishing a single connection with the imap server.

2. Select the number of multiprocessing workers (I generally set this to twice the number of logical cores on my machine).

3. Divide the list of mail-ids into sublists.

4. Launch multiple processes where each process corresponds to a worker and has a sublist of items it must operate upon. Each process establishes an independent connection with the imap server and calls a function that performs the download.

The get_mails function launches a pool of workers and passes a list of sublists to mp_process function such that each worker gets one sublist of mail-ids to process. Virtually, each worker is executing its own copy of the mp_process function with its own specific sublist of mail-ids that it needs to download. All the workers execute in a parallel fashion with no dependence on each other.

5. Each worker downloads the mails serially using the sublist of mail-ids it possesses.

This function iterates over the sublist of mail-ids it receives from its calling process and begins downloading the sublist serially.

The complete code can be found in the appendix.

The general template that can be reused and adapted to various tasks is:

EXPERIMENTS

Several methods were compared in terms of execution time:

Serial download (Serial).
Multiprocessing download using the vanilla approach (MP-V).
Multiprocessing download using the splitting approach (code shown above) (MP-Sp).
Multithreaded download using the vanilla approach (Mth-V).
Multithreaded download using the splitting approach (Mth-Sp).

The number of threads/workers were also varied across different runs of multithreading/multiprocessing approaches (X-axis of the graph below).

RESULTS

The following conclusions can be drawn from the experiments:

The splitting approach always outperforms the vanilla approach and the serial approach regardless of the number of workers/threads.
The vanilla approach performs worse than the serial approach if the number of threads/workers are less than 8.
There is no real difference between the performance of multithreading and multiprocessing.
The vanilla approach tends to give better gains in terms of execution time as the number of workers/threads increase compared to the splitting approach. (Further experiments were carried out so that the vanilla approach uses more workers/threads, but the splitting approach was always better even when it used fewer workers/threads.)

Therefore, using a small number of workers/threads with the splitting approach ensures the quick completion of a task while only occupying a small amount of resources.

SUMMARY

An alternative approach of splitting item list into sublists for achieving better parallelisation in python was introduced and shown to be better than the usual approach wherein each item in the list is picked and processed by workers/threads. A practical example was used to demonstrate the performance improvement with the use of the splitting approach. Furthermore, a template to use this approach was developed which makes it possible to apply it regardless of the nature of the underlying task.

APPENDIX

ADDITIONAL QUIRKS

The experiments were carried out on a rather small set of data (only 1200 emails with 2KB per email). It is possible that vanilla approach could outperform the splitting approach for much larger datasets (Splitting approach seemed promising with the initial experimentation, so I ended up choosing it for my final task which involved downloading hundreds of Gigabytes of data across 3 million mails).
Multiprocessing vs. Multithreading: As can be seen from the results, these two approaches performed similarly. However, as far as python is concerned, Global Interpreter Lock (GIL) usually ends up hampering the multithreading performance and it is usually difficult to ascertain the usability of multithreading for a given use case in python. Multiprocessing, on the other hand, suffers from no such drawbacks and hence, I tend to prefer multiprocessing over multithreading in most cases. But then again, multiprocessing eats up a lot more resources than multithreading, so there’s usually a trade-off between multiprocessing and multithreading when it comes to python. (I chose multiprocessing for my final task).
Several runs were done for each experiment. During one of the runs, a strange behaviour was observed with the vanilla approach (for both multithreading and multiprocessing) — About 95% of tasks got executed rather quickly, but the last 5% took an extremely long time to finish. This issue wasn’t reproduced.

The results are largely empirical in nature and a satisfactory explanation of why splitting approach outperforms the vanilla approach still evades me. I would be happy to hear any ideas on this.