An Introduction to Multiprocessing Using Python
Is multithreading or multiprocessing faster than the other?
Overview
In Python, multiprocessing and multithreading are used primarily to improve performance. Both help maximize the utilization of a system’s CPU and other resources. By distributing tasks across multiple threads or processes, they enable parallel execution, which can lead to significant performance improvements. This is particularly beneficial for computationally intensive applications or when handling large datasets.
This article is organized in the following order:
- Brief concepts of processes and threads, and the differences between them.
- Usage in Python.
- Performance comparison on compute-bound and IO-bound tasks.
- The fork and spawn start methods.
- Controlling visible CPU cores with psutil.
Process
A program is static: it is the data and instructions that need to be processed and executed. A process is the actual program loaded into memory and under the control of the CPU. A process consumes computing resources such as memory, usually RAM (random-access memory), and CPU registers for fast access.
Thread
Threads run within a process. A process can have more than one thread, and each thread shares the memory and resources of the process. Threads have their own stack but can access shared data. Because of the global interpreter lock (GIL) in Python, using multiple threads is concurrent but not truly parallel: it cannot utilize multiple cores for additional speed.
Differences between using multiple threads and multiple processes
- Threads are a subset of a process.
- Unlike multiprocessing, where each process runs with its own memory, multiple threads within a process share memory, which typically makes them faster and more efficient.
- Threads are more vulnerable to problems caused by other threads in the same process.
- Because of the global interpreter lock in Python, threads can only run on a single CPU core at a time. Therefore, multithreading can be concurrent but not truly parallel, i.e. it cannot utilize 2+ CPU cores.
- For IO-bound tasks this doesn’t matter much, but it prevents multithreading from gaining performance in compute-intensive tasks.
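To make the shared-memory point concrete, here is a minimal sketch (the names `shared` and `bump` are mine, purely for illustration): four threads increment a single counter that lives in their common process memory, so every update is visible to all of them. Separate processes would each work on their own copy instead.

```python
import threading

# Threads live inside one process and share its memory: they can all
# mutate the same Python object.
shared = {"count": 0}
lock = threading.Lock()

def bump(n):
    for _ in range(n):
        with lock:                 # the GIL does not make += atomic, so lock it
            shared["count"] += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["count"])             # 4000: every thread updated the same dict
```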
Simple usage in Python
Let’s apply multiprocessing and multithreading to a simple work_func function that takes 2 seconds to complete. Applying work_func to 12 numbers sequentially would take 12 * 2 = 24 seconds. Distributing this load across 4 processes/threads can reduce the time to 24 / 4 = 6 seconds.
Multiprocessing
from multiprocessing import Pool
import time, os
def work_func(x):
    print("work_func:", x, "PID", os.getpid())
    time.sleep(2)
    return x**5

if __name__ == "__main__":
    start = time.time()
    cpu = 4
    with Pool(cpu) as p:
        print(p.map(work_func, range(0, 12)))
    # or simply:
    # pool = Pool(cpu)
    # print(pool.map(work_func, range(0, 12)))
    print("***run time(sec) :", time.time() - start)
[Out]: work_func: 0 PID 32411
work_func: 1 PID 32412
work_func: 2 PID 32414
work_func: 3 PID 32413
work_func: 4 PID 32411
work_func: 5 PID 32412
work_func: 6 PID 32414
work_func: 7 PID 32413
work_func: 8 PID 32411
work_func: 9 PID 32412
work_func: 10 PID 32414
work_func: 11 PID 32413
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049, 100000, 161051]
***run time(sec) : 6.1312620639801025
Multithreading
from multiprocessing.pool import ThreadPool
import time, os
def work_func(x):
    print("work_func:", x, "PID", os.getpid())
    time.sleep(2)
    return x**5

if __name__ == "__main__":
    start = time.time()
    cpu = 4
    with ThreadPool(cpu) as p:
        print(p.map(work_func, range(0, 12)))
    print("***run time(sec) :", time.time() - start)
[Out]: work_func: 0 PID 32395
work_func: 1 PID 32395
work_func: 2 PID 32395
work_func: 3 PID 32395
work_func: 4 PID 32395
work_func: 5 PID 32395
work_func: 6 PID 32395
work_func: 7 PID 32395
work_func: 8 PID 32395
work_func: 9 PID 32395
work_func: 11 PID 32395
work_func: 10 PID 32395
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049, 100000, 161051]
***run time(sec) : 6.024605989456177
Note that os.getpid returns the same value when we use multiple threads, which means they all run within the same process. Using multiprocessing with 4 processes launches 4 different processes with PIDs between 32411 and 32414. Now, let’s measure how they perform differently on more realistic benchmarks.
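As a side note, the standard library’s concurrent.futures module offers an equivalent, slightly more modern interface for the same pattern. A thread-based sketch (with a shorter sleep for brevity; swap in ProcessPoolExecutor under an `if __name__ == "__main__":` guard for the multiprocessing case):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def work_func(x):
    time.sleep(0.1)                    # stand-in for the 2 seconds of work
    return x**5

# ThreadPoolExecutor.map mirrors ThreadPool.map from the examples above
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(work_func, range(12)))
print(results)
```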
Example 1. Compute-bound task
# f2 is applied to 10,000,000 random integers in [1, 10]
def f2(x):
    # very inefficient x^2
    s = 0
    for _ in range(x):
        s += x
    return s
For the compute-intensive case, I implemented a very inefficient function that computes x² with a for-loop and measured the performance on 10M integers. This is not a rigorous, controlled comparison, and there are many possible reasons for the observed differences in performance; the results will vary with the device used for testing.
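The shape of the comparison can be sketched as below, scaled down so it runs in a moment (the `bench` helper is a name I’m introducing here, not part of the original benchmark):

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def f2(x):
    # very inefficient x^2: a pure-Python loop, so the GIL is the bottleneck
    s = 0
    for _ in range(x):
        s += x
    return s

def bench(pool_cls, workers, data):
    # time pool_cls.map(f2, data) with the given number of workers
    start = time.time()
    with pool_cls(workers) as p:
        out = p.map(f2, data)
    return out, time.time() - start

if __name__ == "__main__":
    data = [10_000] * 200              # scaled down from the 10M-integer run
    thread_out, t_thread = bench(ThreadPool, 4, data)
    proc_out, t_proc = bench(Pool, 4, data)
    assert thread_out == proc_out      # same results either way
    print(f"threads: {t_thread:.2f}s   processes: {t_proc:.2f}s")
```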
Results:
- Using multithreading doesn’t give any performance boost because of the global interpreter lock.
- Though there is some additional latency compared to the theoretical performance (latency at 1 process divided by n), multiprocessing gives consistent speed boosts until 4 processes.
- This was interesting because my local CPU should have 8 cores (os.cpu_count() gives 8), and I expected the performance of multiprocessing to keep improving up to 8 processes. However, a later investigation showed that I actually only have 4 physical cores.
Example 2. IO-bound task
def make_random_png(base_path: str, n: int = 100) -> None:
    # make n random images with random sizes between 500 and 1000
    for i in range(n):
        width = random.randint(500, 1000)
        height = random.randint(500, 1000)
        img = np.random.randint(0, 255, (height, width, 3), dtype=np.uint8)
        img = Image.fromarray(img)
        img.save(os.path.join(base_path, f"{i}.png"))

def convert_to_jpeg(img_path: str) -> str:
    # convert a single image to jpeg
    # print(f"converting: {img_path}")
    target_path = img_path.replace("png_files", "jpeg_files")
    img = Image.open(img_path)
    img.save(target_path, "JPEG", quality=20)
    return target_path
...
png_path = f"{base_path}/png_files"
make_random_png(png_path, 100)
...
with pool_type(processes) as p:
    out = p.map(convert_to_jpeg, glob(png_path + "/*.png"))
...
In this example, I measured the time taken to read PNG files and write them back in JPEG format. I used 100 random images with widths/heights between 500 and 1,000. The latency of this task is mostly determined by how quickly images can be read and written.
Results:
- Since the speed is mostly bottlenecked by IO instead of compute in this example, threading gives significant boosts.
- Multithreading is much faster than multiprocessing.
- Again, since this task is IO-bound, the speed keeps improving slightly even beyond 4 or 8 threads.
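The same effect can be reproduced without the image dependencies. The stand-in sketch below copies files on disk, which is IO-bound in the same way, so threads can overlap the waits; the helper names are mine, not from the benchmark above:

```python
import os
import shutil
import tempfile
import time
from multiprocessing.pool import ThreadPool

def copy_file(src: str) -> str:
    # IO-bound stand-in for convert_to_jpeg: read a file, write a copy
    dst = src + ".copy"
    shutil.copyfile(src, dst)
    return dst

base = tempfile.mkdtemp()
paths = []
for i in range(20):
    path = os.path.join(base, f"{i}.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(256_000))   # ~256 KB of random bytes per file
    paths.append(path)

start = time.time()
with ThreadPool(8) as p:               # threads overlap the disk waits
    copies = p.map(copy_file, paths)
print(len(copies), "copies in", round(time.time() - start, 3), "sec")
```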
Spawn, fork, and threading
The Python multiprocessing module has two main ways of starting and operating multiple processes: forking and spawning. Briefly speaking,
Spawn:
- Allocates new resources to a subprocess as needed, without inheriting resources from the main process.
- Slow, but safe.
- The default start method on Windows, and on macOS since Python 3.8.
Fork:
- Shares the resources (memory) of the main process with the subprocess and starts it.
- Fast, but dangerous.
- The default start method on Linux up to Python 3.13; not available on Windows.
For more detail on the concepts and the reasons why Python migrated away from fork, refer to this amazing article by Itamar Turner-Trauring. Now, in terms of code, how can we switch between them, and how do they actually behave?
from multiprocessing import Pool, get_context
from multiprocessing.pool import ThreadPool
import time, os
import psutil

def work_func(x):
    p = psutil.Process()
    print("work_func:", x, "PID", os.getpid(), " | cpu_id", p.cpu_num())
    time.sleep(1)
    return x**5

p = psutil.Process()
print("> Started, __name__:", __name__, "\tPID", os.getpid(), "\tcpu_affinity", psutil.cpu_count())

if __name__ == "__main__":
    print("psutil_count:", psutil.cpu_count())
    start = time.time()
    cpu = 4
    pool = {
        "thread": ThreadPool,
        "fork": get_context("fork").Pool,
        "spawn": get_context("spawn").Pool,
    }
    print(pool[POOL_METHOD](cpu).map(work_func, range(0, 10)))
    print("***run time(sec) :", time.time() - start)
The code above executes the different multiprocessing methods selected via get_context. Note that psutil.Process().cpu_num() returns the ID of the CPU core on which the function is being executed. Can you guess what the code above will produce for each POOL_METHOD?
thread: the function is always called with a fixed cpu_id value, using the same process.
> Started, __name__: __main__ PID 598 cpu_affinity 4
psutil_count: 4
work_func: 0 PID 598 | cpu_id 2
work_func: 1 PID 598 | cpu_id 2
work_func: 3 PID 598 | cpu_id 2
work_func: 2 PID 598 | cpu_id 2
work_func: 4 PID 598 | cpu_id 2
work_func: 5 PID 598 | cpu_id 2
work_func: 6 PID 598 | cpu_id 2
work_func: 7 PID 598 | cpu_id 2
work_func: 8 PID 598 | cpu_id 2
work_func: 9 PID 598 | cpu_id 2
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0109732151031494
fork: the function is called on 4 different processes (PIDs), each with a different cpu_id value ranging between [0, 3].
> Started, __name__: __main__ PID 581 cpu_affinity 4
psutil_count: 4
work_func: 0 PID 582 | cpu_id 1
work_func: 3 PID 585 | cpu_id 2
work_func: 1 PID 583 | cpu_id 3
work_func: 2 PID 584 | cpu_id 0
work_func: 4 PID 582 | cpu_id 1
work_func: 5 PID 583 | cpu_id 3
work_func: 6 PID 585 | cpu_id 0
work_func: 7 PID 584 | cpu_id 0
work_func: 8 PID 583 | cpu_id 3
work_func: 9 PID 582 | cpu_id 1
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
spawn: the program itself is executed multiple times, each with a different cpu_id. The execution takes slightly longer.
> Started, __name__: __main__ PID 589 cpu_affinity 4
psutil_count: 4
> Started, __name__: __mp_main__ PID 591 cpu_affinity 4
work_func: 0 PID 591 | cpu_id 3
> Started, __name__: __mp_main__ PID 593 cpu_affinity 4
work_func: 1 PID 593 | cpu_id 1
> Started, __name__: __mp_main__ PID 592 cpu_affinity 4
work_func: 2 PID 592 | cpu_id 0
> Started, __name__: __mp_main__ PID 594 cpu_affinity 4
work_func: 3 PID 594 | cpu_id 2
work_func: 4 PID 591 | cpu_id 3
work_func: 5 PID 593 | cpu_id 1
work_func: 6 PID 592 | cpu_id 0
work_func: 7 PID 594 | cpu_id 2
work_func: 8 PID 591 | cpu_id 3
work_func: 9 PID 593 | cpu_id 1
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0889203548431396
I noticed the difference between fork and spawn the hard way. Consider the following code:
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import time, os

def work_func(x):
    print("work_func:", x, "PID", os.getpid())
    time.sleep(1)
    return x**5

print("os_count:", os.cpu_count())
start = time.time()
cpu = 4
pool = Pool(cpu)
print(pool.map(work_func, range(0, 10)))
print("***run time(sec) :", time.time() - start)
After checking that the code above worked in Colab, I copied it locally to a Python file (test.py) and ran !python test.py, expecting the same result. As always, some unexpected errors happened:
File "/Users/sieunpark/Documents/parallelization/test.py", line 14, in <module>
    pool = Pool(cpu)
           ^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
...
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
The error only happened with multiprocessing using Pool, not ThreadPool, and the terminal kept printing the error forever until I force-quit. Weirdly, the error disappeared after wrapping the main block in if __name__ == "__main__":. After searching for an hour, I found out that the start method defaults to spawn in my environment, while it defaults to fork in Colab. As observed above, in spawn mode the Python file is imported again (with a different __name__) for every process, which re-executed the module-level code that launches new processes!
Using psutil to control visible CPU cores
The following example was executed on a Debian 11 machine with 4 CPU cores. Running the baseline code with multiprocessing allocates the workload evenly to all the 4 CPU cores.
from multiprocessing import Pool, get_context
from multiprocessing.pool import ThreadPool
import time, os
import psutil
def work_func(x):
    p = psutil.Process()
    print("work_func:", x, "PID", os.getpid(), " | cpu_id", p.cpu_num())
    time.sleep(1)
    return x**5

p = psutil.Process()
print("> Started, __name__:", __name__, "\tPID", os.getpid(), "\tcpu_affinity", psutil.cpu_count())

if __name__ == "__main__":
    print("psutil_count:", psutil.cpu_count())
    start = time.time()
    cpu = 4
    pool = Pool(cpu)
    print(pool.map(work_func, range(0, 10)))
    print("***run time(sec) :", time.time() - start)
>>> > Started, __name__: __main__ PID 414 cpu_affinity 4
work_func: 0 PID 415 | cpu_id 3
work_func: 1 PID 416 | cpu_id 1
work_func: 2 PID 417 | cpu_id 2
work_func: 3 PID 418 | cpu_id 0
work_func: 4 PID 416 | cpu_id 1
work_func: 5 PID 415 | cpu_id 3
work_func: 6 PID 418 | cpu_id 0
work_func: 7 PID 417 | cpu_id 1
work_func: 8 PID 416 | cpu_id 1
work_func: 9 PID 417 | cpu_id 3
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0126681327819824
The visible CPU cores can be controlled using the psutil library. For example, adding the following code constrains the program to cores [1, 3].
if __name__ == "__main__":
    print("psutil_count:", psutil.cpu_count())
    print("before:", p.cpu_affinity())
    p.cpu_affinity([1, 3])
    print("after:", p.cpu_affinity())
    start = time.time()
    cpu = 4
    pool = get_context("fork").Pool(cpu)
    print(pool.map(work_func, range(0, 10)))
    print("***run time(sec) :", time.time() - start)
>>> > Started, __name__: __main__ PID 424 cpu_affinity 4
psutil_count: 4
before: [0, 1, 2, 3]
after: [1, 3]
work_func: 0 PID 425 | cpu_id 3
work_func: 1 PID 426 | cpu_id 1
work_func: 2 PID 427 | cpu_id 1
work_func: 3 PID 428 | cpu_id 3
work_func: 4 PID 427 | cpu_id 1
work_func: 5 PID 428 | cpu_id 3
work_func: 6 PID 425 | cpu_id 1
work_func: 7 PID 426 | cpu_id 3
work_func: 8 PID 427 | cpu_id 1
work_func: 9 PID 428 | cpu_id 3
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0134823322296143
Jumping back to the problem on my Mac: even though os.cpu_count and multiprocessing.cpu_count told me I had 8 cores, I only saw 4 cores being utilized. I initially thought the visible CPU cores were somehow restricted to 4 because of issues in Python, and that this could perhaps be fixed with psutil as in the example above. Interestingly, checking the CPU count with psutil also gave 4. This is because my i5 has 8 logical processors but only 4 physical cores, which is what matters in this case.
from multiprocessing import cpu_count as mp_cpu_count
import os
import psutil
print("os:", os.cpu_count())
print("mp:", mp_cpu_count())
print("psutil:", psutil.cpu_count())
print("psutil-physical:", psutil.cpu_count(logical=False))
os: 8
mp: 8
psutil: 4
psutil-physical: 4
This might not have mattered anyway, since p.cpu_affinity is not supported on macOS, and this answer suggests that it is currently impossible to set the CPU affinity on my device 😥.
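For what it’s worth, on Linux the standard library itself exposes affinity without psutil, via os.sched_getaffinity/os.sched_setaffinity; these don’t exist on macOS either, which is exactly the limitation above. A guarded sketch:

```python
import os

# os.sched_getaffinity is Linux-only; guard so this also runs elsewhere
if hasattr(os, "sched_getaffinity"):
    visible = os.sched_getaffinity(0)      # CPU ids visible to this process
    print("visible CPUs:", sorted(visible))
    # os.sched_setaffinity(0, {1, 3})      # would pin us to cores 1 and 3
else:
    print("sched_getaffinity is not available on this platform")
```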
In this post, we looked at the characteristics of multiprocessing and multithreading and when each is useful. We also looked at how to use the multiprocessing module and the difference between the spawn and fork start methods. Finally, we briefly looked at psutil for more advanced control.
Useful resources:
- More on processes and threads: https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
- More on Python multiprocessing and the problem of using fork: https://pythonspeed.com/articles/python-multiprocessing/