An Introduction to Multiprocessing Using Python
Is multithreading or multiprocessing faster than the other?
Overview
In Python, multiprocessing and multithreading are used primarily to improve performance. Both help maximize the utilization of a system’s CPU and other resources. By distributing tasks across multiple threads or processes, they enable parallel execution, which can lead to significant performance improvements. This is particularly beneficial for computationally intensive applications or when handling large datasets.
This article is organized in the following order:
- Brief concepts of processes and threads, and the differences between them.
- Usage in Python.
- Performance comparison on compute-bound and IO-bound tasks.
- The fork and spawn start methods.
- Controlling visible CPU cores with psutil.
Process
A program is static: it is the data and instructions that need to be processed and executed. A process is the actual program loaded into memory and under the control of the CPU. A process consumes computing resources such as memory, usually RAM (random-access memory), and CPU registers for fast access.
Thread
Threads run within a process. A process can have more than one thread, and each thread shares the memory and resources of the process. Threads have their own stack but can access shared data. Because of the global interpreter lock (GIL) in Python, using multiple threads is concurrent but not truly parallel: it cannot utilize multiple cores for additional speed.
Differences between using multiple threads and multiple processes
- Threads are a subset of a process.
- Unlike multiprocessing, where each process runs with its own memory, multiple threads within a process share memory, which typically makes them faster and more efficient.
- Threads are more vulnerable to problems caused by other threads in the same process.
- Because of the global interpreter lock in Python, threads can only run on a single CPU core at a time. Therefore, multithreading can be concurrent but not truly parallel, i.e. it cannot utilize 2+ CPU cores.
- For IO-bound tasks this doesn’t matter much, but it prevents multithreading from gaining performance in compute-intensive tasks.
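To make the shared-memory point concrete, here is a minimal sketch (the names `shared` and `bump` are mine, purely for illustration): four threads increment a single counter that lives in their common process memory, so every update is visible to all of them. Separate processes would each work on their own copy instead.

```python
import threading

# Threads live inside one process and share its memory: they can all
# mutate the same Python object.
shared = {"count": 0}
lock = threading.Lock()

def bump(n):
    for _ in range(n):
        with lock:                 # the GIL does not make += atomic, so lock it
            shared["count"] += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["count"])             # 4000: every thread updated the same dict
```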
Simple usage in Python
Let’s apply multiprocessing and multithreading to a simple work_func function that takes 2 seconds to complete. Applying work_func to 12 numbers sequentially would take 12 * 2 = 24 seconds. Distributing this load across 4 processes/threads can reduce the time to 24 / 4 = 6 seconds.
Multiprocessing
from multiprocessing import Pool
import time, os
def work_func(x):
    print("work_func:", x, "PID", os.getpid())
    time.sleep(2)
    return x**5

if __name__ == "__main__":
    start = time.time()
    cpu = 4
    with Pool(cpu) as p:
        print(p.map(work_func, range(0, 12)))
    # or simply:
    # pool = Pool(cpu)
    # print(pool.map(work_func, range(0, 12)))
    print("***run time(sec) :", time.time() - start)
[Out]: work_func: 0 PID 32411
work_func: 1 PID 32412
work_func: 2 PID 32414
work_func: 3 PID 32413
work_func: 4 PID 32411
work_func: 5 PID 32412
work_func: 6 PID 32414
work_func: 7 PID 32413
work_func: 8 PID 32411
work_func: 9 PID 32412
work_func: 10 PID 32414
work_func: 11 PID 32413
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049, 100000, 161051]
***run time(sec) : 6.1312620639801025
Multithreading
from multiprocessing.pool import ThreadPool
import time, os
def work_func(x):
    print("work_func:", x, "PID", os.getpid())
    time.sleep(2)
    return x**5

if __name__ == "__main__":
    start = time.time()
    cpu = 4
    with ThreadPool(cpu) as p:
        print(p.map(work_func, range(0, 12)))
    print("***run time(sec) :", time.time() - start)
[Out]: work_func: 0 PID 32395
work_func: 1 PID 32395
work_func: 2 PID 32395
work_func: 3 PID 32395
work_func: 4 PID 32395
work_func: 5 PID 32395
work_func: 6 PID 32395
work_func: 7 PID 32395
work_func: 8 PID 32395
work_func: 9 PID 32395
work_func: 11 PID 32395
work_func: 10 PID 32395
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049, 100000, 161051]
***run time(sec) : 6.024605989456177
Note that os.getpid returns the same value when we use multiple threads, which means they all run within the same process. Using multiprocessing with 4 processes launches 4 different processes with PIDs between 32411 and 32414. Now, let’s measure how they perform differently on more realistic benchmarks.
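As a side note, the standard library’s concurrent.futures module offers an equivalent, slightly more modern interface for the same pattern. A thread-based sketch (with a shorter sleep for brevity; swap in ProcessPoolExecutor under an `if __name__ == "__main__":` guard for the multiprocessing case):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def work_func(x):
    time.sleep(0.1)                    # stand-in for the 2 seconds of work
    return x**5

# ThreadPoolExecutor.map mirrors ThreadPool.map from the examples above
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(work_func, range(12)))
print(results)
```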
Example 1. Compute-bound task
# f2 is applied to 10,000,000 random integers in [1, 10]
def f2(x):
    # very inefficient x^2
    s = 0
    for _ in range(x):
        s += x
    return s
For the compute-intensive case, I implemented a very inefficient function that computes x² with a for-loop and measured the performance on 10M integers. This is not a rigorous, controlled comparison, and there are many possible reasons for the observed differences in performance; the results will vary with the device used for testing.
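The shape of the comparison can be sketched as below, scaled down so it runs in a moment (the `bench` helper is a name I’m introducing here, not part of the original benchmark):

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def f2(x):
    # very inefficient x^2: a pure-Python loop, so the GIL is the bottleneck
    s = 0
    for _ in range(x):
        s += x
    return s

def bench(pool_cls, workers, data):
    # time pool_cls.map(f2, data) with the given number of workers
    start = time.time()
    with pool_cls(workers) as p:
        out = p.map(f2, data)
    return out, time.time() - start

if __name__ == "__main__":
    data = [10_000] * 200              # scaled down from the 10M-integer run
    thread_out, t_thread = bench(ThreadPool, 4, data)
    proc_out, t_proc = bench(Pool, 4, data)
    assert thread_out == proc_out      # same results either way
    print(f"threads: {t_thread:.2f}s   processes: {t_proc:.2f}s")
```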
Results:
- Using multithreading doesn’t give any performance boost because of the global interpreter lock.
- Though there is some additional latency compared to the theoretical performance (latency at 1 process divided by n), multiprocessing gives consistent speed boosts until 4 processes.
- This was interesting because my local CPU should have 8 cores (os.cpu_count() gives 8), and I expected the performance of multiprocessing to keep improving up to 8 processes. However, a later investigation showed that I actually only have 4 physical cores.
Example 2. IO-bound task
def make_random_png(base_path: str, n: int = 100) -> None:
    # make n random images with random sizes between 500 and 1000
    for i in range(n):
        width = random.randint(500, 1000)
        height = random.randint(500, 1000)
        img = np.random.randint(0, 255, (height, width, 3), dtype=np.uint8)
        img = Image.fromarray(img)
        img.save(os.path.join(base_path, f"{i}.png"))

def convert_to_jpeg(img_path: str) -> str:
    # convert a single image to jpeg
    # print(f"converting: {img_path}")
    target_path = img_path.replace("png_files", "jpeg_files")
    img = Image.open(img_path)
    img.save(target_path, "JPEG", quality=20)
    return target_path
...
png_path = f"{base_path}/png_files"
make_random_png(png_path, 100)
...
with pool_type(processes) as p:
    out = p.map(convert_to_jpeg, glob(png_path + "/*.png"))
...
In this example, I measured the time taken to read PNG files and write them back in JPEG format. I used 100 random images with widths/heights between 500 and 1,000. The latency of this task is mostly determined by how quickly images can be read and written.
Results:
- Since the speed is mostly bottlenecked by IO instead of compute in this example, threading gives significant boosts.
- Multithreading is much faster than multiprocessing.
- Again, since this task is IO-bound, the speed keeps improving slightly even beyond 4 or 8 threads.
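The same effect can be reproduced without the image dependencies. The stand-in sketch below copies files on disk, which is IO-bound in the same way, so threads can overlap the waits; the helper names are mine, not from the benchmark above:

```python
import os
import shutil
import tempfile
import time
from multiprocessing.pool import ThreadPool

def copy_file(src: str) -> str:
    # IO-bound stand-in for convert_to_jpeg: read a file, write a copy
    dst = src + ".copy"
    shutil.copyfile(src, dst)
    return dst

base = tempfile.mkdtemp()
paths = []
for i in range(20):
    path = os.path.join(base, f"{i}.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(256_000))   # ~256 KB of random bytes per file
    paths.append(path)

start = time.time()
with ThreadPool(8) as p:               # threads overlap the disk waits
    copies = p.map(copy_file, paths)
print(len(copies), "copies in", round(time.time() - start, 3), "sec")
```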
Spawn, fork, and threading
The Python multiprocessing module has two main ways of starting and operating multiple processes: forking and spawning. Briefly speaking,
Spawn:
- Allocates new resources to a subprocess as needed, without inheriting resources from the main process.
- Slow, but safe.
- The default start method on Windows, and on macOS since Python 3.8.
Fork:
- Shares the resources (memory) of the main process with the subprocess and starts it.
- Fast, but dangerous.
- The default start method on Linux up to Python 3.13; not available on Windows.
For more detail on the concepts and the reasons why Python migrated away from fork, refer to this amazing article by Itamar Turner-Trauring. Now, in terms of code, how can we switch between them, and how do they actually behave?
from multiprocessing import Pool, get_context
from multiprocessing.pool import ThreadPool
import time, os
import psutil

def work_func(x):
    p = psutil.Process()
    print("work_func:", x, "PID", os.getpid(), " | cpu_id", p.cpu_num())
    time.sleep(1)
    return x**5

p = psutil.Process()
print("> Started, __name__:", __name__, "\tPID", os.getpid(), "\tcpu_affinity", psutil.cpu_count())

if __name__ == "__main__":
    print("psutil_count:", psutil.cpu_count())
    start = time.time()
    cpu = 4
    pool = {
        "thread": ThreadPool,
        "fork": get_context("fork").Pool,
        "spawn": get_context("spawn").Pool,
    }
    print(pool[POOL_METHOD](cpu).map(work_func, range(0, 10)))
    print("***run time(sec) :", time.time() - start)
The code above executes the different multiprocessing methods selected via get_context. Note that psutil.Process().cpu_num() returns the ID of the CPU core on which the function is being executed. Can you guess what the code above will produce for each POOL_METHOD?
thread: the function is always called with a fixed cpu_id value, using the same process.
> Started, __name__: __main__ PID 598 cpu_affinity 4
psutil_count: 4
work_func: 0 PID 598 | cpu_id 2
work_func: 1 PID 598 | cpu_id 2
work_func: 3 PID 598 | cpu_id 2
work_func: 2 PID 598 | cpu_id 2
work_func: 4 PID 598 | cpu_id 2
work_func: 5 PID 598 | cpu_id 2
work_func: 6 PID 598 | cpu_id 2
work_func: 7 PID 598 | cpu_id 2
work_func: 8 PID 598 | cpu_id 2
work_func: 9 PID 598 | cpu_id 2
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0109732151031494
fork: the function is called on 4 different processes (PIDs), each with a different cpu_id value ranging between [0, 3].
> Started, __name__: __main__ PID 581 cpu_affinity 4
psutil_count: 4
work_func: 0 PID 582 | cpu_id 1
work_func: 3 PID 585 | cpu_id 2
work_func: 1 PID 583 | cpu_id 3
work_func: 2 PID 584 | cpu_id 0
work_func: 4 PID 582 | cpu_id 1
work_func: 5 PID 583 | cpu_id 3
work_func: 6 PID 585 | cpu_id 0
work_func: 7 PID 584 | cpu_id 0
work_func: 8 PID 583 | cpu_id 3
work_func: 9 PID 582 | cpu_id 1
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
spawn: the program itself is executed multiple times, each with a different cpu_id. The execution takes slightly longer.
> Started, __name__: __main__ PID 589 cpu_affinity 4
psutil_count: 4
> Started, __name__: __mp_main__ PID 591 cpu_affinity 4
work_func: 0 PID 591 | cpu_id 3
> Started, __name__: __mp_main__ PID 593 cpu_affinity 4
work_func: 1 PID 593 | cpu_id 1
> Started, __name__: __mp_main__ PID 592 cpu_affinity 4
work_func: 2 PID 592 | cpu_id 0
> Started, __name__: __mp_main__ PID 594 cpu_affinity 4
work_func: 3 PID 594 | cpu_id 2
work_func: 4 PID 591 | cpu_id 3
work_func: 5 PID 593 | cpu_id 1
work_func: 6 PID 592 | cpu_id 0
work_func: 7 PID 594 | cpu_id 2
work_func: 8 PID 591 | cpu_id 3
work_func: 9 PID 593 | cpu_id 1
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0889203548431396
I noticed the difference between fork and spawn the hard way. Consider the following code:
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import time, os

def work_func(x):
    print("work_func:", x, "PID", os.getpid())
    time.sleep(1)
    return x**5

print("os_count:", os.cpu_count())
start = time.time()
cpu = 4
pool = Pool(cpu)
print(pool.map(work_func, range(0, 10)))
print("***run time(sec) :", time.time() - start)
After checking that the code above worked in Colab, I copied it locally to a Python file (test.py) and ran !python test.py, expecting the same result. As always, some unexpected errors happened:
File "/Users/sieunpark/Documents/parallelization/test.py", line 14, in <module>
    pool = Pool(cpu)
           ^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 329, in _repopulate_pool_static
    w.start()
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
...
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
The error only happened with multiprocessing using Pool, not ThreadPool, and the terminal kept printing the error forever until I force-quit. Weirdly, the error disappeared after wrapping the main block in if __name__ == "__main__":. After searching for an hour, I found out that the start method defaults to spawn in my environment, while it defaults to fork in Colab. As observed above, in spawn mode the Python file is imported again (with a different __name__) for every process, which re-executed the module-level code that launches new processes!
Using psutil to control visible CPU cores
The following example was executed on a Debian 11 machine with 4 CPU cores. Running the baseline code with multiprocessing allocates the workload evenly to all the 4 CPU cores.
from multiprocessing import Pool, get_context
from multiprocessing.pool import ThreadPool
import time, os
import psutil
def work_func(x):
    p = psutil.Process()
    print("work_func:", x, "PID", os.getpid(), " | cpu_id", p.cpu_num())
    time.sleep(1)
    return x**5

p = psutil.Process()
print("> Started, __name__:", __name__, "\tPID", os.getpid(), "\tcpu_affinity", psutil.cpu_count())

if __name__ == "__main__":
    print("psutil_count:", psutil.cpu_count())
    start = time.time()
    cpu = 4
    pool = Pool(cpu)
    print(pool.map(work_func, range(0, 10)))
    print("***run time(sec) :", time.time() - start)
>>> > Started, __name__: __main__ PID 414 cpu_affinity 4
work_func: 0 PID 415 | cpu_id 3
work_func: 1 PID 416 | cpu_id 1
work_func: 2 PID 417 | cpu_id 2
work_func: 3 PID 418 | cpu_id 0
work_func: 4 PID 416 | cpu_id 1
work_func: 5 PID 415 | cpu_id 3
work_func: 6 PID 418 | cpu_id 0
work_func: 7 PID 417 | cpu_id 1
work_func: 8 PID 416 | cpu_id 1
work_func: 9 PID 417 | cpu_id 3
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0126681327819824
The visible CPU cores can be controlled using the psutil library. For example, adding the following code constrains the program to cores [1, 3].
if __name__ == "__main__":
    print("psutil_count:", psutil.cpu_count())
    print("before:", p.cpu_affinity())
    p.cpu_affinity([1, 3])
    print("after:", p.cpu_affinity())
    start = time.time()
    cpu = 4
    pool = get_context("fork").Pool(cpu)
    print(pool.map(work_func, range(0, 10)))
    print("***run time(sec) :", time.time() - start)
>>> > Started, __name__: __main__ PID 424 cpu_affinity 4
psutil_count: 4
before: [0, 1, 2, 3]
after: [1, 3]
work_func: 0 PID 425 | cpu_id 3
work_func: 1 PID 426 | cpu_id 1
work_func: 2 PID 427 | cpu_id 1
work_func: 3 PID 428 | cpu_id 3
work_func: 4 PID 427 | cpu_id 1
work_func: 5 PID 428 | cpu_id 3
work_func: 6 PID 425 | cpu_id 1
work_func: 7 PID 426 | cpu_id 3
work_func: 8 PID 427 | cpu_id 1
work_func: 9 PID 428 | cpu_id 3
[0, 1, 32, 243, 1024, 3125, 7776, 16807, 32768, 59049]
***run time(sec) : 3.0134823322296143
Jumping back to the problem on my Mac: even though os.cpu_count and multiprocessing.cpu_count told me I had 8 cores, I only saw 4 cores being utilized. I initially thought the visible CPU cores were somehow restricted to 4 because of issues in Python, and that this could perhaps be fixed with psutil as in the example above. Interestingly, checking the CPU count with psutil also gave 4. This is because my i5 has 8 logical processors but only 4 physical cores, which is what matters in this case.
from multiprocessing import cpu_count as mp_cpu_count
import os
import psutil
print("os:", os.cpu_count())
print("mp:", mp_cpu_count())
print("psutil:", psutil.cpu_count())
print("psutil-physical:", psutil.cpu_count(logical=False))
os: 8
mp: 8
psutil: 4
psutil-physical: 4
This might not have mattered anyway, since p.cpu_affinity is not supported on macOS, and this answer suggests that it is currently impossible to set the CPU affinity on my device 😥.
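For what it’s worth, on Linux the standard library itself exposes affinity without psutil, via os.sched_getaffinity/os.sched_setaffinity; these don’t exist on macOS either, which is exactly the limitation above. A guarded sketch:

```python
import os

# os.sched_getaffinity is Linux-only; guard so this also runs elsewhere
if hasattr(os, "sched_getaffinity"):
    visible = os.sched_getaffinity(0)      # CPU ids visible to this process
    print("visible CPUs:", sorted(visible))
    # os.sched_setaffinity(0, {1, 3})      # would pin us to cores 1 and 3
else:
    print("sched_getaffinity is not available on this platform")
```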
In this post, we looked at the characteristics of multiprocessing and multithreading and when each is useful. We also looked at how to use the multiprocessing module and the difference between the spawn and fork start methods. Finally, we briefly looked at psutil for more advanced control.
Useful resources:
- More on processes and threads: https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/
- More on Python multiprocessing and the problem of using fork: https://pythonspeed.com/articles/python-multiprocessing/