When Do We Need Multiprocessing?
When you handle tons of text files or images, you might want to use multiprocessing to speed up the processing. An intuitive first attempt in Python looks like this:
import multiprocessing

with multiprocessing.Pool() as p:
    result = p.map(lambda x: x ** 2, range(100))
But unfortunately, this won’t work, because you cannot pass a lambda function or a closure to multiprocessing. You can find the reason in the Why? section below.
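To see the failure concretely, you can catch the exception the snippet raises. This is a minimal sketch: I request the fork start method explicitly (POSIX-only) so the example also runs without an if __name__ == "__main__": guard, and the exact error message varies by Python version.

```python
import multiprocessing

# Fork start method (POSIX-only): no re-import of the main module needed.
ctx = multiprocessing.get_context("fork")

error_name = None
with ctx.Pool(2) as p:
    try:
        p.map(lambda x: x ** 2, range(100))
    except Exception as e:
        # pickle refuses the lambda when the task is serialized
        error_name = type(e).__name__

print(error_name)
```

The map call fails before any real work happens, because the task cannot even be serialized for the child processes.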
When you google this problem, you’ll find suggestions to use joblib, pathos, or similar libraries. But why should we pull in an external library for such a small snippet? (Of course, these libraries are quite awesome.)
So today, I will first explain multiprocessing’s restriction: why we cannot use multiprocessing with a lambda function. Then I will introduce a slightly tricky but pure-Python way to make it work. (I will explain the details of the code above.)
Let’s get started!
Here, I’ll give you a brief explanation of the reason. In Python’s multiprocessing, the parent process sends each task to a child process for execution. The content of a task is below:
yield (func, x)
You can find the original code here.
func is the function and x is the item in the iterable, both passed to Pool.map.
Take a look at these lines in pool.py, where the task is sent:

try:
    put(task)
except Exception as e:
    ...
put(task) sends the task to a child process. (Technically this isn’t precise, but I think this explanation is enough for us to understand what’s going on inside.) And the task contains the function and the item, as I mentioned above.
The implementation of put is here:

def send(self, obj):
    """Send a (picklable) object"""
    ...
    self._send_bytes(_ForkingPickler.dumps(obj))
Now we see it: _ForkingPickler inherits from Python’s standard pickle.Pickler, so multiprocessing wants picklable Python objects. When we pickle a Python object, we get the bytes version of that object:
>>> import pickle
>>> pickle.dumps([0, 1, 2])
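A plain list pickles fine, but a lambda has no importable name for pickle to record, so pickling it fails. This is a quick way to check whether something can cross the process boundary at all:

```python
import pickle

# A picklable object round-trips through its bytes representation...
data = pickle.dumps([0, 1, 2])
assert pickle.loads(data) == [0, 1, 2]

# ...but a lambda cannot be looked up by name later, so pickle gives up.
failed = False
try:
    pickle.dumps(lambda x: x ** 2)
except pickle.PicklingError:
    failed = True

print(failed)  # True
```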
But before sending the task to the child processes, why should we pickle (serialize) it? In other words, why should we send the bytes version?
This is because multiprocessing uses a pipe to send the task (technically, for inter-process communication), and a pipe doesn’t accept Python objects. We can see that multiprocessing uses a pipe here, and sends the task to it in this line:
n = write(self._handle, buf)
self._handle is a writable file descriptor, and buf is the pickled Python object. This is why the function and each item in the iterable must be picklable.
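The same constraint is easy to reproduce with a raw pipe. Here is a minimal sketch using os.pipe directly instead of multiprocessing’s Connection: os.write only accepts bytes, so the object must be serialized first.

```python
import os
import pickle

r, w = os.pipe()

buf = pickle.dumps([0, 1, 2])   # serialize first: a pipe only carries bytes
n = os.write(w, buf)            # the same role as write(self._handle, buf)
restored = pickle.loads(os.read(r, n))

print(restored)  # [0, 1, 2]
```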
The solution might be tricky, but it’s really simple. Here is the core idea:
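The idea can be sketched as follows. This is a minimal reconstruction assuming the fork start method (POSIX-only): worker_init and worker follow the description below, while the wrapper name xmap and the global slot _func are my own illustrative choices.

```python
import multiprocessing

_func = None  # global slot for the "unpicklable" function

def worker_init(func):
    global _func
    _func = func          # runs once in each child, before any task

def worker(x):
    return _func(x)       # picklable by name; calls the lambda via the global

def xmap(func, iterable):
    # Only `worker` crosses the pipe; `func` reaches the children through
    # the Pool initializer, whose arguments are inherited via os.fork.
    ctx = multiprocessing.get_context("fork")  # POSIX-only
    with ctx.Pool(initializer=worker_init, initargs=(func,)) as p:
        return p.map(worker, iterable)

print(xmap(lambda x: x ** 2, range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The design choice is that the picklable module-level function worker is what actually travels through the pipe, while the lambda never has to be pickled at all.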
You cannot directly pass a lambda function to Pool.map, but you can call a lambda function inside the function you do pass to Pool.map.
worker_init(lambda x: x ** 2) is executed before worker(x) is executed, so worker is guaranteed to be able to refer to the lambda function through a global variable.
worker is executed in the child processes, which are created by os.fork in Python. os.fork duplicates the main process, so worker can access the parent’s global variables, which include the lambda function. (Note that this relies on the fork start method; it does not work with spawn, the default on Windows.)
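To see why fork matters, here is a tiny POSIX-only sketch: a child created by os.fork inherits a copy of the parent’s memory, lambda included.

```python
import os

square = lambda x: x ** 2   # lives in the parent's globals

pid = os.fork()
if pid == 0:
    # Child: the forked copy of memory includes the lambda.
    os._exit(0 if square(3) == 9 else 1)

_, status = os.waitpid(pid, 0)
child_ok = (os.WEXITSTATUS(status) == 0)
print(child_ok)  # True
```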
Now you can use multiprocessing with a lambda function. This isn’t practical in many cases, but it helps us understand how to work around multiprocessing’s restriction. Since the code here isn’t really practical as-is, I changed it slightly and show the result in the TL;DR section above.
That’s all for this post. Thank you for reading! I introduced a way to use multiprocessing with a lambda function. I hope this post helps you!