Multiprocessing Serialization in Python with Pickle

Over the last few weeks working on diffoscope I’ve come to realise how far we, as humankind, still are from being able to easily parallelize large or complex chunks of code. In this article I’ll explore a little of what multiprocessing does with your data behind the curtains, and why knowing this makes our lives easier when trying to parallelize code.

Since we’re diving into specifics, these are some sources to learn more about the terminology and basic concepts of parallelism in Python:

A Briefing on Processes and Threads

Processes have at least one thread, and all the threads inside a process share the same memory space. Threads are lighter but cannot be interrupted or killed from the outside, whereas processes are heavier and take longer to spawn. Parallelism is achieved when two or more tasks are executed simultaneously; this can be done with either threads or processes.
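
As a quick illustration of that memory-sharing difference, here is a minimal sketch (the function and variable names are just for this example): a thread can see and change the parent’s variables, while a child process works on its own copy.

import threading
import multiprocessing

counter = 0

def bump():
    # Increment the module-level counter.
    global counter
    counter += 1

if __name__ == '__main__':
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    print(counter)  # 1 -- the thread shares this process's memory

    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print(counter)  # still 1 -- the child incremented its own copy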

In Python, threading is the module for working with threads, while multiprocessing works with processes (and with threads underneath it). The latter offers many abstractions to make our lives easier, such as Pools and Managers, and is able to supervise execution without us having to deal with Python’s Hardest Problem.
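
For instance, a Pool lets us hand work to several processes without managing them by hand. The sketch below is just an illustration (not code from diffoscope); note that the function given to the pool is defined at module level, which matters for what comes next.

from multiprocessing import Pool

def square(x):
    # A module-level function: importable by name, so it travels
    # between processes without trouble.
    return x * x

if __name__ == '__main__':
    # Pool hides the process management; arguments and results are
    # shipped between processes behind the scenes.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))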

Since processes do not share memory space, they need a different and more complex way of exchanging information than threads do. As Matthew Rocklin mentions in his article Parallelism and Serialization, processes send information much like the teleporters in Star Trek: a package is converted into something the teleporter understands, sent, and transformed back into its original form on arrival. This conversion is called serialization, and Python uses the Pickle library to translate the object back and forth.
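
To make the teleporter analogy concrete, here is a small sketch (names are illustrative) in which a dictionary is handed to a child process through a Queue; under the hood it is pickled in the parent, sent through a pipe, and unpickled in the child.

from multiprocessing import Process, Queue

def worker(queue):
    # The dict arrives here as a copy: it was serialized in the parent
    # and deserialized in this child process.
    message = queue.get()
    print(message['greeting'])

if __name__ == '__main__':
    queue = Queue()
    child = Process(target=worker, args=(queue,))
    child.start()
    queue.put({'greeting': 'hello from the parent'})
    child.join()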

How far does Pickling go?

Pickle is able to serialize and deserialize Python objects to and from a byte stream. It works well in most cases, with some reservations. When multiprocessing spawns a process, Pickle is called by default to handle the data transfer. A simple example of how this is done follows:

>>> import pickle
>>> class Foo(object):
...     @property
...     def bar(self):
...         return 'bar'
...
>>> f = Foo()
>>> p = pickle.dumps(f)
>>> p
b'\x80\x03c__main__\nFoo\nq\x00)\x81q\x01.'
>>> u = pickle.loads(p)
>>> u.bar
'bar'

As you may notice, a pickled Foo() boils down to its qualified class name. According to Pickle’s documentation:

Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. [2] This means that only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised.
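
We can peek at this with pickletools: disassembling the pickle of a function shows only a reference to its module and name, not its code (the function below is just an example).

import pickle
import pickletools

def greet():
    return 'hi'

# The disassembly shows a GLOBAL/STACK_GLOBAL opcode referencing
# '__main__' and 'greet' -- no bytecode is stored.
pickletools.dis(pickle.dumps(greet))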

That said, we can imagine scenarios in which some objects won’t be picklable. Lambdas are one of them, since they are not importable by name.

>>> f = lambda x: x**2
>>> p = pickle.dumps(f)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7fbd12f29510>: attribute lookup <lambda> on __main__ failed

And, more commonly, inner classes:

>>> class Foo():
...     def __init__(self):
...         class Bar():
...             @property
...             def bar(self):
...                 return 'bar'
...         self.bar = Bar()
...
>>> f = Foo()
>>> p = pickle.dumps(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: Can't pickle local object 'Foo.__init__.<locals>.Bar'

Back to Multiprocessing

Ok, so we know Pickle’s limitations. What about multiprocessing? Well, since multiprocessing uses Pickle, it inherits the same flaws. And, as much as we’d like to believe there are just a couple of cases in which Pickle fails, achieving parallelization is a complex task that may expose many unexpected errors.
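
For example, handing a lambda to a Pool typically fails with the same kind of error we saw above, only now it surfaces from inside multiprocessing (a minimal sketch):

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(2) as pool:
        # Raises a PicklingError: the lambda must be pickled to reach
        # the worker processes, and it can't be.
        pool.map(lambda x: x ** 2, range(4))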

When your code is not picklable by default, you have two options: rewrite it so that its classes are importable by name, or try some alternatives. I’ll be talking about the latter in a future post.
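
As a rough sketch of the first option, the inner-class example from before could be rewritten so that Bar lives at module level and is therefore importable by name:

import pickle

class Bar(object):
    # Defined at module level: picklable, because it can be found
    # again by its qualified name when unpickling.
    @property
    def bar(self):
        return 'bar'

class Foo(object):
    def __init__(self):
        self.bar = Bar()

f = Foo()
restored = pickle.loads(pickle.dumps(f))
print(restored.bar.bar)  # 'bar'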

This post is part of a series about diffoscope development for the Outreachy Project. Diffoscope is a comparison tool, part of the Reproducible Builds effort. In the following weeks, I’ll be writing about development details, techniques and general discussion around diffoscope.
