Intro to Multiprocessing in Python from a guy with almost 12 months of rock hard technical training.

Sherman Lee · Analytics Vidhya · Jul 7, 2020

Good day to you and thank you for reading my first post. In this tutorial I will go over a few methods of multiprocessing as well as highlight a few use cases that I have come across.

Suppose we are looking to scrape and clean data from various sources. We could do all our tasks sequentially, or we could run them concurrently. Realistically, you should be asking yourself this: would you rather tire your hands by eating your gummy bears one by one, or would you rather just stuff the whole pack into your mouth and let your useless jaw do the work?

First, you need to install the multiprocessing package using pip

pip install multiprocessing

Just kidding, it has shipped with every standard version of Python since 2.6.

So simply import the classes that we will be using.

from multiprocessing import Process, Pool, Manager, Queue

Next, let us write a dummy function to demonstrate how to use some of the basic functionalities in multiprocessing.

def doComplexCalculation(num):
    otherNum = 9
    if num == 10:
        return 21

    return num + otherNum

Should we need to perform doComplexCalculation numerous times, we can always create a pool of workers to carry out the task.

pool = Pool(5)
result = pool.map(doComplexCalculation, [25, 42, 69])
print(result)
# [34, 51, 78]
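
One housekeeping note from me: a Pool keeps its worker processes alive until you release them. A common pattern (a small sketch of my own, not part of the original snippet) is to use the pool as a context manager so it gets shut down for you:

from multiprocessing import Pool

# same doComplexCalculation as defined above
def doComplexCalculation(num):
    otherNum = 9
    if num == 10:
        return 21
    return num + otherNum

if __name__ == "__main__":
    # the with-block shuts the pool down automatically once the work is done
    with Pool(5) as pool:
        result = pool.map(doComplexCalculation, [25, 42, 69])
    print(result)  # [34, 51, 78]

The if __name__ guard matters on platforms that spawn fresh interpreters (like Windows), since each worker re-imports this module.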

Simple enough right? Now what if your function requires special keyword arguments?

def doEvenMoreComplexStuff(num, isMutable=False, isDirty=True):
    otherNum = 9
    if isMutable:
        otherNum += 6
    if not isDirty:
        num += 6
    return num + otherNum

For this, the more straightforward method would be to create a Process object for each call and run those pseudo-simultaneously.

processes = []
nums = [25, 42, 69]
for i in range(3):
    p = Process(
        target=doEvenMoreComplexStuff,
        kwargs={"isMutable": True, "isDirty": False, "num": nums[i]},
    )
    processes.append(p)

for p in processes:
    p.start()

for p in processes:
    p.join()

I will now break down the code for your pretty little head to understand. This first piece of code generates a Process object and pairs a function with its required arguments. Do note that when declaring the target, avoid adding “()” at the end of the function name, as that tells Python to call the function right there on that line.

p = Process(
    target=doEvenMoreComplexStuff,
    kwargs={"isMutable": True, "isDirty": False, "num": nums[i]},
)
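
To make that pitfall concrete, here is a quick before-and-after of my own (assuming the same imports and doEvenMoreComplexStuff from above):

# Wrong: the "()" calls the function immediately in the main process,
# and its return value (a plain number) becomes the target.
p = Process(target=doEvenMoreComplexStuff(num=25, isMutable=True, isDirty=False))

# Right: pass the function object itself and let the Process call it later.
p = Process(
    target=doEvenMoreComplexStuff,
    kwargs={"isMutable": True, "isDirty": False, "num": 25},
)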

The method .start() basically “starts” the process, whereas .join() ensures that the processes you have started are completed before the rest of the main code is executed. We separate the start and join calls into two loops because if they ran in the same for-loop, the interpreter would simply wait for the previous process to join before starting the next, which essentially defeats the purpose of multiprocessing.

for p in processes:
    p.start()

for p in processes:
    p.join()
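
For contrast, this is the sequential trap I just described (purely illustrative, do not do this):

# Anti-pattern: starting and joining in the same loop means each process
# must finish before the next one even starts, so nothing runs in parallel.
for p in processes:
    p.start()
    p.join()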

Lastly, the astute reader (or those weirdos that actually read the full article) might realise that you are unable to retrieve the values returned by these functions. And you would be exactly right!

You might be thinking that you could just pass a dictionary or list in as an argument to store the values. However, these functions are executed in a separate memory space from the main process, so you would be unable to retrieve the returned values using the usual dictionary or list. Instead, you would need to use the Manager or Queue class.
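
To see the problem for yourself, here is a tiny sketch of mine (the writeResult helper and plainDict name are made up for illustration). The child process mutates its own copy of the dictionary, so the parent never sees the change:

from multiprocessing import Process

def writeResult(results):
    # this update happens in the child process's copy of the dict
    results["answer"] = 42

if __name__ == "__main__":
    plainDict = {}
    p = Process(target=writeResult, args=(plainDict,))
    p.start()
    p.join()
    print(plainDict)  # {} -- the parent's dict is untouched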

A Queue is used when you are not particularly interested in the order in which your results are returned. A Queue also carries less overhead than a Manager class.

A Manager class is used when you are interested in the order in which objects are returned. This is particularly useful if, for example, you are required to upload files into an S3 bucket in a specific order. I personally use the dict function, but feel free to experiment with the other functions in the Manager class.

Do note that the Manager class spawns a separate server process of its own, while a Queue does not, so take that overhead into consideration.

To implement either a Manager.dict() or Queue, we need to add a keyword argument into our function.

## Implementation of a Manager.dict()
def doComplexCalculation(num, queue=None):
    otherNum = 9
    if num == 10:
        return 21
    if queue is not None:
        queue['complexCalculation'] = num + otherNum

    return num + otherNum

## Implementation of a Queue
def doComplexCalculation(num, queue=None):
    otherNum = 9
    if num == 10:
        return 21
    if queue is not None:
        queue.put(num + otherNum)

    return num + otherNum
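
If you go the Queue route, you collect the results back in the main process after the workers are done. Here is a minimal sketch of my own wiring, reusing the Queue version of doComplexCalculation just above:

from multiprocessing import Process, Queue

if __name__ == "__main__":
    q = Queue()
    processes = [
        Process(target=doComplexCalculation, kwargs={"num": n, "queue": q})
        for n in [25, 42, 69]
    ]
    for p in processes:
        p.start()
    # drain the queue before joining -- one result per worker
    results = [q.get() for _ in processes]
    for p in processes:
        p.join()
    print(results)  # e.g. [34, 51, 78], in whatever order the workers finished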

Now to put it all together.

from multiprocessing import Process, Manager

## Implementation of a Manager.dict()
def doComplexCalculation(num, queue=None):
    otherNum = 9
    if num == 10:
        result = 21
    else:
        result = num + otherNum
    if queue is not None:
        queue[f'complexCalculation for {num}'] = result
    return result

if __name__ == "__main__":  # required on platforms that spawn new processes (e.g. Windows)
    q = Manager().dict()
    processes = []
    process1 = Process(target=doComplexCalculation, kwargs={'num': 42, 'queue': q})
    process2 = Process(target=doComplexCalculation, kwargs={'num': 69, 'queue': q})
    process3 = Process(target=doComplexCalculation, kwargs={'num': 10, 'queue': q})
    processes.append(process1)
    processes.append(process2)
    processes.append(process3)
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(q)
"""
{'complexCalculation for 42': 51, 'complexCalculation for 69': 78, 'complexCalculation for 10': 21}
(the key order may vary depending on which process finishes first)
"""

That’s all she wrote!
