Python multiprocessing and why it’s not always the best solution
Hello World!
Today we are going to take a look at Python multiprocessing, and in particular at the Pool class of the multiprocessing module.
Let’s start by defining a simple function:
This function takes a word as input, removes the punctuation characters and then, if the word is longer than 5 characters, returns it.
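The listing itself isn’t shown here, so this is only a minimal sketch of what such a function might look like. The name clean_word and the choice of str.translate with string.punctuation are my assumptions, as is returning None for short words.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_word(word):
    """Strip punctuation from `word`; return it only if longer than 5 characters.

    Hypothetical name and behaviour: shorter words yield None.
    """
    cleaned = word.translate(_PUNCT_TABLE)
    if len(cleaned) > 5:
        return cleaned
    return None
```

With this version, short words show up as None in the output, so you would need to filter them out afterwards if you only want the long ones.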
Now we need some text to use with this function:
If you take a closer look at this line of code, you’ll see that we are repeating this string 100 times. We are doing it to increase the running time of our code.
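As a rough illustration of the idea, something like the following would work. The sentence is a placeholder I made up, not the text used for the timings in this post.

```python
# Placeholder sentence (an assumption, not the article's original text).
# The trailing space keeps the last and first words apart when repeating.
sentence = "Today we are going to take a look at Python multiprocessing, and in particular at Pool. "

# Repeat the sentence 100 times to give the code more work to do.
text = sentence * 100

# Splitting on whitespace gives the list of words we will process.
words = text.split()
print(len(words))
```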
Now we can work on the multiprocessing code and also compare it with some alternatives:
The first thing we need to do is to define a main block, guarded by if __name__ == "__main__". This part is needed for our multiprocessing to work: on platforms that start workers by spawning a fresh interpreter (Windows, for example), each worker re-imports the script, and without the guard it would try to create the pool all over again. After that we have written two other ways to use the same function to do the job.
The first one is a classic for loop with an external output list named result. In this case we loop over all the words contained in the text (using the string split method) and append the result of our function to the output list.
The second one is done by using a list comprehension, but the process is the same as before. We loop over every word and pass it to our function.
Finally we have the multiprocessing one. As you can see in the code, we create a multiprocessing Pool of 4 processes (one per core of my laptop) and then we call our function through the map method of the Pool class.
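Putting the three approaches side by side, a sketch of the comparison could look like this. The function names, the placeholder text and the use of time.perf_counter for timing are my assumptions; the numbers you get will depend on your machine.

```python
import string
import time
from multiprocessing import Pool

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_word(word):
    """Hypothetical stand-in for the function described earlier."""
    cleaned = word.translate(_PUNCT_TABLE)
    if len(cleaned) > 5:
        return cleaned
    return None


def with_loop(words):
    # Classic for loop with an external output list named result.
    result = []
    for word in words:
        result.append(clean_word(word))
    return result


def with_comprehension(words):
    # Same work, expressed as a list comprehension.
    return [clean_word(word) for word in words]


def with_pool(words, processes=4):
    # Pool.map splits the words across the worker processes.
    with Pool(processes) as pool:
        return pool.map(clean_word, words)


if __name__ == "__main__":
    # Placeholder text (an assumption), repeated 100 times.
    text = "Today we are going to take a look at Python multiprocessing! " * 100
    words = text.split()

    for label, func in (
        ("for loop", with_loop),
        ("list comprehension", with_comprehension),
        ("pool", with_pool),
    ):
        start = time.perf_counter()
        func(words)
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f} s")
```

All three functions return the same list, in the same order, so the only thing that changes between them is how long they take.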
If we run this code we can see the timing of our 3 examples:
As you can see, the fastest way to do this task is the list comprehension and the slowest is our Pool. If you don’t know how multiprocessing works you are probably asking yourself: why is it slower? I’m not going into detail in this article (you can look it up on the internet if you are interested), but the short answer is that our example is too simple to benefit from multiprocessing. The work done on each word is tiny, while the pool has to start the worker processes and send every argument and every result between them. You’ll see an improvement in your timings when you apply multiprocessing to a CPU-heavy task; otherwise the additional steps needed to create and manage the pool will only slow down your code!
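To get a feeling for the other side of the story, here is a small sketch with a deliberately slow, CPU-bound function. Everything in it (count_primes, the limits, the 4 processes) is made up for illustration and is not part of the example in this post. I’m not showing timings because they depend heavily on the machine, but the heavier each task is, the more the pool overhead should fade into the background.

```python
import time
from multiprocessing import Pool


def count_primes(limit):
    """Count the primes below `limit` by trial division (slow on purpose)."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in [2, sqrt(n)] divides it.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_sequential(limits):
    return [count_primes(limit) for limit in limits]


def run_pool(limits, processes=4):
    with Pool(processes) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # Eight identical tasks; raise the limit to make each one heavier.
    limits = [50_000] * 8

    for label, func in (("sequential", run_sequential), ("pool", run_pool)):
        start = time.perf_counter()
        func(limits)
        print(f"{label}: {time.perf_counter() - start:.3f} s")
```

Try playing with the limit: with very small values the sequential version should win again, for the same reason our word example did.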
Now that we know this, we can have a look at how to use Pool to manage functions with more than one parameter. The first thing we need to do is change our function:
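Again as a sketch only: the function could take the length threshold as a second parameter instead of hard-coding 5. The names clean_word and min_length are my assumptions.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_word(word, min_length):
    """Strip punctuation; return the word only if longer than `min_length`.

    Hypothetical two-parameter version of the earlier function.
    """
    cleaned = word.translate(_PUNCT_TABLE)
    if len(cleaned) > min_length:
        return cleaned
    return None
```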
And now we can edit the main part:
In this case we are using the starmap method, which accepts an iterable of argument tuples and unpacks each tuple into the parameters of our function. Using a list comprehension we can create our list of tuples, each containing a word and the length of the words we want to keep.
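A minimal sketch of how this could fit together is below. Pool.starmap is the real method from the multiprocessing module; clean_word, build_args, min_length and the placeholder text are names and values I’m assuming for the example.

```python
import string
from multiprocessing import Pool

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_word(word, min_length):
    """Hypothetical two-parameter function: keep words longer than min_length."""
    cleaned = word.translate(_PUNCT_TABLE)
    if len(cleaned) > min_length:
        return cleaned
    return None


def build_args(words, min_length):
    # One (word, min_length) tuple per call to clean_word.
    return [(word, min_length) for word in words]


if __name__ == "__main__":
    # Placeholder text (an assumption), repeated 100 times.
    text = "Today we are going to take a look at Python multiprocessing! " * 100
    words = text.split()

    with Pool(4) as pool:
        # starmap unpacks each tuple: clean_word(word, min_length).
        result = pool.starmap(clean_word, build_args(words, 5))

    print(result[:10])
```

Since the second argument is the same for every word here, you could also fix it with functools.partial and keep using map; starmap becomes more useful when the extra arguments change from call to call.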
That’s all for today! As usual, you’ll find the code for this example on my GitHub. I’ve also updated my website this week, so have a look if you’re interested.