Multiprocessing for Data Scientists in Python
Why pay for a powerful CPU if you can’t use all of it?
That’s a lot of money to be spending on a CPU.
And if you can’t utilize it to its fullest extent, why even have it?
Multiprocessing lets us use our CPUs to their fullest extent. Instead of running programs line by line, we can run multiple segments of code at once, or the same segment of code multiple times in parallel. And when we do this, we can split the work among multiple cores in our CPU, meaning we can complete calculations much faster.
And luckily for us, Python has a built-in multiprocessing library.
The main feature of the library is the Process class. When we instantiate Process, we pass it two arguments: target, the function we want it to compute, and args, the arguments we want to pass to that target function.
process = multiprocessing.Process(target=func, args=(x, y, z))
After we instantiate the class, we can start it with the .start() method.
On Unix-based operating systems, i.e., Linux, macOS, etc., when a process finishes but has not been joined, it becomes a zombie process. We can resolve this by calling the .join() method, which waits for the process to finish and cleans it up.
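Putting the pieces together, here is a minimal sketch. The Queue for returning the result and the explicit fork start method (Unix-only) are my additions for illustration:

```python
import multiprocessing

def func(x, y, z, queue):
    # Runs in the child process; send the result back through the queue.
    queue.put(x + y + z)

ctx = multiprocessing.get_context("fork")  # fork is Unix-only
queue = ctx.Queue()
process = ctx.Process(target=func, args=(1, 2, 3, queue))
process.start()       # begin running func in a separate process
result = queue.get()  # fetch the child's result
process.join()        # wait for the child to finish, avoiding a zombie
print(result)         # prints 6
```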
In this article, we will cover how to use the multiprocessing library in Python to load high-resolution images into numpy arrays much faster, and over a long enough period, save hours of computation.
But before we go on to implementing multiprocessing in a real-world example, let's make a little toy script to demonstrate how it works.
We can generate some random data to process:
import numpy as np
fake_data = np.random.random((100, 1000000))
Then, we can write some function that performs random calculations on the data:
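The original snippet is not reproduced here, so this is a sketch; the exact per-row arithmetic is my own choice, and any sufficiently expensive computation would do, as long as it includes the None check discussed next:

```python
import numpy as np

def process_data(data):
    # For each row: a (pointless) None check plus some arbitrary
    # heavy arithmetic -- the exact math here is illustrative.
    processed = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        if data[i] is None:
            continue
        processed[i] = np.sum(data[i] ** 2) * np.mean(data[i])
    return processed
```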
The if statement checking to see if the data is None is technically pointless, since np.random.random will never return None, but regardless, it represents an extra computation for us to speed up with multiprocessing.
We can evaluate how much time this function takes to compute with:
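For example, with time from the standard library. The function is redefined here (with illustrative math, and on smaller data) so the snippet runs on its own:

```python
import time
import numpy as np

def process_data(data):
    # Redefined here with illustrative per-row math so this runs standalone.
    processed = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        if data[i] is None:
            continue
        processed[i] = np.sum(data[i] ** 2) * np.mean(data[i])
    return processed

fake_data = np.random.random((100, 10000))  # smaller than the article's data

start = time.time()
process_data(fake_data)
elapsed = time.time() - start
print(f"Took {elapsed:.2f} seconds")
```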
On my Intel i7-8700K 3.70 GHz CPU, running this takes approximately 30 seconds.
That’s not very fast. And one can only imagine what it would look like if the data were bigger, or the computations more expensive.
So the question arises, why does this take so long to compute?
Let’s have a look at the CPU usage log while we are running this:
No wonder the function is taking so long; we only have a fraction of our computing power dedicated to running it.
Let’s fix that.
Using a regular ndarray to collect the results will not work, as each process has separate memory and will be unable to change the global array. Instead, we can use the SharedArray library, which backs numpy arrays with shared memory that every process can access.
We can install SharedArray with:
pip install SharedArray
and then import it with:
import SharedArray as sa
SharedArray has a few key functions:
SharedArray.create(name, shape, dtype=float) creates a shared memory array
SharedArray.attach(name) attaches a previously created shared memory array to a variable
SharedArray.delete(name) deletes a shared memory array; however, existing attachments remain valid
There are plenty of other useful features SharedArray offers, and I would recommend reading the PyPI page for documentation.
Implementing the Multiprocessing Function
Inside of the multiprocessing function, we can create a shared memory array:
Now we have to define a child function inside of multiprocess_data() that will calculate an individual row of the data. This is so that we can pass it as the target function when we create our processes later.
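A sketch of such a child function, with the same illustrative per-row math assumed earlier. It is written here with the data and the shared array as module-level stand-ins so the snippet runs standalone; inside multiprocess_data() they would be locals captured by the closure:

```python
import numpy as np

# Stand-ins for the fake data and the shared array.
data = np.random.random((4, 10))
processed = np.zeros(data.shape[0])

def process_row(i):
    # Same illustrative math as before, written into the shared
    # array so the parent process can read the result back.
    if data[i] is None:
        return
    processed[i] = np.sum(data[i] ** 2) * np.mean(data[i])
```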
Now, for each row in the fake data, we can create a new Process and start it:
And finally, after all the processes have been started, we can .join() them and return the data:
As a recap, here is the full function with multiprocessing implemented:
Now it’s time to evaluate our improvement.
Timing the new function the same way prints approximately 6 seconds.
That’s quite an improvement!
And if we have a look at the CPU usage log while running the function…
That looks a lot better!
One may ask: if my CPU says it has 12 cores, why does the process only speed up 5-6 times?
Many modern CPU architectures use hyper-threading, which means that although my OS reports 12 cores, in reality I only have half that many physical cores, while the other half are simulated.
You may also want to delete the shared memory file after the computations are finished so that no errors will be raised if you’re going to re-run your program.
You can do that with:
Multiprocessing in the Real World
Let’s say we’re in a Kaggle competition with a ton of images we need to load. Understanding Clouds from Satellite Images, let’s say. Maybe we want to make a data-generator function that spits out a batch of 300 images as numpy arrays.
Assuming we had all of the image data in a folder named “train_images,” we would write a function something along the lines of…
…to load the data.
os.listdir returns a list of everything in a directory, which in our case is the file names of all the images
cv2.imread reads an image and automatically turns it into a numpy array. You can read more about installing and using OpenCV on Wheels (cv2) here.
Of course, this function doesn’t return the labels for the images, and it gives you the same 300 images every time, so it’s not entirely practical, but that functionality is simple to add.
Then we can measure the time this function takes to compute using the same methods we had earlier, which for me is approximately 12–13 seconds.
That’s not good. If we are going to be doing this inside of a generator to pass to a prediction model, a lot of our time is going to be spent loading the images. We can’t keep all of the images loaded at once, as loading only ~300 1400x2100 images into Numpy arrays takes 20–25 GiB of RAM.
Using the skills we learned when multiprocessing our toy example, we can speed this function up a lot.
First, we can create a SharedArray for our data:
Next, we do something a little different. Since we don’t have infinite CPU cores, creating more processes than we have cores will ultimately slow down the function. To resolve this, we can create a fixed number of workers, each of which will load a set amount of the images. You can change the number of workers to fit the specifications of your CPU.
After that, we can create a target function that each worker will compute; the function will take a starting index, i, and the number of images to load:
Then, we can start a new process for every worker and assign it worker_amount images to load. worker_amount * worker_num gives us the index at which to start loading the next set of images.
And finally, we can .join() each process and return the data we generated.
As a quick recap, here is the full function we just wrote:
Now using the same method we’ve used before, we can time our function, which, for me, takes a bit less than 2 seconds.
If we were to do this hundreds, or even thousands, of times, saving 11 seconds per loading session equates to anywhere between 18 minutes and 3 hours of saved time, for 100 and 1,000 runs respectively.
The example above is reflective of not only how beneficial multiprocessing can be, but of how important it is for us to optimize the calculations we do most frequently.
A for loop used inefficiently will, over time and on a large enough scale, cost a company hundreds of hours and thousands of dollars.
Even with all the benefits of multiprocessing, we can still go faster. This brilliant article, by George Seif, explains how to accelerate data science using your GPU. Your GPU is specifically designed for parallel computation and, given a large enough dataset, can be orders of magnitude faster than your CPU.
GPUs, however, are generally harder to work with and are beyond the scope of this article.
And as always, the code for this article is available here, on my GitHub.
Documentation and Sites:
- Multiprocessing docs: https://docs.python.org/3.7/library/multiprocessing.html
- SharedArray PyPI: https://pypi.org/project/SharedArray/
- Open CV on Wheels (cv2) GitHub: https://github.com/cancan101/opencv-python
- Pandarallel GitHub: https://github.com/nalepae/pandarallel