Multiprocessing for Data Scientists in Python

Why pay for a powerful CPU if you can’t use all of it?

Sebastian Theiler
Oct 2 · 6 min read

An Intel i9–9900K with 8 cores ranges from $450 to $500

That’s a lot of money to be spending on a CPU.
And if you can’t utilize it to its fullest extent, why even have it?

Multiprocessing lets us use our CPUs to their fullest extent. Instead of running programs line-by-line, we can run multiple segments of code at once, or the same segment of code multiple times in parallel. And when we do this, we can split the work among multiple cores in our CPU, meaning we can complete calculations much faster.

And luckily for us, Python has a built-in multiprocessing library.

The main feature of the library is the Process class. When we instantiate Process, we pass it two arguments: target, the function we want it to run, and args, the arguments we want to pass to that target function.

import multiprocessing
process = multiprocessing.Process(target=func, args=(x, y, z))

After we instantiate the class, we can start it with the .start() method.

process.start()

On Unix-based operating systems, e.g., Linux, macOS, etc., a process that finishes but has not been joined becomes a zombie process. We can resolve this with process.join().
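Putting these pieces together, a minimal end-to-end example might look like this (func and its arguments here are just placeholders):

import multiprocessing

def func(x, y, z):
    print(x + y + z)

if __name__ == '__main__':
    process = multiprocessing.Process(target=func, args=(1, 2, 3))
    process.start()
    process.join()  # wait for the process to finish so it is not left as a zombie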

In this article, we will cover how to use the multiprocessing library in Python to load high-resolution images into numpy arrays much faster, and over a long enough period, save hours of computation.


But before we go on to implementing multiprocessing in a real-world example, let’s make a little toy-script to demonstrate how it works.

We can generate some random data to process:

import numpy as np
fake_data = np.random.random((100, 1000000))

Then, we can write some function that performs random calculations on the data:
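The exact math doesn't matter; a sketch along these lines will do (the function name and the arithmetic are stand-ins):

import numpy as np

def process_data(data):
    processed_data = np.zeros(data.shape)
    for i, row in enumerate(data):
        for j, data_point in enumerate(row):
            if data_point is None:  # never true, but adds work
                continue
            processed_data[i, j] = data_point * data_point
    return processed_data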

The if statement checking to see if data_point is None is technically pointless since np.random.random will never return None, but regardless, it represents an extra computation for us to speed up with multiprocessing.

We can evaluate how much time this function takes to compute with:
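Assuming the definitions above, something like:

import time

start = time.time()
process_data(fake_data)
print(time.time() - start)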

On my Intel i7–8700K 3.70GHz CPU, running this takes approximately 30 seconds.

That’s not very fast. And one can only imagine what it would look like if the data were bigger or the computations more expensive.

So the question arises: why does this take so long to compute?

Let’s have a look at the CPU usage log while we are running this:

A graph of CPU usage. One CPU core is at 100%, while the others remain low
One CPU core is at 100% usage, while the others sit around at less than 20% usage.

No wonder the function is taking so long; only a fraction of our computing power is dedicated to running it.

Let’s fix that.

Before we start, we need to import one more module. SharedArray by tenzing is a module for creating Numpy arrays that can be accessed by different processes on a computer.

Using a regular ndarray will not work: each process has its own separate memory, so it cannot modify the global array.

We can install SharedArray with:

pip install SharedArray

and then import it with:

import SharedArray

SharedArray has a few key functions:

  • SharedArray.create(name, shape, dtype=float) creates a shared memory array
  • SharedArray.attach(name) attaches a previously created shared memory array to a variable
  • SharedArray.delete(name) deletes a shared memory array; however, existing attachments remain valid

There are plenty of other useful features SharedArray offers, and I would recommend reading the PyPI page for documentation.
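As a quick illustration of the round trip (the array name 'test' is arbitrary):

import SharedArray

a = SharedArray.create('test', 10)  # a 10-element shared float array
a[0] = 42
b = SharedArray.attach('test')      # a second view of the same memory
print(b[0])                         # 42.0
SharedArray.delete('test')          # a and b remain usable until released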

Inside of the multiprocessing function, we can create a shared memory array:
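Something along these lines, where the variable name processed_data is an assumption and the array name 'data' matches the delete call later on:

def multiprocess_data(data):
    # Shared-memory array that every child process can write into
    processed_data = SharedArray.create('data', data.shape)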

Now we have to define a child-function inside of multiprocess_data() that will calculate an individual row of the data.
This is so that we can pass multiprocessing.Process a target function when we create our processes later.
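A sketch of that child-function, still inside multiprocess_data() (process_row is an assumed name):

    def process_row(i, row):
        for j, data_point in enumerate(row):
            if data_point is None:  # same pointless check as before
                continue
            processed_data[i, j] = data_point * data_point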

Again, these are the same useless calculations performed as a computing test.

Now, for each row in the fake data, we can create a new Process and start it:
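Continuing inside multiprocess_data(), roughly:

    processes = []
    for i, row in enumerate(data):
        process = multiprocessing.Process(target=process_row, args=(i, row))
        process.start()
        processes.append(process)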

And finally, after all the processes have been started, we can .join() them, and return the data:
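For example:

    for process in processes:
        process.join()
    return processed_data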

As a recap, here is the full function with multiprocessing implemented:
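A self-contained sketch (note that passing a nested function as the target relies on the fork start method, the default on Linux):

import multiprocessing
import SharedArray

def multiprocess_data(data):
    # Shared-memory array that every child process can write into
    processed_data = SharedArray.create('data', data.shape)

    def process_row(i, row):
        # The same throwaway calculation as the single-process version
        for j, data_point in enumerate(row):
            if data_point is None:
                continue
            processed_data[i, j] = data_point * data_point

    # One process per row of the data
    processes = []
    for i, row in enumerate(data):
        process = multiprocessing.Process(target=process_row, args=(i, row))
        process.start()
        processes.append(process)

    # Wait for every process to finish
    for process in processes:
        process.join()

    return processed_data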

Now it’s time to evaluate our improvement.

Running…
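The timing harness is the same as before, now wrapped in a __main__ guard (assuming the definitions above):

import time

if __name__ == '__main__':
    start = time.time()
    multiprocess_data(fake_data)
    print(time.time() - start)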

See the multiprocessing programming guidelines for why you need if __name__ == “__main__”

…prints approximately 6 seconds.

That’s quite an improvement!

And if we have a look at the CPU usage log while running the function…

A graph of CPU usage. All CPU cores go to 100% in a plateaued arc
All cores are being used to their maximum extent

That looks a lot better!

One may ask: if my CPU says it has 12 cores, why does the process only speed up 5–6 times?
Many modern CPUs use hyper-threading, which means that although the OS sees 12 logical cores, only 6 of them are physical; the other 6 are simulated.

You may also want to delete the shared memory array after the computations are finished, so that no errors are raised when you re-run your program.
You can do that with:

SharedArray.delete('data')

Multiprocessing in the Real World

Let’s say we’re in a Kaggle competition with a ton of images we need to load. Understanding Clouds from Satellite Images, let’s say. Maybe we want to make a data-generator function that spits out a batch of 300 images in ndarray form.

Assuming we had all of the image data in a folder named “train_images,” we would write a function something along the lines of…
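A sketch of such a function (the name load_images and the batch handling are assumptions):

import os
import cv2
import numpy as np

def load_images(batch_size=300):
    # File names of every image in the training folder
    filenames = os.listdir('train_images')[:batch_size]
    # Pre-allocate one array: batch_size images, each 1400x2100 with 3 channels
    data = np.zeros((batch_size, 1400, 2100, 3))
    for i, filename in enumerate(filenames):
        data[i] = cv2.imread(os.path.join('train_images', filename))
    return data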

The images in the dataset are 1400x2100 and in RGB

…to load the data.

  • os.listdir returns a list of everything in a directory, which in our case is the file names of all the images
  • cv2.imread reads an image and automatically turns it into a Numpy array. You can read more about installing and using OpenCV on Wheels (cv2) here.

Of course, this function doesn’t return the labels for the images, and it gives you the same 300 images every time, so it’s not entirely practical, but that functionality is simple to add.

Then we can measure the time this function takes using the same method as earlier, which for me is approximately 12–13 seconds.

That’s not good. If we are going to be doing this inside of a generator to pass to a prediction model, a lot of our time is going to be spent loading the images. We can’t keep all of the images loaded at once, as loading only ~300 1400x2100 images into Numpy arrays takes 20–25 GiB of RAM.

Using the skills we learned when multiprocessing our toy example, we can speed this function up a lot.

First, we can create a SharedArray for our data:
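For instance (the function and array names are assumptions):

def load_images_multiprocessed(batch_size=300):
    filenames = os.listdir('train_images')[:batch_size]
    # Shared-memory array that every worker process can write into
    data = SharedArray.create('image_data', (batch_size, 1400, 2100, 3))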

Next, we do something a little different. Since we don’t have infinite CPU cores, creating more processes than we have cores will ultimately slow down the function. To resolve this, we can create workers, which will each have a set amount of images to load, worker_amount.
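Inside the loading function, that might look like:

    workers = 12                           # one per logical core on the author's CPU
    worker_amount = batch_size // workers  # images each worker will load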

You can change the number of workers to fit the specifications of your CPU.

After that, we can create a target function that each worker will compute; the function will take a starting index, i, and the number of images to load, n.
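Still inside the loading function, a sketch (load_batch is an assumed name):

    def load_batch(i, n):
        # Load n images into the shared array, starting at index i
        for j in range(i, i + n):
            data[j] = cv2.imread(os.path.join('train_images', filenames[j]))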

Then, we can start a new process for every worker, and assign it worker_amount images to load.
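Roughly:

    processes = []
    for worker_num in range(workers):
        process = multiprocessing.Process(
            target=load_batch,
            args=(worker_amount * worker_num, worker_amount))
        process.start()
        processes.append(process)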

worker_amount*worker_num gives us the index at which to start loading the next set of images.

And finally, we can .join() each process, and return the data we generated.
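For example:

    for process in processes:
        process.join()
    return data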

As a quick recap, here is the full function we just wrote:
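Again as a sketch, under the same fork-start-method caveat as the toy example, and assuming batch_size divides evenly among the workers:

import multiprocessing
import os
import cv2
import SharedArray

def load_images_multiprocessed(batch_size=300):
    filenames = os.listdir('train_images')[:batch_size]
    # Shared-memory array that every worker process writes into
    data = SharedArray.create('image_data', (batch_size, 1400, 2100, 3))

    workers = 12                           # adjust to your CPU
    worker_amount = batch_size // workers  # images each worker will load

    def load_batch(i, n):
        # Load n images into the shared array, starting at index i
        for j in range(i, i + n):
            data[j] = cv2.imread(os.path.join('train_images', filenames[j]))

    processes = []
    for worker_num in range(workers):
        process = multiprocessing.Process(
            target=load_batch,
            args=(worker_amount * worker_num, worker_amount))
        process.start()
        processes.append(process)

    for process in processes:
        process.join()

    return data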

Now using the same method we’ve used before, we can time our function, which, for me, takes a bit less than 2 seconds.

If we were doing this hundreds or even thousands of times, saving 11 seconds per loading session adds up quickly: roughly 18 minutes saved over 100 runs, and about 3 hours over 1,000 runs.

Conclusion

The example above is reflective of not only how beneficial multiprocessing can be, but of how important it is for us to optimize the calculations we do most frequently.

A single for loop used inefficiently can, over time and at a large enough scale, cost a company hundreds of hours and thousands of dollars.


Even with all the benefits of multiprocessing, we can still go faster. This brilliant article by George Seif explains how to accelerate data science using your GPU. Your GPU is specifically designed for parallel computation and, given a large enough dataset, can be orders of magnitude faster than your CPU.

GPUs, however, are generally harder to work with and are beyond the scope of this article.


I would also recommend checking out the library Pandarallel by Manu NALEPA, which allows multiprocessing on Pandas dataframes.
Sadly, this library is only available on Unix-based operating systems, due to its backend.


And as always, the code for this article is available here, on my GitHub.
