Multiprocessing for Data Scientists in Python
Why pay for a powerful CPU if you can’t use all of it?
An Intel i9-9900K with 8 cores ranges from $450 to $500.
That’s a lot of money to be spending on a CPU.
And if you can’t utilize it to its fullest extent, why even have it?
Multiprocessing lets us use our CPUs to their fullest extent. Instead of running a program line by line on a single core, we can run multiple segments of code at once, or the same segment of code multiple times in parallel. When we do this, the work is split among the cores of our CPU, meaning we can perform calculations much faster.
And luckily for us, Python has a built-in multiprocessing library.
The main feature of the library is the Process class. When we instantiate Process, we pass it two arguments: target, the function we want it to run, and args, the arguments we want to pass to that target function.
import multiprocessing
process = multiprocessing.Process(target=func, args=(x, y, z))
After we instantiate the class, we can start it with the .start() method.
process.start()
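Note that .start() returns immediately rather than waiting for the function to finish: the child process runs concurrently with the parent. A minimal sketch of this (the slow_task worker is an assumption for illustration):

```python
import multiprocessing
import time

def slow_task():
    time.sleep(0.5)  # simulate half a second of work in the child

if __name__ == "__main__":
    process = multiprocessing.Process(target=slow_task)
    process.start()
    # start() does not block: the parent keeps running while the child sleeps.
    print(process.is_alive())  # True while the child is still working
    process.join()             # now wait for the child to finish
    print(process.is_alive())  # False once the child has exited
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn child processes by re-importing the main module (such as Windows and macOS by default), omitting it can cause processes to be created recursively.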
On Unix-based operating systems (e.g., Linux and macOS), when a process finishes but has not been joined, it becomes a zombie process. We can resolve this by calling process.join(), which waits for the process to terminate and cleans it up.
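Putting these pieces together, here is a minimal end-to-end sketch that starts several processes, joins them all, and collects their results through a multiprocessing.Queue (the square worker function is an assumption for illustration):

```python
import multiprocessing

def square(n, queue):
    # Compute in a separate process and send the result back to the parent.
    queue.put(n * n)

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(target=square, args=(n, queue))
        for n in range(4)
    ]
    for p in processes:
        p.start()   # launch each worker in its own process
    for p in processes:
        p.join()    # wait for completion; no zombies left behind
    # Results arrive in whatever order the workers finish, so we sort them.
    results = sorted(queue.get() for _ in processes)
    print(results)  # [0, 1, 4, 9]
```

Because the four workers run in separate processes, they can execute on separate CPU cores at the same time, which is exactly the parallelism the library is built for.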