Multiprocessing Made Easy(ier) with Databricks

Farooq Mahmud · Published in Analytics Vidhya · 4 min read · Jul 28, 2020

I read somewhere that when dealing with computation problems involving billions of things, it’s a good idea to divide the work amongst multiple processes.

You can also use multiple threads, but I’ll stick to multiprocessing since it more closely reflects how Databricks works. (For CPU-bound Python code like this, separate processes also sidestep the GIL, which threads cannot.)

Databricks is, at its core, a parallel processing platform. Some problems that at first blush do not appear to be a good fit for Databricks can be a great fit if you frame them differently.

Consider the problem of estimating pi using a Monte Carlo simulation. You can think of the estimation method this way:

  1. Throw a dart at a random spot on a square dartboard that has a circle inscribed in it.
  2. If the dart lands inside the circle, you get 1 point.
  3. Repeat steps 1 & 2 until you’re sick of it.
  4. Add up your points, multiply by 4, and divide by the number of throws. This gives you an estimate of pi.

The more darts you throw, the better the estimate. This works because a uniformly thrown dart lands inside the circle with probability (circle area) / (square area) = πr² / (2r)² = π/4, so 4 × hits / throws converges to pi.

Serial Implementation

The code below shows a serial implementation:
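(A minimal sketch; the function name estimate_pi_serial and the timing harness are my own, not the original listing.)

    import random
    import time

    def estimate_pi_serial(num_throws):
        """Estimate pi by throwing darts at a square with an inscribed circle."""
        hits = 0
        for _ in range(num_throws):
            # Throw a dart at a random spot on the 2x2 square centered at the origin.
            x = random.uniform(-1.0, 1.0)
            y = random.uniform(-1.0, 1.0)
            # Score a point if the dart lands inside the circle of radius 1.
            if x * x + y * y <= 1.0:
                hits += 1
        return 4.0 * hits / num_throws

    if __name__ == "__main__":
        start = time.perf_counter()
        print(estimate_pi_serial(10_000_000))
        print(f"Elapsed: {time.perf_counter() - start:.1f}s")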

On my workstation, the execution times grow linearly with the number of throws, which is not surprising since the serial algorithm is O(n). Running on one core, the algorithm cries uncle at 1 billion throws.

Parallel Implementation using Multiprocessing

Python has a cool multiprocessing module that is built for divide-and-conquer types of problems. So how would you alter the serial code so that it can run across multiple processes?

Here is what I decided to do:

  1. Divide the number of iterations by the number of processes I want to spawn. This is how many darts each process throws.
  2. Spawn the appropriate number of processes.
  3. Tell each process to throw the darts and keep track of its score.
  4. When all processes have finished, add up the number of hits, multiply by 4, and divide by the total throws to get the estimate of pi.

Multiprocessing makes the estimation for 1 billion throws much more tractable. The gains are roughly linear in the number of processes, but in my case there is not much benefit beyond 4 processes. My machine has 4 physical cores (8 logical cores with hyperthreading), which is why 4 is the point of diminishing returns.

Here is the code:
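(Again a sketch rather than the original listing; the names count_hits and estimate_pi_parallel are my own.)

    import random
    from multiprocessing import Pool

    def count_hits(num_throws):
        """Throw darts in a single process and return the number of circle hits."""
        hits = 0
        for _ in range(num_throws):
            x = random.uniform(-1.0, 1.0)
            y = random.uniform(-1.0, 1.0)
            if x * x + y * y <= 1.0:
                hits += 1
        return hits

    def estimate_pi_parallel(total_throws, num_processes):
        # Each process throws an equal share of the darts.
        throws_per_process = total_throws // num_processes
        with Pool(processes=num_processes) as pool:
            # Pool.map runs count_hits in each worker and collects the results.
            per_process_hits = pool.map(
                count_hits, [throws_per_process] * num_processes
            )
        return 4.0 * sum(per_process_hits) / (throws_per_process * num_processes)

    if __name__ == "__main__":
        print(estimate_pi_parallel(1_000_000_000, 4))

The __main__ guard matters here: on platforms that spawn worker processes rather than fork, multiprocessing re-imports the module, and the guard keeps workers from re-running the top-level code.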

Here are the key takeaways from the code:

  1. The estimation algorithm has been restated in such a way that multiple processes can perform an estimation without interfering with each other. In other words, there is no shared state.
  2. The Pool object takes care of collecting the per-process results so that we can sum them safely.

Parallel Implementation Using Databricks

Multiprocessing has helped, but there is a severe limitation: this code only runs on one physical machine! What if we wanted to harness the computing power of Azure? Why not consider Databricks?

With one executor, an estimation for 1 billion throws completed in 282 seconds. Note that Databricks created 4 tasks for this, presumably because each executor has 4 physical cores.

With 8 executors, Databricks completes the estimation in only 33 seconds! That’s about 8 times faster! You can try this experiment yourself with varying numbers of executors and verify that the speedup is close to linear.

Here is the Databricks notebook code:
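(A sketch of what the notebook cell can look like, assuming the SparkContext that Databricks notebooks expose as sc; TOTAL_THROWS and the partition count of 32 are illustrative.)

    import random

    TOTAL_THROWS = 1_000_000_000
    NUM_PARTITIONS = 32  # tune to the cluster; each partition becomes one Spark task

    def throw_dart(_):
        """Return 1 if a random dart lands inside the inscribed circle, else 0."""
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        return 1 if x * x + y * y <= 1.0 else 0

    # Build an RDD with one element per throw, score each throw on the
    # executors, and sum the hits back on the driver.
    hits = (
        sc.parallelize(range(TOTAL_THROWS), NUM_PARTITIONS)
          .map(throw_dart)
          .sum()
    )

    print(4.0 * hits / TOTAL_THROWS)

More executors simply means more of those partition-sized tasks run at the same time, which is where the near-linear speedup comes from.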

Here are the key takeaways from the code:

  1. Like the multiprocessing example, the estimation algorithm has been restated in such a way that multiple executors can perform an estimation without interfering with each other. In other words, there is no shared state.
  2. Databricks takes care of sending the code to the executors as well as ensuring the code runs.
  3. There are no new APIs to learn. The code is vanilla Databricks.
  4. In order to fit the Databricks programming paradigm, an RDD containing the result of each throw is constructed.

Conclusion

If you’re comfortable with Databricks, consider it for CPU-bound parallel computations as well. You might be pleasantly surprised!

Thanks for reading!


I am a software engineer at Marel, an Icelandic company that makes machines for meat, fish, and poultry processing.