# Perfecting Parallel Python Programming

Use Dask for the Task and Bask in the Fast

In my last post I compared four different approaches to performing a task in Python, with NumPy and vectorized Python dramatically outperforming for loops and list comprehensions. But what if there were a way to go even faster, particularly on really large, complex problems?

If you go back to my blog post on recursive functions in Python, you’ll recall that one of the challenges of Python is its single-threaded nature. When you run a loop, a recursive function, or any other plain Python code, by default it executes one command at a time and waits until that completes before executing the next. To avoid these kinds of issues, we use packages like NumPy, which take Python commands and execute them at a lower level. This works great provided that a) NumPy supports the operation you’re trying to perform, and b) the data you’re working with are small enough to fit into system memory. But when we get to highly customized operations, or really big data, things start to fall apart, and historically that forced us onto entirely different platforms like Hadoop/Spark for parallel operations on really large data. Today, thanks to Dask, this is no longer required in a large number of cases.

I’m not going to spend a lot of time explaining how Dask works. There is a great set of tutorials that walks you through setup and the various component operations. Here, I’m going to assume you already have Dask installed and configured, and let’s just see if we can figure out the use cases where it beats NumPy. Will it be all of them? And if not, under what circumstances is it better (if any), and by how much?

If you want to skip straight to the answers, check out the notebook here.

First, I started by testing a simple numpy.where() vs. dask.array.where() operation on data sets as large as NumPy could handle on my machine. (Note: Dask can actually handle more data than you can fit in memory, but I was trying to compare apples to apples.)
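A minimal sketch of that comparison is below. The array size and chunking here are illustrative assumptions, not the actual values from the notebook:

```python
import numpy as np
import dask.array as da

# Illustrative size -- the notebook used arrays as large as memory allowed
size = 1_000_000
arr = np.random.randint(0, 2, size)

# NumPy: operates on the whole array in a single thread
np_result = np.where(arr == 0, -1, arr)

# Dask: same logic, but the array is split into chunks that can be
# processed in parallel; .compute() triggers the actual work
da_arr = da.from_array(arr, chunks=size // 8)
da_result = da.where(da_arr == 0, -1, da_arr).compute()

# Both approaches should produce identical results
assert (np_result == da_result).all()
```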

The first several tests were not promising for the parallelized approach. On a simple where() call replacing 0’s with -1’s in large arrays and matrices, NumPy beat the pants off of Dask, usually by a factor of about 2.

So then I decided to up the stakes with a more complex problem: creating a large random matrix, dividing all of its values by 2, squaring the result, and summing the squares.
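The NumPy version might look something like this sketch. The function name and matrix size are my own illustrative choices, not necessarily what the notebook used:

```python
import time
import numpy as np

def numpy_complex(n):
    """Create an n x n random matrix, halve it, square it, and sum the squares."""
    mat = np.random.random((n, n))
    return ((mat / 2) ** 2).sum()

start = time.time()
result = numpy_complex(2_000)  # illustrative size; scale up to stress the machine
print(result, "in", time.time() - start, "seconds")
```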

Relative to the previous exercises this took a long time, and with all of the added complexity, I figured this would be a chance for Dask to shine. I also printed the result as a sanity check, to make sure Dask and NumPy were really performing the same computation.

So then I created a similar function using Dask and gave it a whirl.
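A sketch of what that Dask equivalent could look like, with the same caveat that names, sizes, and chunk shape are illustrative assumptions:

```python
import time
import dask.array as da

def dask_complex(n, chunks):
    """Same computation as the NumPy version, but on a chunked Dask array."""
    mat = da.random.random((n, n), chunks=chunks)
    # Nothing runs yet -- Dask just builds a task graph of the operations
    total = ((mat / 2) ** 2).sum()
    # .compute() executes the graph, processing chunks in parallel
    return total.compute()

start = time.time()
result = dask_complex(2_000, chunks=(500, 500))
print(result, "in", time.time() - start, "seconds")
```

The key design difference is laziness: Dask defers execution until .compute(), so it can schedule the per-chunk work across all available cores rather than running one big operation in a single thread.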

Folks, I didn’t believe my eyes at first, so I ran this several times just to be sure (and it was at this point that I added the print statement as the sanity check). Sure enough, on this more complex problem the Dask approach cut the compute time from 4 minutes down to 6 seconds, a **97% performance improvement** on my laptop (a Mac with the new M1 chip, so your mileage may vary).

Again, check it out for yourself. Dask takes a little getting used to and doesn’t support everything that NumPy supports, but for large, complex problems, it has quickly earned a spot as my go-to based on these results.