Dask: Pandas for faster processing.

jayasurya karthikeyan · Published in featurepreneur · Mar 18, 2021 · 4 min read

When data doesn’t fit in memory, we can use chunking: loading and then processing it in chunks, so that only a subset of the data needs to be in memory at any given time. But while chunking saves memory, it doesn’t address the other problem with large amounts of data: computation can also become a bottleneck.
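
To make that concrete, here's a minimal sketch of chunked processing in plain Pandas, reusing the voters.csv file and column names from the examples later in this article (the chunk size is an arbitrary choice):

import pandas as pd

# Read the CSV in 1,000,000-row chunks so only one chunk
# is in memory at any given time.
chunks = pd.read_csv(
    "voters.csv",
    chunksize=1_000_000,
    usecols=["Residential Address Street Name ",
             "Party Affiliation "],
)

partial_counts = []
for chunk in chunks:
    by_party = chunk.groupby("Party Affiliation ")
    street = by_party["Residential Address Street Name "]
    partial_counts.append(street.value_counts())

# Combine the per-chunk counts into a single result.
result = pd.concat(partial_counts).groupby(level=[0, 1]).sum()
print(result.sort_values(ascending=False))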

How can we speed processing up?

One approach is to utilize multiple CPUs: pretty much every computer these days has more than one CPU. If you have two CPUs, you can often run your code (almost) twice as fast; four CPUs and you might approach a 4× speedup, and so on.

Using multiple CPUs lets us get through the data much faster. And since most modern computers support multi-threading, even a machine with a modest core count can work through a large amount of data in parallel.

Even better, the chunking technique that helps reduce memory can also enable parallelism. Let’s see why, and then learn how the Dask library can easily enable parallelism of your Pandas processing code.
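
Why does chunking enable parallelism? Because each chunk can be processed independently, different chunks can be handed to different CPUs at the same time. Here's a rough sketch of that idea using the standard library's concurrent.futures; this isn't how Dask works internally, just an illustration, and the chunk size and worker count are arbitrary:

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def count_chunk(chunk):
    # Each worker process computes partial counts for one chunk.
    by_party = chunk.groupby("Party Affiliation ")
    return by_party["Residential Address Street Name "].value_counts()

if __name__ == "__main__":
    chunks = pd.read_csv(
        "voters.csv",
        chunksize=1_000_000,
        usecols=["Residential Address Street Name ",
                 "Party Affiliation "],
    )
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Merge the partial counts, just like the chunked version above.
    result = pd.concat(partials).groupby(level=[0, 1]).sum()
    print(result.sort_values(ascending=False))

Managing this bookkeeping by hand gets tedious quickly, which is exactly what Dask automates.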

Dask: a parallel processing library

One of the easiest ways to do this in a scalable way is with Dask, a flexible parallel computing library for Python. Dask provides an API that emulates Pandas while implementing chunking and parallelization transparently.

Because Dask is doing all the hard work for you, a good starting point is actually a more naive version of our task. Specifically, instead of implementing chunking, we’ll switch to Pandas code that loads all the data into memory at once:

The advantage of Dask is that the code you write stays almost identical to normal Pandas code; all the chunking and parallelization happen behind the scenes.

import pandas

df = pandas.read_csv(
    "voters.csv",
    usecols=["Residential Address Street Name ",
             "Party Affiliation "],
)

def get_counts(df):
    by_party = df.groupby("Party Affiliation ")
    street = by_party["Residential Address Street Name "]
    return street.value_counts()

result = get_counts(df)
result.sort_values(ascending=False, inplace=True)

print(result)
# Python code for Pandas framework

Using Dask to emulate Pandas

The way Dask works involves two steps:

  1. First, you set up a computation, internally represented as a graph of operations.
  2. Then, you actually run the computation on that graph.

When Dask emulates the Pandas API, it doesn’t actually calculate anything; instead, it’s remembering what operations you want to do as part of the first step above. Only once you run compute() does the actual work happen.
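
A minimal illustration of that laziness, using the same voters.csv file:

import dask.dataframe as dd

ddf = dd.read_csv("voters.csv")  # nothing is read from disk yet
counts = ddf["Party Affiliation "].value_counts()  # still lazy

print(type(counts))      # a Dask Series, not a Pandas Series
print(counts.compute())  # only now is the file read and counted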

The result is code that looks quite similar, but behind the scenes is able to chunk and parallelize the implementation.

import dask.dataframe as dd

# Load the data with Dask instead of Pandas.
df = dd.read_csv(
    "voters.csv",
    blocksize=16 * 1024 * 1024,  # 16MB chunks
    usecols=["Residential Address Street Name ",
             "Party Affiliation "],
)

# Set up the computation graph; unlike the Pandas code,
# no work is done at this point:
def get_counts(df):
    by_party = df.groupby("Party Affiliation ")
    street = by_party["Residential Address Street Name "]
    return street.value_counts()

result = get_counts(df)

# Actually run the computation, using 2 threads:
result = result.compute(num_workers=2)

# Sort using a normal Pandas Series, since Dask's
# Pandas emulation doesn't implement this method:
result.sort_values(ascending=False, inplace=True)

print(result)
# Python code for Dask

Dask (usually) makes things better

The naive read-all-the-data Pandas code and the Dask code are quite similar.

  • In general, the naive version does the worst on both memory usage and CPU time. Its CPU time is worse because the algorithms it uses don't work well on really large blocks of data: both the chunked and Dask versions operate on smaller chunks of data at a time.
  • The chunked version uses the least memory, but wallclock time isn’t much better.
  • The Dask version uses far less memory than the naive version and finishes fastest (assuming you have CPUs to spare).
[Figure: computation time for Dask vs. Pandas]

Dask isn’t a panacea, of course:

  • Parallelism has overhead; it won't always make things finish faster. And it doesn't reduce the CPU time, so if you're already saturating your CPUs it won't speed things up on wallclock time either.
  • Some tuning is needed: larger block sizes increase memory use, but up to a point they also allow faster processing, as the timing sketch below illustrates.
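
One rough way to explore that trade-off is to time the same computation at a few different block sizes. This is only a sketch: the block sizes and the worker count are arbitrary choices, and the right values depend on your data and hardware.

import time

import dask.dataframe as dd

for mb in (4, 16, 64):
    df = dd.read_csv(
        "voters.csv",
        blocksize=mb * 1024 * 1024,
        usecols=["Residential Address Street Name ",
                 "Party Affiliation "],
    )
    counts = df.groupby("Party Affiliation ")[
        "Residential Address Street Name "].value_counts()
    start = time.time()
    counts.compute(num_workers=2)
    # Larger blocks mean fewer tasks, but more memory per task.
    print(f"{mb}MB blocks: {time.time() - start:.1f}s")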

If your task is simple or fast enough, single-threaded normal Pandas may well be faster.

For slow tasks operating on large amounts of data, you should definitely try Dask out. As you can see, it may require only minimal changes to your existing Pandas code to get faster processing with lower memory use.
