Introduction to Parallel Computing in Python using Dask

Hemanthhari2000 · Published in featurepreneur · Mar 30, 2021 · 6 min read

Understand the basics of parallel computing and Dask

Introduction

Python is one of the fastest-growing programming languages in the world right now. It is widely used in data science and research, largely because of its approachable nature and libraries like NumPy, Pandas, and scikit-learn. These libraries make it easy to build, train, and visualize models. But this power comes at a price: by default they largely run on a single core, which slows things down drastically on large workloads.

This is one reason why Python feels slow for machine learning computations. So, how can we speed up this process? Can it be done? Is it possible to make Python code faster? Of course it is: Python computations can be boosted using Dask.

Overview

Let’s take a look at the contents that we will cover in this article.

  • What is Parallel Computing?
  • What is Dask?
  • Prerequisites
  • Installing Dask
  • Implementation of Dask
  • Conclusion

What is Parallel Computing?

Parallel computing is a type of computing architecture in which several processors simultaneously execute multiple, smaller calculations broken down from an overall larger, complex problem.

As soon as the program starts executing, it divides the problem into several smaller subtasks. Each subtask executes independently of the others, and the partial results are finally combined to produce the overall output.
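As a tiny illustration of that divide-and-combine pattern (a generic sketch using Python's standard library, not Dask; the four-way split is an arbitrary choice), here is a large sum broken into independent subtasks:

from concurrent.futures import ThreadPoolExecutor

numbers = range(1, 1_000_001)

# Divide the overall problem into four smaller subtasks.
chunks = [numbers[i::4] for i in range(4)]

# Each subtask runs independently on its own worker thread,
# and the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

print(sum(partial_sums))  # same answer as sum(numbers): 500000500000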

What is Dask?

Dask is an open-source, flexible library for parallel computing written in Python, and a platform for building distributed applications. It does not load data immediately; instead, it keeps a lazy reference to the data and only loads or computes the pieces that are actually needed. Dask can use more than a single processor core, and its parallel execution makes it fast and efficient on huge datasets. Because it works on data in chunks, it also helps avoid memory-overflow errors.

Dask has two families of task schedulers:

  • Single machine scheduler: This scheduler provides basic features on a local process or thread pool. It was created first and is the default. It is simple and cheap to use, but it only runs on a single machine and does not scale.
  • Distributed scheduler: This scheduler is more sophisticated. It offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster (see the sketch below).
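For instance, starting the distributed scheduler on your own machine takes just two lines (a minimal sketch, assuming the distributed package is installed; it is included in dask[complete]):

from dask.distributed import Client

# With no arguments, Client() starts a local cluster of workers
# and connects Dask to the distributed scheduler.
client = Client()
print(client)  # prints a summary of the cluster (workers, threads, memory)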

Prerequisites

All you need before setting up Dask is:

  • Python (latest version recommended)
  • Anaconda with Jupyter Notebook (recommended)

Installing Dask

If you have Anaconda, Dask comes preinstalled in your environment. But if for some reason you don't have Dask preinstalled, go ahead with the following steps.

  • Create a new Conda Environment
conda create -y -n dask_env 
  • Activate your Conda Environment
conda activate dask_env
  • Install Dask
conda install dask 

(or)

pip install dask 

Note: This will only install a few core packages related to Dask. If you want everything, then go ahead and run the following command.

  • Install all Dask packages
python -m pip install "dask[complete]"

That’s it, you have just installed all the required Dask packages. Now you can go ahead and start messing with Dask.
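As a quick, optional sanity check that the installation worked:

import dask

print(dask.__version__)  # prints the installed Dask version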

Implementation of Dask

Let’s get started with a simple implementation of Dask. But before that, we need to understand why we are using Dask in the first place, right? So, let us open a Python prompt (or a Jupyter notebook) and type the following.

import numpy as np

dataset = np.ones((100000, 1000, 1000))

As you can see, we used the NumPy library’s np.ones() function to generate some data. The dimensions here are clearly quite big: 100,000 × 1,000 × 1,000 float64 values work out to roughly 800 GB. Now let us execute the code and see the result.

Oh, snap!!! See, we got a MemoryError, which means that NumPy was not able to allocate such a huge amount of data. As data scientists, it is our job to handle large datasets efficiently. So, let’s use Dask and see what happens.

import dask.array as da

dataset = da.ones((100000, 1000, 1000))

Just by replacing numpy with dask.array, the code runs without complaint.

Wow!!! No errors. That’s the power of Dask: da.ones() does not allocate the array at all. It only records the shape, dtype, and chunk structure, and computes individual chunks lazily when they are actually needed.

By integrating with existing libraries, Dask lets developers and data scientists move from traditional single-machine workflows to parallel and distributed computing without learning new frameworks or writing long, complex code.
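You can see the laziness for yourself. The following sketch inspects the Dask array’s metadata and then computes only a tiny slice (the attributes shown are standard dask.array features; the numbers come from the example above):

import dask.array as da

dataset = da.ones((100000, 1000, 1000))

# Only metadata exists at this point; nothing has been allocated.
print(dataset.nbytes / 1e9)  # ~800.0 GB if it were ever materialized
print(dataset.chunks)        # how Dask would split the array into blocks

# Computing on a small slice materializes only the chunks it touches.
print(dataset[0].sum().compute())  # 1000 * 1000 ones -> 1000000.0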

Let us try out one more example to understand Dask and parallel computation better.

Let us run the following code.

from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

These are just normal Python functions with an artificial delay of 1 second each. Now, run the following to execute them.

%%time
# %%time is a Jupyter cell magic that reports how long the cell takes.

x = inc(1)
y = inc(2)
z = add(x, y)

The output after executing the above code is as follows:

CPU times: user 694 µs, sys: 47 µs, total: 741 µs
Wall time: 2 s

In the Jupyter Notebook it took around 2 seconds to execute the above code: incrementing the two given numbers and then adding them, one call after another.

Now let us use Dask’s delayed decorator to see parallel computation in Python, by wrapping the existing functions as shown:

from time import sleep
from dask import delayed

@delayed
def inc(x):
    sleep(1)
    return x + 1

@delayed
def add(x, y):
    sleep(1)
    return x + y

Now let us execute the code below and see what happens.

%%time

# inc and add are already wrapped by @delayed, so calling them
# only records tasks in a graph; nothing is computed yet.
x = inc(1)
y = inc(2)
z = add(x, y)

Executing the above code gives the output shown below:

CPU times: user 848 µs, sys: 59 µs, total: 907 µs
Wall time: 3.97 ms

Notice how fast that was: the cell ran in 3.97 milliseconds (ms), not seconds. But this is not because the work itself got faster. Dask’s delayed calls are lazy: so far we have only built a task graph describing the computation, and none of the sleep calls has actually run yet. To execute the graph, and to see the parallelism, we call .compute() on the final result.
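Here is that final step (run it in a new cell; exact timings will vary by machine):

%%time

# .compute() actually executes the task graph. The two inc calls
# have no dependency on each other, so Dask runs them in parallel;
# add then combines their results.
z.compute()

On a machine with at least two worker threads, this takes roughly 2 seconds of wall time, because the two 1-second inc calls overlap and only the 1-second add has to wait for them. That is where Dask’s speed-up comes from: the graph built by delayed lets the scheduler run independent tasks at the same time.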

Conclusion

In this article, we learned about Dask and its parallel computing capability. We went through the concept of parallel computing and an implementation where we wrote simple Python code to understand the importance of parallel computing and how easy Dask makes it for us. I hope this article was useful to you all. Will see you in my next article; until then, as always: code, learn, repeat…….

Follow for more…
