Dask — To handle large data frames using parallel computing

Prakash R
Published in featurepreneur
Aug 19, 2021

Why do you need Dask?

Python packages like NumPy, pandas, scikit-learn, and seaborn make data manipulation and ML tasks very convenient. For most data analysis tasks, the Python pandas package is good enough: you can do all sorts of data manipulation with it, and it works well alongside ML model building.

But as your data gets bigger than what fits in RAM, pandas is no longer sufficient. This is a very common problem.

You could use Spark or Hadoop to solve this. But these are not Python environments, which stops you from using NumPy, scikit-learn, pandas, TensorFlow, and all the other commonly used Python libraries for ML.

Is there a solution for this?

Yes! This is where Dask comes in.

What is Dask?

Dask is an open-source library that provides advanced parallelization for analytics, especially when you are working with large data.

It is built to help you improve code performance and scale up without having to rewrite your entire codebase. The good thing is, you can keep using all your favorite Python libraries, since Dask is built in coordination with NumPy, scikit-learn, scikit-image, pandas, XGBoost, RAPIDS, and others.

That means you can now use Dask not only to speed up computations on datasets using parallel processing, but also to build ML models with scikit-learn or XGBoost on much larger datasets.

What is Parallel Processing?

Parallel processing refers to executing multiple tasks at the same time, using multiple processors in the same machine.

Generally, code is executed in sequence, one task at a time. But suppose you have a complex program that takes a long time to run, where most of the steps are independent, that is, they have no data or logic dependency on each other. This is the case for most matrix operations.

So, instead of waiting for the previous task to complete, we compute multiple independent steps simultaneously. This lets you take advantage of the processing power available in most modern computers, thereby reducing the total time taken.
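The idea above can be sketched with `dask.delayed` (a toy example, assuming Dask is installed; the function names are made up for illustration): each decorated call builds a task graph instead of running immediately, and tasks with no dependencies on each other are free to run in parallel when `.compute()` is called:

```python
from dask import delayed

@delayed
def square(x):
    # Stand-in for an expensive, independent computation
    return x * x

@delayed
def total(values):
    # Combines the independent results, so it depends on all of them
    return sum(values)

# The three square() calls have no data dependency on each other,
# so Dask may execute them simultaneously
squares = [square(n) for n in [1, 2, 3]]
result = total(squares).compute()
print(result)  # 1 + 4 + 9 = 14
```

Nothing runs until `.compute()`; at that point Dask schedules the independent `square` tasks in parallel and runs `total` once they finish.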
