Distributed Computing for Advanced Analytics

Parallel Processing in Python Using Dask

Scale up Pandas, NumPy, and scikit-learn natively

Shashvat G
The Startup

--

Image Source: Dask

“Data is the new science. Big Data holds the answers.” — Pat Gelsinger

Big Data does hold the answers. The more data we have, the more opportunities we have to extract business value from it. However, gathering data is not the only challenge; we also need to consider data storage and processing. As Data Scientists, we often use tools like Pandas and NumPy to analyze data, since they are widely trusted and efficient. However, as the size of a dataset grows, we start running into the limitations of these tools. What do we do next? We switch to a more scalable solution such as Spark, and at times this rework is time-consuming. Wouldn’t it be wonderful if you could keep working locally and scale up to a cluster only when you need to? Dask can help you achieve that.

This post assumes that you already have some knowledge of Python, Pandas, and NumPy. To give you an overview, Pandas is super useful for cleaning and analyzing smaller datasets, and NumPy is mainly used for working with high-performance multidimensional arrays. That being said, let’s dive right into Dask.

So, what is Dask?

Dask is a parallel computation framework that integrates seamlessly with your Jupyter notebook. Originally it was built to overcome the storage and memory limitations of a single machine and to extend the computation capabilities of Pandas, NumPy, and scikit-learn with Dask equivalents, but it soon found use as a generic distributed system.

If you are familiar with Pandas and NumPy, you are good to go with Dask DataFrames and Arrays.

Dask has two main virtues:

1. Scalability

Dask scales up Pandas, scikit-learn, and NumPy natively in Python. It runs resiliently on clusters with multiple cores and can also be scaled down to a single machine.

2. Scheduling

Dask’s task schedulers are optimized for computation, much like Airflow and Luigi. They provide rapid feedback, track tasks using task graphs, and aid in diagnostics in both local and distributed mode, making Dask interactive and responsive. A small sketch of building a task graph is shown below.
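As an illustration of the task graphs mentioned above, here is a minimal sketch using dask.delayed; the functions and numbers are made up for the example:

import dask

# Wrapping plain Python functions with dask.delayed makes their calls lazy:
# each call records a node in a task graph instead of running immediately.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

a = inc(1)              # no computation happens yet
b = inc(2)
total = add(a, b)       # builds a small task graph
print(total.compute())  # the scheduler runs the graph, in parallel where possible -> 5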

Image Source: Dask

Dask also provides a real-time, responsive dashboard that shows several metrics, such as task progress and memory use, and is updated every 100 ms.
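Here is a minimal sketch of starting a local cluster to get that dashboard, assuming the distributed scheduler is installed (for example via pip install "dask[distributed]"); the dashboard address may differ on your machine:

from dask.distributed import Client

client = Client()             # starts a local cluster of workers on this machine
print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status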

Installation

Dask can be installed with conda or pip, or cloned from the Git repository, depending on your preference.

conda install dask
conda install dask-core   # Only installs core packages

dask-core is a minimal Dask installation that only installs the core packages. The same applies to pip. You can also install just the requirements for Dask DataFrame or Dask Array if you are only interested in scaling up Pandas or NumPy, respectively.

python -m pip install dask
python -m pip install "dask[dataframe]" # Install requirements for dask dataframe
python -m pip install "dask[array]" # Install requirements for dask array

Dask DataFrames vs Pandas DataFrames

You should refrain from using Dask if Pandas is working just fine for you. Keeping that in mind, here is a demonstration of importing an example CSV file in Pandas and in Dask using a Jupyter notebook:

Pandas
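A minimal sketch of the Pandas side; the file name example.csv and the column names are placeholders for illustration:

import pandas as pd

df = pd.read_csv("example.csv")                # reads the entire file into memory
print(df.groupby("category")["value"].mean())  # executes immediately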

Dask does lazy computing, much like Spark, which essentially means it does not perform the work until a result is requested. Dask stays idle in terms of computation until the compute method (in this case) is invoked.

Dask Dataframe
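The Dask equivalent, using the same hypothetical file and columns; nothing is actually computed until compute() is invoked:

import dask.dataframe as dd

ddf = dd.read_csv("example.csv")                      # only builds a task graph
mean_value = ddf.groupby("category")["value"].mean()  # still lazy
print(mean_value.compute())                           # triggers the computation and returns a Pandas object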

Dask-ML

The Dask equivalent of scikit-learn is Dask-ML. Here is how we can use the XGBoost regressor (a popular gradient boosting algorithm for regression) in Dask:

from dask_ml.xgboost import XGBRegressor

model = XGBRegressor(...)
model.fit(train, train_labels)
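To make the snippet above more concrete, here is a sketch of how it might look end to end. It assumes a dask-ml version that still ships the XGBoost wrapper, a running local Client (which dask_ml.xgboost requires), and a hypothetical train.csv with a target column; the hyperparameters are placeholders:

from dask.distributed import Client
import dask.dataframe as dd
from dask_ml.xgboost import XGBRegressor

client = Client()                       # dask_ml.xgboost runs on a distributed Client

df = dd.read_csv("train.csv")           # hypothetical training data
train = df.drop("target", axis=1)
train_labels = df["target"]

model = XGBRegressor(n_estimators=100)  # placeholder hyperparameters
model.fit(train, train_labels)          # training is distributed across the workers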

Dask vs Spark

Spark is a resilient cluster computing framework that splits data and processing into small chunks, distributes them across a cluster of any size, and runs them in parallel.

Although Spark is the universal go-to tool for big data analysis, Dask seems quite promising. Dask is written in Python and is lightweight; Spark, on the other hand, offers more functionality, is written in Scala, and also supports Python and R. Spark should be your first choice if you are looking for an all-inclusive solution or already have JVM infrastructure, but if you want a lightweight way to try parallel computation quickly, Dask is worth a try. With a simple pip install, it will be at your disposal.

Conclusion

Dask is a fault-tolerant, elastic framework for parallel computation in Python that can be deployed locally, on the cloud, or on high-performance computing clusters. Not only does it scale out the capabilities of Pandas and NumPy, it can also be used as a task scheduler. It is worth mentioning that Dask is not meant as a replacement for Pandas and NumPy, since it does not provide their complete functionality. You can read more on Dask here.

I’d love to hear your thoughts about Dask, Machine learning, and Distributed Computing in the comments below.

If you found this useful and know anyone you think would benefit from this, please feel free to send it their way.


Data Scientist | Analyst who aspires to continuously learn and grow in Data Science Space. Find him on LinkedIn https://www.linkedin.com/in/shashvat-gupta/