Published in featurepreneur

Pandas vs Dask: The Power of Parallel Computing!

Suppose you are running code on a personal laptop with, say, 32GB of RAM, and you have been handling large datasets for your machine learning projects. Because of your PC's configuration, processing the data takes a long time, so you decide to upgrade to a better configuration. But the issue is still there. That was my situation, and after some research I found that the problem was with pandas and not with my PC. I also found many alternatives, such as PySpark, Dask, and Modin, so I decided to benchmark the most popular ones to learn which to use in which case. Here, we are going to benchmark Pandas and Dask.

Dask:

Dask scales NumPy, pandas, and scikit-learn workflows. In many cases you can swap in Dask's counterparts of those libraries with few code changes. What makes it fast is that it can use multiple cores, and it can spill data to disk if the data doesn't fit in memory. The most significant advantage, I'd say, is that it allows more parallelism on your local machine: faster results with less effort.

[Figure: parallel computing in Dask]

Let's get started with Benchmarking!

1. Simple Function

Let's write a simple function in plain Python and then again using the Dask framework. It's just an addition function, in which I have included a sleep call so the results are easy to interpret. Plain Python takes around 3s.

Python:

Python Performance

Dask:

Dask Performance
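The original code for this step is not reproduced here, so the following is only a sketch of what such a comparison could look like, using dask.delayed. The function name add, the 1-second delay, the input values, and the use of 3 threads are all assumptions chosen so that the plain Python version takes roughly 3s.

```python
import time
from dask import delayed

DELAY_SECONDS = 1.0  # assumed delay, so three serial calls take about 3 seconds


def add(x, y):
    """Add two numbers, sleeping first to simulate slow work."""
    time.sleep(DELAY_SECONDS)
    return x + y


# Plain Python: the three calls run one after another.
start = time.perf_counter()
serial_total = add(1, 2) + add(3, 4) + add(5, 6)
serial_seconds = time.perf_counter() - start

# Dask: wrap the calls with delayed() so they become tasks instead of running immediately.
start = time.perf_counter()
tasks = [delayed(add)(1, 2), delayed(add)(3, 4), delayed(add)(5, 6)]
lazy_total = delayed(sum)(tasks)
# The three independent add() tasks can run at the same time on 3 threads.
parallel_total = lazy_total.compute(scheduler="threads", num_workers=3)
parallel_seconds = time.perf_counter() - start

print(f"plain Python: {serial_seconds:.2f}s, Dask: {parallel_seconds:.2f}s")
```

Because the three additions do not depend on each other, Dask is free to run them concurrently, which is where the time saving comes from. Exact timings will vary by machine.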

2. Reading a DataFrame

Reading a data frame is one of the most common first steps in a machine learning project. Here, Pandas reads the data frame the traditional way, loading the whole file in one go, whereas Dask uses parallel computing: the data frame is split into parts (partitions), which are then processed. Keep in mind that Dask's read is lazy, so the call itself returns before the data is actually loaded. I have used a million-row data frame here. You can see the drastic difference below!

Pandas:

Dask:

3. Exporting to CSV

Exporting a data frame to CSV is a CPU-intensive process, especially for large datasets, and Dask lags in this part. As mentioned earlier, Dask splits the data into smaller chunks according to its size. When you export a data frame using Dask, it is written as several CSV files rather than one; here it was exported as 6 equally split CSVs (the number of splits depends on the size of the data or on what you specify in the code). Pandas, by contrast, exports the data frame as a single CSV. As a result, Dask takes more time than Pandas here.

Pandas:

Dask:

4. Merging Dataframes

Merging data frames is one of the more intensive parts of data manipulation, so we included it in the benchmark to get a clearer picture of how Pandas and Dask compare.

Pandas:

Dask:

Conclusion

So far, we have seen the pros and cons of both Pandas and Dask. Thanks to parallel computing, a single job in Dask is shared among four workers here, unlike the single worker in Pandas. Learning Dask is not a tough job either, since its DataFrame API closely follows the Pandas API. That's it for this article. Enjoy learning!

Eswara Prasad
