Pandas vs Dask: The Power of Parallel Computing!
Suppose you run your code on a personal laptop with, say, 32GB of RAM, and you handle large datasets for your machine learning projects. Processing the data takes a long time, so you blame the PC configuration and upgrade to a better machine, yet the issue remains. That is what happened to me. After some research, I found that the problem was with pandas and not with my PC. I also found many alternatives, such as PySpark, Dask, and Modin, so I decided to benchmark the most popular ones to learn which to use in which case. Here, we are going to benchmark pandas and Dask.
Dask scales NumPy, pandas, and scikit-learn workloads. In many cases you can use Dask's equivalents of those libraries with only small code changes. What makes it fast is that it lets us use multiple cores, and it can spill data to disk when the data doesn't fit in memory. The most significant advantage, I'd say, is that it allows more parallelism on your local machine: faster results with less effort. The image below illustrates parallel computing in Dask.
Let's get started with benchmarking!
Let's write a simple function in plain Python and then run it with the Dask framework as well. It's just an addition function, in which I have included a sleep call to make the results easy to interpret. Plain Python takes around 3s.
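Here is a minimal sketch of what such a comparison might look like. The function names (`slow_add`, `run_plain`, `run_dask`) and the three-call layout are my assumptions for illustration, not the exact benchmark code. It uses `dask.delayed`, which builds a task graph and runs independent tasks in parallel when `compute()` is called.

```python
import time

from dask import delayed


def slow_add(x, y):
    """Add two numbers after a one-second pause (a stand-in for real work)."""
    time.sleep(1)
    return x + y


def run_plain():
    # Three calls executed one after another: about 3 seconds of sleeping.
    a = slow_add(1, 2)
    b = slow_add(3, 4)
    return slow_add(a, b)


def run_dask():
    # delayed() only records the calls; nothing runs until compute().
    a = delayed(slow_add)(1, 2)
    b = delayed(slow_add)(3, 4)
    total = delayed(slow_add)(a, b)  # depends on both a and b
    # a and b are independent, so Dask can run them at the same time.
    return total.compute()


if __name__ == "__main__":
    for name, fn in [("plain Python", run_plain), ("dask.delayed", run_dask)]:
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
        print(f"{name}: result={result}, elapsed={elapsed:.2f}s")
```

In this sketch the last addition has to wait for the first two, so the Dask version cannot drop below two sleep periods; the gain comes only from running the two independent calls side by side.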
2. Reading a DataFrame
Reading a data frame is one of the most common first steps in machine learning. Pandas reads the whole file in a single process, whereas Dask uses parallel computing: the data frame is split into parts, and each part is processed separately. I used a million-row data frame here. You can see the drastic difference below!
3. Exporting to CSV
Exporting a data frame to CSV is a CPU-intensive process, especially for large datasets, and Dask lags in this part. As mentioned earlier, Dask splits the data into small chunks according to its size. When you export a data frame with Dask, it is written as several CSV files, six equally split files in this case (the number of splits depends on the size of the data or on what you specify in the code). Pandas, in contrast, exports the data frame as a single CSV. So, Dask takes more time than pandas here.
Merging data frames is one of the more intensive data manipulation tasks, so we included it in our benchmark to get a clearer picture of how pandas and Dask compare.
So far, we have seen the pros and cons of both pandas and Dask. Thanks to parallel computing, Dask can split a single task among several workers (four here), unlike pandas, which uses a single worker. Learning Dask is not tough either, since its data frame API closely mirrors the pandas API. That's it for this article. Enjoy learning!