Pandas, Dask or PySpark? What Should You Choose for Your Dataset?

Alina Zhang
Aug 22, 2019 · 2 min read

Do you need to handle datasets that are larger than 100GB?

Assuming you are running code on a personal laptop, for example one with 32GB of RAM, which DataFrame library should you go with: Pandas, Dask, or PySpark? What are their scaling limits?

The purpose of this article is to suggest a methodology that you can apply in daily work to pick the right tool for your datasets.

[Figure: Pandas or Dask or PySpark]

< 1 GB

If the dataset is smaller than 1 GB, Pandas is the best choice, with no real concern about performance.
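As a minimal sketch (the file name and column names are illustrative assumptions, not from the article):

```python
import pandas as pd

# Hypothetical file; any CSV under ~1 GB fits comfortably in memory here.
df = pd.read_csv("data.csv")

# Typical in-memory operations run without any special handling.
summary = df.groupby("category")["amount"].sum()
print(summary.head())
```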

1 GB to 100 GB

If the data file is in the range of 1 GB to 100 GB, there are three options: process the file in chunks with Pandas, switch to Dask, or switch to PySpark. A chunked Pandas sketch follows; the other two tools are covered in the next section.
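As a sketch of the first option, here is chunked reading with Pandas; the file name, chunk size, and column names are illustrative assumptions:

```python
import pandas as pd

# Process a large CSV in fixed-size chunks so that only one chunk
# is resident in memory at a time.
totals = {}
for chunk in pd.read_csv("big_data.csv", chunksize=1_000_000):
    partial = chunk.groupby("category")["amount"].sum()
    for key, value in partial.items():
        totals[key] = totals.get(key, 0) + value

print(totals)
```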

> 100 GB

What if the dataset is larger than 100 GB?

Pandas is out immediately due to local memory constraints. How about Dask? It might be able to load the data into a Dask DataFrame, depending on the dataset, but the code can hang (or crawl) once you call APIs that trigger the actual computation.
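A minimal Dask sketch, assuming the same hypothetical CSV as above; Dask is lazy, so the heavy lifting (and any potential hang) happens only when .compute() is called:

```python
import dask.dataframe as dd

# Building the DataFrame only constructs a lazy task graph;
# the full file is not loaded at this point.
ddf = dd.read_csv("big_data.csv")

# The real work happens at .compute() -- this is the call that can
# run very slowly, or appear to hang, if the workload doesn't fit the machine.
result = ddf.groupby("category")["amount"].sum().compute()
print(result)
```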

PySpark can handle petabytes of data efficiently thanks to its distributed execution model. SQL-like operations, which are intuitive to data scientists, can be run after creating a temporary view on top of a Spark DataFrame. Spark SQL also lets users tune the performance of their workloads by caching data in memory or configuring experimental options.
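As a minimal sketch of the temporary-view pattern described above (the file path, table name, and column names are made-up examples, not from the article):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-dataset").getOrCreate()

# Hypothetical path; Parquet is a common format at this scale.
df = spark.read.parquet("s3://bucket/events/")

# Register a temporary view so the DataFrame can be queried with SQL.
df.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT category, SUM(amount) AS total
    FROM events
    GROUP BY category
""")

# Optionally cache the table in memory if it will be queried repeatedly.
spark.catalog.cacheTable("events")

result.show()
```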

So, if PySpark sounds this good, do we still need Pandas?

The answer is “Yes, definitely!”

There are at least two advantages of Pandas that PySpark has not overcome: a much richer set of methods for complex data manipulation, and straightforward visualization through the plotting libraries built around it.

In practice, I would recommend converting the Spark DataFrame to a Pandas DataFrame using the toPandas() method with the Apache Arrow optimization enabled (a sketch follows below).

This should be done ONLY on a small subset of the data: for example, the subset you would like to apply complicated methods to, or the data you would like to visualize.
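As a rough sketch, continuing from the PySpark example above (the spark session, the events view, and the column names are assumptions carried over from that sketch; on Spark 2.x the Arrow config key is spark.sql.execution.arrow.enabled):

```python
# Enable Arrow-based columnar transfer for toPandas()
# (on Spark 2.x the key is "spark.sql.execution.arrow.enabled").
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Filter down to a small subset first; only this subset is collected
# to the driver and converted into a Pandas DataFrame.
subset = spark.sql("""
    SELECT category, amount
    FROM events
    WHERE category = 'electronics'
""")

pdf = subset.toPandas()

# From here on, the full Pandas/Matplotlib toolbox is available.
pdf["amount"].hist()
```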

In this article, we went through three scenarios based on data volume and offered a solution for each. The core idea is to use PySpark for the large dataset and convert a small subset of it to Pandas for advanced operations.

Curious: how do you handle large datasets (>100 GB) on your laptop at work?
