Do you need to handle datasets that are larger than 100GB?
Assuming you are running code on the personal laptop, for example, with 32GB of RAM, which DataFrame should you go with? Pandas, Dask or PySpark? What are their scaling limits?
The purpose of this article is to suggest a methodology that you can apply in daily work to pick the right tool for your datasets.
If the size of a dataset is less than 1 GB, Pandas would be the best choice with no concern about the performance.
1GB to 100 GB
If the data file is in the range of 1GB to 100 GB, there are 3 options:
- Use parameter “chunksize” to load the file into Pandas dataframe
- Import data into Dask dataframe
- Ingest data into PySpark dataframe
What if the dataset is larger than 100 GB?
Pandas is out immediately due to the local memory constraints. How about Dask? It might be able to load the data into Dask DataFrame depends on the datasets. However, the code would be hanging when you call APIs.
PySpark can handle petabytes of data efficiently because of its distribution mechanism. The SQL like operations are intuitive to data scientists which can be run after creating a temporary view on top of Spark DataFrame. Spark SQL also allows users to tune the performance of workloads by either caching data in memory or configuring some experimental options.
Then, do we still need Pandas since PySpark sounds super?
The answer is “Yes, definitely!”
There are at least two advantages of Pandas that PySpark could not overcome:
- stronger APIs
- more libraries, i.e. matplotlib for data visualization
In practice, I would recommend converting Spark DataFrame to a Pandas DataFrame using method toPandas() with optimization with Apache Arrow. Examples can be found at this link.
It should be done ONLY on a small subset of the data. For example, the subset of the data you would like to apply complicated methods on, or the data you would like to visualize.
In this article, we went through 3 scenarios based on the volumes of data and offered solutions for each case. The core idea is to use PySpark for the large dataset and convert the subset of data into Pandas for advanced operations.
Curious. How do you handle large datasets (>100GB) on your laptop at work?