Why you should use Parquet files with Pandas

Why you should use Parquet files with Pandas to make your analytics pipeline faster and more efficient.

Tirthajyoti Sarkar
Productive Data Science



Why Parquet?

Comma-separated values (CSV) is the most widely used flat-file format in data analytics. It is simple to understand and work with, and it performs decently on small to medium datasets. However, as we progress toward larger datasets, there are excellent reasons to move to file formats built on the columnar data storage principle. These newer formats can also reduce what we pay for cloud-based storage of large data files.

One of the most popular of these emerging file types is Apache Parquet. With its column-oriented design, Parquet brings many efficient storage constructs (e.g., blocks, row groups, column chunks) into the fold. Additionally, it is built to support very efficient compression and encoding schemes, making for space-saving data pipelines.
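To see the storage benefit concretely, here is a minimal sketch (assuming pandas and pyarrow are installed; the synthetic DataFrame, file names, and compression choices are illustrative assumptions of mine, not from the original article) that writes the same data as CSV and as Parquet and compares file sizes:

```python
import os
import numpy as np
import pandas as pd

# Build a synthetic DataFrame (illustrative only; any tabular data works)
rng = np.random.default_rng(42)
n = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n),
    "value": rng.normal(size=n),
    "category": rng.choice(["a", "b", "c"], size=n),
})

# Write the same data as CSV and as Parquet. Snappy is the pandas/pyarrow
# default compression; gzip trades write speed for a smaller file.
df.to_csv("data.csv", index=False)
df.to_parquet("data.snappy.parquet", compression="snappy")
df.to_parquet("data.gzip.parquet", compression="gzip")

# Compare on-disk sizes
for path in ["data.csv", "data.snappy.parquet", "data.gzip.parquet"]:
    print(path, round(os.path.getsize(path) / 1e6, 1), "MB")
```

On typical mixed-type data, the compressed Parquet files often come out several times smaller than the equivalent CSV, which translates directly into lower cloud storage costs.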

In a previous article, I discussed how reading a file from disk storage into memory is faster and better optimized for Parquet than for CSV, using Python Pandas and PyArrow functions. You can check out that article here.
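As a rough illustration of that result, the following sketch (continuing with the files written above; exact timings will vary with your hardware and data) compares read times and shows the column pruning that columnar storage enables:

```python
import time
import pandas as pd

# Time a full read of each format
start = time.perf_counter()
df_csv = pd.read_csv("data.csv")
print(f"CSV read:     {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
df_pq = pd.read_parquet("data.snappy.parquet")  # uses pyarrow by default
print(f"Parquet read: {time.perf_counter() - start:.3f} s")

# Columnar layout also lets you load only the columns you need,
# skipping the rest of the file entirely
subset = pd.read_parquet("data.snappy.parquet", columns=["id", "value"])
```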
