Why you should use Parquet files with Pandas

Why you should use Parquet files with Pandas to make your analytics pipeline faster and more efficient.

Tirthajyoti Sarkar
Productive Data Science



Why Parquet?

Comma-separated values (CSV) is the most widely used flat-file format in data analytics. It is simple to understand and work with, and it performs decently on small to medium datasets. However, as we progress toward larger datasets, there are excellent reasons to move to file formats built on the columnar data storage principle. These newer formats can also reduce what we pay for cloud-based storage of large data files.

One of the most popular of these emerging file types is Apache Parquet. With its column-oriented design, Parquet brings many efficient storage constructs (e.g., blocks, row groups, column chunks) into the fold. Additionally, it is built to support very efficient compression and encoding schemes, making for space-saving data pipelines.
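To see the storage benefit concretely, here is a minimal sketch (assuming pandas and pyarrow are installed; the synthetic DataFrame, file names, and compression choices are illustrative assumptions of mine, not from the original article) that writes the same data as CSV and as Parquet and compares file sizes:

```python
import os
import numpy as np
import pandas as pd

# Build a synthetic DataFrame (illustrative only; any tabular data works)
rng = np.random.default_rng(42)
n = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n),
    "value": rng.normal(size=n),
    "category": rng.choice(["a", "b", "c"], size=n),
})

# Write the same data as CSV and as Parquet. Snappy is the pandas/pyarrow
# default compression; gzip trades write speed for a smaller file.
df.to_csv("data.csv", index=False)
df.to_parquet("data.snappy.parquet", compression="snappy")
df.to_parquet("data.gzip.parquet", compression="gzip")

# Compare on-disk sizes
for path in ["data.csv", "data.snappy.parquet", "data.gzip.parquet"]:
    print(path, round(os.path.getsize(path) / 1e6, 1), "MB")
```

On typical mixed-type data, the compressed Parquet files often come out several times smaller than the equivalent CSV, which translates directly into lower cloud storage costs.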

In a previous article, I discussed how reading a file from disk storage into memory is faster and better optimized for Parquet than for CSV, using Python Pandas and PyArrow functions. You can check out that article here.
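As a rough illustration of that result, the following sketch (continuing with the files written above; exact timings will vary with your hardware and data) compares read times and shows the column pruning that columnar storage enables:

```python
import time
import pandas as pd

# Time a full read of each format
start = time.perf_counter()
df_csv = pd.read_csv("data.csv")
print(f"CSV read:     {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
df_pq = pd.read_parquet("data.snappy.parquet")  # uses pyarrow by default
print(f"Parquet read: {time.perf_counter() - start:.3f} s")

# Columnar layout also lets you load only the columns you need,
# skipping the rest of the file entirely
subset = pd.read_parquet("data.snappy.parquet", columns=["id", "value"])
```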
