Columnar or Row-Based File Format: Parquet, Avro, or ORC for Big Data Analytics

If you work with big data, you have probably encountered several file formats for storing and processing large-scale datasets. Common ones include CSV, JSON, Avro, ORC, and Parquet. Each has its own advantages and disadvantages, depending on the use case and the characteristics of the data. In this blog post, we will focus on Parquet, a columnar file format designed for efficient data storage and retrieval. We will also compare Parquet to two other formats, ORC (also columnar) and Avro (row-based), and see why Parquet is a good choice for big data analytics.

Code Breaker
7 min read · Oct 2, 2023

What is Parquet?

Parquet is an open-source file format built on the concept of columnar storage. Columnar storage means that the data is organized by columns rather than by rows, as it is in traditional row-based formats like CSV or JSON. This has several benefits for big data analytics, such as:

Compression: Because all values in a column share the same data type, each column can be compressed simply and effectively. Parquet supports various…
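To make the row-versus-column idea concrete, here is a minimal sketch using only the Python standard library. It is not how Parquet itself encodes data: the sample records, the field names, and the helpers `to_columns` and `compressed_size` are all hypothetical, and JSON plus zlib simply stand in for a real encoder. The sketch pivots the same records from a row layout into a column layout and prints the compressed size of each, so you can see how grouping similar values together affects compression on your own data.

```python
import json
import zlib

# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"id": i, "country": ["US", "DE", "IN"][i % 3], "active": i % 2 == 0}
    for i in range(1000)
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


def compressed_size(obj):
    """Serialize to JSON and return the zlib-compressed size in bytes."""
    return len(zlib.compress(json.dumps(obj).encode("utf-8")))


columns = to_columns(rows)
row_size = compressed_size(rows)        # row layout: one dict per record
column_size = compressed_size(columns)  # column layout: one list per field
print(f"row layout: {row_size} bytes, column layout: {column_size} bytes")
```

In the column layout, each list holds values of a single type (all integers, all short strings, all booleans), which is the property the compression point above relies on. The exact sizes depend on the data, so treat the printed numbers as an illustration rather than a benchmark.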
