Parquet, Avro or ORC?
Nov 4 · 3 min read

When you work in a big data environment, you will run into a variety of data formats and may wonder about the pros and cons of each, and how to pick one for a specific use case or data pipeline. Data can be stored in a human-readable format such as JSON or CSV, but that doesn’t mean it is the best way to actually store it.
There are three optimized file formats for use in Hadoop clusters:
- Optimized Row Columnar (ORC)
- Avro
- Parquet
These file formats share some similarities and each provides some degree of compression, but each one is unique and has its own pros and cons.
Their shared traits:
- Designed as storage formats for HDFS
- Files are splittable, so they can be split across multiple disks
- Data is stored with a schema

Parquet

- Column-oriented (stores data in columns): column-oriented data stores are optimized for read-heavy analytical workloads
- High compression rates (up to 75% with Snappy compression)
- Only the required columns are fetched/read (reducing disk I/O)
- Can be read and written using the Avro API and Avro schema
- Supports predicate pushdown (reducing disk I/O cost)
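To make column pruning and predicate pushdown more concrete, here is a minimal sketch in plain Python. It is a toy model, not the Parquet format or any Parquet library API: the names `make_row_group` and `scan` are hypothetical, and the assumption is simply that each group of rows keeps its columns separately along with min/max statistics.

```python
# Toy model of a columnar file: NOT the Parquet format or a Parquet API.
# Each "row group" stores its columns separately, plus min/max statistics.

def make_row_group(rows):
    """Pivot a list of row dicts into per-column lists and record min/max stats."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(row_groups, wanted, filter_column, min_value):
    """Return the `wanted` columns for rows where filter_column >= min_value.

    Row groups whose max statistic is below min_value are skipped without
    touching their column data (the "pushdown" part).
    """
    results, groups_read = [], 0
    for group in row_groups:
        _, group_max = group["stats"][filter_column]
        if group_max < min_value:
            continue  # the statistics prove that no row in this group can match
        groups_read += 1
        filter_values = group["columns"][filter_column]
        # Only the filter column and the requested columns are accessed.
        projected = [group["columns"][name] for name in wanted]
        for i, value in enumerate(filter_values):
            if value >= min_value:
                results.append(tuple(column[i] for column in projected))
    return results, groups_read


rows = [{"id": i, "country": "ID" if i % 2 else "SG", "amount": i * 10} for i in range(9)]
row_groups = [make_row_group(rows[i:i + 3]) for i in range(0, len(rows), 3)]

matches, groups_read = scan(row_groups, wanted=["id"], filter_column="amount", min_value=70)
print(matches)      # [(7,), (8,)]
print(groups_read)  # 1
```

With nine rows split into three groups, only the last group is read, and the `country` column is never touched. A real reader applies the same idea to data on disk, which is where the I/O savings come from.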
Avro

- Row-based (stores data in rows): row-based data stores are best for write-heavy transactional workloads
- Supports serialization
- Fast binary format
- Supports block compression and is splittable
- Supports schema evolution (the schema is described in JSON, while the data is stored in a binary format to optimize storage size)
- Stores the schema in the file header, so the data is self-describing
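The combination of a JSON schema in the header and binary rows in the body can be sketched with the standard library alone. This is a toy container, not the Avro wire format or the Avro API: the byte layout, the `TYPE_CODES` mapping, and the functions `write_file` and `read_file` are all assumptions made for illustration. Only the shape of the schema (a record with named, typed fields and an optional `default`) follows Avro's JSON conventions.

```python
import json
import struct

# Toy container: NOT the Avro wire format. Layout assumed for this sketch:
#   4-byte header length | JSON schema | fixed-size binary records
TYPE_CODES = {"int": "i", "double": "d"}  # hypothetical type mapping


def record_struct(schema):
    """Build a struct layout (little-endian, no padding) from the schema fields."""
    return struct.Struct("<" + "".join(TYPE_CODES[f["type"]] for f in schema["fields"]))


def write_file(schema, records):
    """Serialize the schema as a JSON header followed by packed binary records."""
    header = json.dumps(schema).encode("utf-8")
    layout = record_struct(schema)
    body = b"".join(
        layout.pack(*(rec[f["name"]] for f in schema["fields"])) for rec in records
    )
    return struct.pack("<I", len(header)) + header + body


def read_file(data, reader_schema=None):
    """Decode with the embedded writer schema, then project onto the reader schema."""
    (header_len,) = struct.unpack_from("<I", data, 0)
    writer_schema = json.loads(data[4:4 + header_len].decode("utf-8"))
    reader_schema = reader_schema or writer_schema
    layout = record_struct(writer_schema)
    names = [f["name"] for f in writer_schema["fields"]]
    records = []
    for values in layout.iter_unpack(data[4 + header_len:]):
        written = dict(zip(names, values))
        # Fields that are missing from the file fall back to the reader's default.
        records.append(
            {f["name"]: written.get(f["name"], f.get("default")) for f in reader_schema["fields"]}
        )
    return records


v1 = {"type": "record", "name": "Reading", "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "temperature", "type": "double"},
]}
v2 = {"type": "record", "name": "Reading", "fields": v1["fields"] + [
    {"name": "humidity", "type": "double", "default": 0.0},
]}

blob = write_file(v1, [{"sensor_id": 1, "temperature": 21.5}])
print(read_file(blob))      # decoded using only the schema stored in the header
print(read_file(blob, v2))  # old data read through a newer schema with a default
```

The first read needs no outside information because the file describes itself. The second read shows the idea behind schema evolution: data written with the old schema is still readable after a field with a default value is added.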
ORC

- Column-oriented (stores data in columns): column-oriented data stores are optimized for read-heavy analytical workloads
- High compression rates (ZLIB)
- Hive type support (datetime, decimal, and complex types such as struct, list, map, and union)
- Metadata is stored using Protocol Buffers, which allows fields to be added and removed
- Compatible with HiveQL
- Supports serialization
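One way to get a feel for why a column-oriented layout pairs well with a general-purpose codec such as ZLIB is to compress the same table in both orders. The sketch below uses Python's `zlib` module on made-up data. It is a toy comparison, not the ORC file layout, and the sizes it prints depend entirely on the data, so treat it as an experiment to run rather than a benchmark.

```python
import zlib

# Toy comparison: NOT the ORC file layout. The same table is serialized
# row by row and column by column, then each byte string is compressed with zlib.
rows = [(i, "ID" if i < 500 else "SG", i % 7) for i in range(1000)]

# Row layout: one line per row, values separated by commas.
row_bytes = "\n".join(",".join(str(v) for v in row) for row in rows).encode("utf-8")

# Column layout: one line per column, so similar values sit next to each other.
column_bytes = "\n".join(
    ",".join(str(row[c]) for row in rows) for c in range(3)
).encode("utf-8")

row_compressed = zlib.compress(row_bytes)
column_compressed = zlib.compress(column_bytes)

# Both layouts hold the same values and the same number of separators,
# so any difference in compressed size comes from the ordering alone.
print("row layout:   ", len(row_bytes), "->", len(row_compressed))
print("column layout:", len(column_bytes), "->", len(column_compressed))
```

Grouping values of the same type and range together tends to give the compressor longer repeated patterns to work with, which is the intuition behind the high compression rates of columnar formats.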
References
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
- http://parquet.apache.org/
- https://avro.apache.org/
- https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
- https://www.linkedin.com/pulse/hdfc-storage-data-format-like-avro-vs-parquet-orc-jeetendra-gangele/
- https://www.nexla.com/resource/introduction-big-data-formats-understanding-avro-parquet-orc/
- https://community.cloudera.com/t5/Support-Questions/Between-Avro-Parquet-and-RC-ORC-which-is-useful-for/td-p/222271
- https://blog.clairvoyantsoft.com/big-data-file-formats-3fb659903271
