Insights Into Parquet Storage

Cinto · Published in The Startup · Jul 30, 2020 · 6 min read

Most people working on Big Data will have heard of Parquet and how it is optimized for storage. Here I will share some insights into the Parquet architecture and how and why it is optimized, along with some tips for using Parquet effectively so you can take advantage of all of its features.

What is Parquet

Parquet is an open-source file format from the Hadoop ecosystem. It is a flat columnar storage format that is highly performant for both storage and querying.

In columnar storage, each column's values are stored in one or more contiguous blocks. Here are some advantages of columnar storage:

  1. Better compression: similar values from the same column sit next to each other, so they compress more efficiently
  2. Faster querying: columns that are not needed to answer a query are never fetched. If a table has 10 columns and we only group by one of them, the remaining nine columns need not be loaded

Internals of parquet

Here is what the Parquet file layout looks like:

Block: the physical representation of data on HDFS, and by default the minimum unit that can be read

File: one or more blocks constitute a file. A file holds metadata but may or may not contain any actual data

Row Group: a logical horizontal partitioning of the data in a Parquet file, and the minimum amount of data that can be read from a Parquet file. Ideally, the row group should be closer to the HDFS…

