Insights Into Parquet Storage
Most people working on big data will have heard of Parquet and how it is optimized for storage. Here I will share some insights into Parquet's architecture and how and why it is optimized, along with some tips for using Parquet effectively so you get the most out of its features.
What is Parquet
Parquet is an open-source file format in the Hadoop ecosystem. It is a flat columnar storage format that is highly performant in terms of both storage and querying.
In columnar storage, each column is stored in one or more contiguous blocks. Here are some advantages of columnar storage:
- Better compression: similar values stored together compress efficiently
- Faster querying: a query only needs to fetch the columns it references. If a table has 10 columns and we group by just one of them, the remaining nine columns need not be loaded
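The column-pruning advantage can be sketched in plain Python. This is a toy layout, not the real Parquet binary format: when each column is stored contiguously and a footer records each column's offset and length, a reader can decode just the one column a query needs.

```python
import json
import struct

# Toy columnar file: each column is serialized contiguously, and a footer
# maps column name -> (offset, length). This mimics the *idea* behind
# Parquet's layout, not its actual binary format.
def write_columnar(table: dict) -> bytes:
    body = b""
    index = {}
    for name, values in table.items():
        blob = json.dumps(values).encode()
        index[name] = (len(body), len(blob))
        body += blob
    footer = json.dumps(index).encode()
    # The footer length is stored last, so a reader can locate the index
    # (Parquet also keeps its metadata in a footer at the end of the file).
    return body + footer + struct.pack("<I", len(footer))

def read_column(data: bytes, name: str) -> list:
    footer_len = struct.unpack("<I", data[-4:])[0]
    index = json.loads(data[-4 - footer_len:-4])
    offset, length = index[name]
    # Only this column's bytes are decoded; all other columns are skipped.
    return json.loads(data[offset:offset + length])

table = {"country": ["US", "IN", "US"], "clicks": [3, 7, 5]}
data = write_columnar(table)
print(read_column(data, "clicks"))  # → [3, 7, 5]
```

A row-oriented file would force the reader to scan past every row's `country` value to reach each `clicks` value; here the unwanted column is never even deserialized.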
Internals of Parquet
Here is what the Parquet file layout looks like:
Block: the physical representation of data on HDFS and the minimum size that can be read; by default, an HDFS block is 128 MB
File: one or more blocks constitute a file. It may or may not contain any data
Row Group: a logical horizontal partitioning of the data in a Parquet file and the minimum amount of data that can be read from it. Ideally, the row group size should be close to the HDFS block size…
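To see why row groups are the unit of reading, here is a toy sketch in plain Python (again, not the real Parquet format): the table is split into row groups, each carrying min/max statistics for a column, the way Parquet keeps per-row-group column statistics. A reader can then skip entire groups without deserializing them. The `clicks` column and group size of 4 rows are made-up illustration values.

```python
import json

# Toy "file": rows split into row groups, each stored with min/max
# statistics for the `clicks` column (mimicking Parquet's row-group stats).
rows = [{"country": "US", "clicks": c} for c in range(10)]
group_size = 4  # illustration only; real Parquet sizes row groups in bytes

groups = []
for i in range(0, len(rows), group_size):
    chunk = rows[i:i + group_size]
    stats = {"min": min(r["clicks"] for r in chunk),
             "max": max(r["clicks"] for r in chunk)}
    groups.append({"stats": stats, "data": json.dumps(chunk)})

# Query: clicks > 7. Row groups whose max is too small are skipped
# entirely, without ever deserializing their data.
matches = []
for g in groups:
    if g["stats"]["max"] > 7:
        matches.extend(r for r in json.loads(g["data"]) if r["clicks"] > 7)

print([r["clicks"] for r in matches])  # → [8, 9]
```

Only the last row group (clicks 8 and 9) is decoded; the first two are eliminated by their statistics alone, which is why a row group sized near the HDFS block lets a reader do this skipping at the granularity of whole blocks.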