What’s the buzz about Parquet File format?
Parquet is an efficient, hybrid row-columnar file format that supports compression and encoding, which makes it even more performant both for storage and for reading data.
Parquet is a widely used file format in the Hadoop ecosystem, and it has been broadly adopted across the data science world, mainly due to its performance.
We know Parquet as a row-columnar file format, but it does more than that under the hood to store data efficiently.
In this blog we will talk in depth about the Parquet file format and why it is preferred in the big data ecosystem.
Parquet was initiated by Twitter and Cloudera, and was inspired by Google's Dremel paper: https://research.google/pubs/pub36632/
Why do we need to worry about file formats?
1. Easy integration with the current pipeline
2. Less IO
3. Less storage
4. Less network IO
5. Faster query time
6. And many more
Parquet is an open source file format available to any project in the Hadoop ecosystem.
At a high level, we know that the Parquet file format is:
1. Hybrid storage
2. Supports nested columns
3. Binary format
4. Storage efficient
We will be going through all the above in depth.
Let’s take an example and see how the data is represented in Parquet.
A simple example file:
Here we can see that hybrid storage is a combination of row and columnar storage.
When individual columns grow big, storing them purely column-oriented doesn't give better performance. Consider reading the second column in a table where the record count is a million: in a pure columnar layout we would need to traverse a million values of the first column before we can reach the second column.
Plus, as we know, the files are immutable, so we cannot append new data to the old column chunks.
Hybrid storage comes to the rescue here: rows are split into row groups, and within each row group the data is stored column by column.
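As a rough pure-Python sketch of the idea (the function and names here are illustrative, not Parquet's actual internals), the hybrid layout partitions rows into row groups and then stores each row group column-wise:

```python
def hybrid_layout(rows, row_group_size):
    """Partition rows into row groups; within each group, lay data out per column."""
    row_groups = []
    for start in range(0, len(rows), row_group_size):
        group = rows[start:start + row_group_size]
        # Within a row group, values are stored column by column (column chunks).
        column_chunks = [list(col) for col in zip(*group)]
        row_groups.append(column_chunks)
    return row_groups

rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
layout = hybrid_layout(rows, row_group_size=2)
# Two row groups, each holding one chunk per column:
# [[['a', 'b'], [1, 2]], [['c', 'd'], [3, 4]]]
```

New rows simply become a new row group appended at the end, which is why immutable column chunks are not a problem in this layout.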
Supports nested columns
When we list any directory that contains Parquet files, we see:
Let's take a file with one nested column and see how it is represented in the Parquet file format.
Required: same repetition and definition levels as the parent
Optional: same repetition level as the parent; increment the definition level
Repeated: increment both repetition and definition levels
R value = whether the column is repeated, and at what level of nesting it repeats
D value = how far down the nesting we need to dig to know whether the value is null
Parquet encodes nested columns using the Dremel encoding method.
Based on the data, it records how many times a value repeats (repetition level) and at what definition level we need to traverse to find the value for the column.
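To make the levels concrete, here is a minimal hand-rolled sketch (not Parquet's own code) that computes (repetition level, definition level, value) triples for a single top-level repeated string field, where the maximum repetition and definition levels are both 1:

```python
def dremel_levels(records):
    """Encode a top-level repeated primitive field as (r, d, value) triples.

    r = 0 marks the start of a new record; r = 1 continues the current list.
    d = 1 means the value is present; d = 0 means the list is empty (null).
    """
    out = []
    for record in records:
        if not record:
            out.append((0, 0, None))   # empty list: nothing defined at this level
            continue
        for i, value in enumerate(record):
            r = 0 if i == 0 else 1     # first value of a record resets repetition
            out.append((r, 1, value))
    return out

triples = dremel_levels([["a", "b"], [], ["c"]])
# [(0, 1, 'a'), (1, 1, 'b'), (0, 0, None), (0, 1, 'c')]
```

Notice how record boundaries are fully recoverable from the r values alone, which is what lets Parquet store a flat stream of values for a nested column.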
There are different encoding schemes available; for the current discussion we shall talk only about the important ones.
In plain encoding, the data is stored as is, one value after another.
In incremental encoding, the value for the column is written in full once, and subsequent repetitive values are derived from the previously written value.
For example, below we can see that the value of the column is increasing.
So we write the encoded data once, and the remaining values are derived from the previous value.
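A simplified sketch of the principle (Parquet's real delta encodings use bit-packed blocks, but the idea is the same): store the first value in full, then only the difference from the previous value:

```python
def delta_encode(values):
    """Store the first value in full, then only each value's delta from the previous one."""
    if not values:
        return []
    return [values[0]] + [curr - prev for prev, curr in zip(values, values[1:])]

def delta_decode(encoded):
    """Rebuild the original values by accumulating the deltas."""
    values = []
    running = 0
    for i, delta in enumerate(encoded):
        running = delta if i == 0 else running + delta
        values.append(running)
    return values

timestamps = [1_000_000, 1_000_005, 1_000_009, 1_000_020]
encoded = delta_encode(timestamps)
# [1000000, 5, 4, 11] -- the small deltas compress far better than the raw values
assert delta_decode(encoded) == timestamps
```

Monotonically increasing columns such as timestamps or auto-increment IDs are the best case for this scheme, since the deltas stay tiny.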
Almost 50% reduction
Almost 80 to 90 percent reduction
1. If the dictionary grows too big, Parquet automatically falls back to plain encoding
2. You may increase parquet.dictionary.page.size
3. Or decrease parquet.block.size
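The dictionary page size matters because of how dictionary encoding works; here is a minimal sketch (illustrative only, not Parquet's internals): each distinct value is stored once in a dictionary, and the column itself stores only small integer indices.

```python
def dict_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)  # first sighting: add to the dictionary
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

cities = ["NY", "NY", "LA", "NY", "LA", "SF"]
dictionary, indices = dict_encode(cities)
# dictionary == ['NY', 'LA', 'SF'], indices == [0, 0, 1, 0, 1, 2]
```

This pays off for low-cardinality columns; once the dictionary itself approaches the size of the raw data (high cardinality), the fallback to plain encoding makes sense.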
Other encoding schemes
> Run Length Encoding
> Delta Encoding
> Delta Length Byte Array
> Delta Strings
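Run length encoding, for instance, collapses runs of identical consecutive values into (value, count) pairs; a toy version (real Parquet combines RLE with bit packing):

```python
def rle_encode(values):
    """Collapse runs of identical consecutive values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1           # extend the current run
        else:
            runs.append([v, 1])        # start a new run
    return [tuple(run) for run in runs]

flags = [1, 1, 1, 0, 0, 1]
# rle_encode(flags) == [(1, 3), (0, 2), (1, 1)]
```

This is especially effective on the repetition and definition levels themselves, which tend to contain long runs of the same small integer.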
Data Representation in the parquet file
- Block (HDFS block): a physical block of data in the storage layer (HDFS, ADLS, S3)
- File: a logical unit of data in the storage layer (HDFS, ADLS, S3); a Parquet file must include the file metadata
- Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
- Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
- Page: column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types, which are interleaved in a column chunk.
The file metadata contains the schema, Thrift headers, offsets, and the number of rows.
Row group: a group of rows
There can be multiple row groups in a single file.
Under a row group, the columns are cut into column chunks.
If an individual value is null, it is not stored.
If a whole column is null, it is not stored either.
Within the column chunk there is a shared page header.
Under the page header we have pages, the individual atomic units of Parquet, each containing:
§ R values
§ D values
§ Encoded data
Footer: metadata at three levels: file, row group, and column metadata
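Putting the layout together, here is a rough pure-Python model of the hierarchy described above (illustrative only; the real format uses Thrift structures with many more fields):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:
    r_values: List[int]       # repetition levels for the values in this page
    d_values: List[int]       # definition levels for the values in this page
    encoded_data: bytes       # the encoded column values themselves

@dataclass
class ColumnChunk:
    column_name: str
    pages: List[Page]         # pages are the atomic unit of encoding/compression

@dataclass
class RowGroup:
    column_chunks: List[ColumnChunk]  # one chunk per column in the dataset

@dataclass
class ParquetFileModel:
    row_groups: List[RowGroup]
    footer: dict = field(default_factory=dict)  # file, row-group, column metadata

f = ParquetFileModel(
    row_groups=[RowGroup([ColumnChunk("name", [Page([0], [1], b"a")])])],
    footer={"num_rows": 1, "schema": ["name"]},
)
```

Because the footer sits at the end of the file, a writer can stream row groups out sequentially and only write the metadata once everything else is known.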
Optimisation using Parquet
During a select query with a where clause, data is skipped during reading based on the query predicate.
The where clause is pushed down to the file and compared against encoded metadata values such as min and max; if the required data cannot be present in a row group, that row group is excluded from reading.
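A toy sketch of that pruning logic (hypothetical function; real engines read these statistics from the file footer):

```python
def row_groups_to_read(row_group_stats, predicate_value):
    """Keep only row groups whose [min, max] range could contain the predicate value."""
    keep = []
    for i, (lo, hi) in enumerate(row_group_stats):
        if lo <= predicate_value <= hi:
            keep.append(i)      # this row group might match; it must be read
    return keep

# Per-row-group (min, max) statistics for a column, e.g. "age":
stats = [(1, 10), (11, 20), (21, 30)]
# WHERE age = 15 -> only the second row group needs to be read:
# row_groups_to_read(stats, 15) == [1]
```

This is why sorted data prunes so well: sorting keeps each row group's min/max range narrow and non-overlapping.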
Things to remember while using predicate pushdown:
1. Pushdown filters don't work well with AWS S3 where there is no random access: the data must be pulled over the network first, and the filter is applied afterwards
2. Pushdown filters don't work on nested columns
3. If a column mixes very high and very low values across row groups, min/max pruning helps little: sort the data first
4. Use the same data type in the where clause as the column's data type
5. Writing Parquet is slower, since it has to create row groups, split them into column chunks, and write the metadata along with the encoded data
Please post your feedback in the comments.