Published in Analytics Vidhya

What’s the buzz about Parquet File format?


Parquet is an efficient columnar file format (more precisely, a hybrid of row and column storage) that supports compression and encoding, which makes it efficient both in storage and when reading the data.

Parquet is widely used in the Hadoop ecosystem and is well received across the data science world, mainly because of its performance.

We know Parquet as a row-columnar file format, but it does more than that under the hood to store data efficiently.

In this blog we will look in depth at the Parquet file format and why it is preferred in the big data ecosystem.

Overview:

Parquet was initiated by Twitter and Cloudera and was inspired by Google's Dremel paper: https://research.google/pubs/pub36632/

Why do we need to worry about file formats?

1. Easy integration with current pipeline

2. Less IO

3. Less storage

4. Network IO

5. Cost

6. Query time

7. And many more

Parquet is an open source file format available to any project in the Hadoop ecosystem.

At a high level, we know that the Parquet file format is:

1. Hybrid storage

2. Supports nested columns

3. Binary format

4. Encoded

5. Compressed

6. Storage efficient

We will be going through all the above in depth.

Hybrid storage

Let’s take an example and see how the data is represented in Parquet.

[Figures: example data; row-based storage; column-based storage; hybrid storage]

Here we can see that hybrid storage is a combination of row and columnar storage.

When individual columns grow very large, storing each one whole in a purely column-oriented layout no longer gives better performance.

Suppose we need to read the second column of a table with a million records. We would have to get past a million values of the first column before we reach the second column.

Plus, since the files are immutable, we cannot append new data to the old column chunks.

Hybrid storage comes to the rescue here: the rows are first split into groups, and within each group the data is stored column by column.
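To make the three layouts concrete, here is a minimal Python sketch. The table, the row_group_size parameter and the function names are made up for illustration; real Parquet writes binary pages, not Python lists.

```python
# Hypothetical 6-row table with three columns (illustration only).
rows = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "c", 30.0),
    (4, "d", 40.0),
    (5, "e", 50.0),
    (6, "f", 60.0),
]


def row_layout(rows):
    """Row based: all fields of row 1, then all fields of row 2, and so on."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column based: the whole first column, then the whole second column, and so on."""
    return [value for column in zip(*rows) for value in column]


def hybrid_layout(rows, row_group_size):
    """Hybrid: cut the rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        group = rows[start:start + row_group_size]
        groups.append(column_layout(group))
    return groups


print(row_layout(rows)[:6])        # [1, 'a', 10.0, 2, 'b', 20.0]
print(column_layout(rows)[:6])     # [1, 2, 3, 4, 5, 6]
print(hybrid_layout(rows, 3)[0])   # [1, 2, 3, 'a', 'b', 'c', 10.0, 20.0, 30.0]
```

In the hybrid output, the second column of the first group starts after only three values of the first column, instead of after all six.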

Supports nested columns

When we list a directory that contains Parquet files, we see:

/my/path/to/parquet/folder/

-part-r-00000-dhb447fh5-c123s-w2232–23hghgg4534a12-snappy.parquet

-part-r-00001-dhb447fh5-c123s-w2232–23hghgg4534a12-snappy.parquet

Let’s take a file with a nested column and see how it is represented in the Parquet file format.

Required: same repetition and definition levels as the parent

Optional: same repetition level as the parent, increment the definition level

Repeated: increment both the repetition and definition levels

R value = whether the column is repeated, and at what level of nesting it is repeated

D value = how far we need to dig to see whether that value is null
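The three rules above can be turned into a small Python sketch that walks a schema and works out the maximum R and D values for every leaf column. The address book schema and the Field class are hypothetical, used only to show the rules at work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    repetition: str  # "required", "optional" or "repeated"
    children: List["Field"] = field(default_factory=list)


def max_levels(node, r=0, d=0, path=()):
    """Return {leaf path: (max repetition level, max definition level)}."""
    if node.repetition == "optional":
        d += 1               # same repetition level, one more definition level
    elif node.repetition == "repeated":
        r += 1               # both levels go up
        d += 1
    # "required" keeps the levels of the parent unchanged
    path = path + (node.name,)
    if not node.children:
        return {".".join(path): (r, d)}
    levels = {}
    for child in node.children:
        levels.update(max_levels(child, r, d, path))
    return levels


# Hypothetical schema: an address book with a repeated group of contacts.
schema = [
    Field("owner", "required"),
    Field("phone_numbers", "repeated"),
    Field("contacts", "repeated", [
        Field("name", "required"),
        Field("phone", "optional"),
    ]),
]

levels = {}
for top_level_field in schema:
    levels.update(max_levels(top_level_field))

for column, (r, d) in levels.items():
    print(column, "R =", r, "D =", d)
```

For this schema, contacts.phone ends up with R = 1 and D = 2: one level from the repeated contacts group, plus one more definition level because phone itself is optional.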

Encoded

Parquet encodes nested columns using the Dremel encoding method.

Based on how the data is repeated, it records at which nesting level a value repeats (the R value) and how deep we need to traverse to find the value for the column (the D value).

There are different encoding schemes available. For the current discussion we will talk only about the important ones.

Plain:

In plain encoding, the values are stored as is, one after another.

Incremental Encoding:

In incremental encoding, the value for the column is written in full once, and each following value is derived from the previously written value.

For example, when the values in a column keep increasing, we write the first value as is and, for the remaining values, store only what is needed to derive each one from the previous value.

Almost 50 percent reduction
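Here is a minimal Python sketch of that idea, assuming the derived part is simply the difference from the previous value. The timestamps are made-up sample data, and Parquet's own delta encodings additionally pack the differences into compact bit-packed blocks, which is not shown here.

```python
def delta_encode(values):
    """Keep the first value, then only the difference from the previous value."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values by adding the differences back up."""
    values = []
    running = 0
    for position, item in enumerate(encoded):
        running = item if position == 0 else running + item
        values.append(running)
    return values


# Hypothetical column of increasing timestamps.
timestamps = [1609459200, 1609459201, 1609459203, 1609459206]
encoded = delta_encode(timestamps)
print(encoded)                        # [1609459200, 1, 2, 3]
print(delta_decode(encoded) == timestamps)
```

The small differences need far fewer bits than the full values, which is where the saving comes from.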

Dictionary Encoding:

Almost 80 to 90 percent reduction

Note:

1. If the dictionary grows too big, there is an automatic fallback to plain encoding

2. You may increase parquet.dictionary.page.size

3. or decrease parquet.block.size
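Dictionary encoding and the fallback mentioned in the note can be sketched in a few lines of Python. This is only an illustration: max_dictionary_size is a hypothetical parameter that counts entries, whereas the real limit is a size in bytes, and the country codes are made-up sample data.

```python
def dictionary_encode(values, max_dictionary_size):
    """Return ("dictionary", dictionary, indexes), or ("plain", values) on fallback."""
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index into the dictionary
    indexes = []
    for value in values:
        if value not in positions:
            if len(dictionary) == max_dictionary_size:
                # Too many distinct values: give up and store the data as is.
                return ("plain", list(values))
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return ("dictionary", dictionary, indexes)


countries = ["IN", "US", "IN", "IN", "UK", "US"]
print(dictionary_encode(countries, 4))
# ('dictionary', ['IN', 'US', 'UK'], [0, 1, 0, 0, 2, 1])
print(dictionary_encode(countries, 2)[0])
# plain
```

Each repeated string is replaced by a small integer, so columns with few distinct values shrink the most.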

Other encoding schemes

> Plain

> Dictionary

> Run length Encoding

> Delta Encoding

> Delta length ByteArray

> Delta String
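As one example from this list, run-length encoding replaces a run of identical values with the value and a count. The sketch below shows only the basic idea in Python with made-up data; Parquet's own scheme is a hybrid of run-length encoding and bit packing, which is not reproduced here.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal values into a (value, run length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run length) pairs back into the original values."""
    return [value for value, count in pairs for _ in range(count)]


flags = [0, 0, 0, 1, 1, 0]
pairs = rle_encode(flags)
print(pairs)                       # [(0, 3), (1, 2), (0, 1)]
print(rle_decode(pairs) == flags)
```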

Data Representation in the parquet file

  • Block (HDFS block): A logical split of the data in the storage layer (HDFS, ADLS, S3)
  • File: A physical block of the data in the storage layer (HDFS, ADLS, S3)
  • Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
  • Column chunk: A chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  • Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types, which are interleaved in a column chunk.

Pictorial representation:

File metadata

Contains the schema, Thrift headers, offsets and the number of rows

Row group: a group of rows

A single file can hold multiple row groups

Within a row group, each column is cut into a column chunk

If an individual value is null, it is not stored

If a whole column is null, nothing is stored for it

Column chunks:

Within the column chunk we have page headers

Under each page header we have a page

Pages are the individual atomic units of Parquet. Each page holds:

§ Metadata

§ R values

§ D values

§ Encoded data

Footer: file, row group and column metadata
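The point that null values are not stored can be sketched for the simplest case, a flat optional column, where the D value is 1 for a present value and 0 for a null. The function names and sample data are made up for illustration; nested columns need the full R and D machinery.

```python
def encode_optional_column(values):
    """Split a flat optional column into D values and the non-null values."""
    definition_levels = [0 if value is None else 1 for value in values]
    stored = [value for value in values if value is not None]
    return definition_levels, stored


def decode_optional_column(definition_levels, stored):
    """Put the nulls back by walking the D values."""
    remaining = iter(stored)
    return [next(remaining) if level == 1 else None for level in definition_levels]


column = ["x", None, "y", None, None]
levels, stored = encode_optional_column(column)
print(levels)   # [1, 0, 1, 0, 0]
print(stored)   # ['x', 'y']
print(decode_optional_column(levels, stored) == column)
```

Only two values are written for five rows; the nulls cost nothing beyond their small D values.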

Optimisation using Parquet

Predicate Pushdown

For a select query with a where clause, data that cannot match the clause is skipped instead of being read.

The where clause is pushed down to the file and compared against the stored metadata values such as min and max, and the data is excluded if the required values cannot be present in it.

parquet.filter.dictionary.enabled=true
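The min and max comparison described above can be sketched in Python as follows. The row groups here are plain dictionaries and the function names are hypothetical; this is not how any real Parquet reader is implemented, only the skipping logic.

```python
def write_row_groups(values, row_group_size):
    """Cut a column into row groups and record min/max for each, like footer metadata."""
    groups = []
    for start in range(0, len(values), row_group_size):
        chunk = values[start:start + row_group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_equal(groups, target):
    """Evaluate `column = target`, skipping groups whose min/max rule the value out."""
    matches = []
    groups_read = 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # decided from the metadata alone, the values are never read
        groups_read += 1
        matches.extend(value for value in group["values"] if value == target)
    return matches, groups_read


unsorted_column = [5, 90, 12, 70, 1, 99, 40, 60]
print(scan_equal(write_row_groups(unsorted_column, 4), 40))          # ([40], 2)
print(scan_equal(write_row_groups(sorted(unsorted_column), 4), 40))  # ([40], 1)
```

With unsorted data every row group has a wide min/max range and both groups have to be read. Once the column is sorted, the same query touches only one group.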

Things to remember while using predicate pushdown:

1. Pushdown filters do not work with AWS S3 because there is no random access, so the data is pulled over the network first and the filter is applied afterwards

2. Pushdown filters do not work on nested columns

3. If a column mixes very high and very low values, sort it first so that each row group covers a narrow min/max range

4. Use the same data type in the where clause as the column

5. Writing Parquet is slower, as it has to create the row groups, split them into column chunks and write the metadata along with the encoded data.

Please post your feedback in the comments.

Ajith Shetty

BigData Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.

