Cool Sh*t I Just Learned: Parquet’s Predicate Pushdown

Does it really exist? A pursuit of empirical evidence for Parquet’s Predicate Pushdown 🕵🏻‍♂️

Ence Prastama
4 min read · May 31, 2020

I have worked with the duet of Presto and Parquet for quite some time now. As you may know, thanks to columnar storage there is this feature called Predicate Pushdown that can make your query execution time blazing fast ⚡️⚡️⚡️

What is this Predicate Pushdown?
TL;DR: with Predicate Pushdown, you only pull the data that satisfies your predicate!

less data → fewer I/O operations → less network traffic → faster query execution time → less time wasted → less life wasted!

So let’s say you want to query my_table with the following query, and your parquet files have the following metadata:

query and my_table’s parquet metadata sample

Then, instead of pulling all 3 files and filtering them inside your query engine, be it Presto or Spark or whatever, it will pull only a single file, file_part03, because the parquet reader knows that in both file_part01 and file_part02 the user_id values are less than or equal to 200, so the parquet reader is like “why bother, Dude?” 😎
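To make the file-skipping concrete, here is a minimal sketch of that decision in Python. The file names and min/max numbers are made up to mirror the sample above; a real reader gets them from each file’s footer, but the pruning logic is essentially this comparison:

```python
# Toy model of file-level predicate pushdown: each parquet file's footer
# carries per-column min/max statistics; the reader skips any file whose
# stats cannot possibly satisfy the predicate.
# (Stats below are invented to mirror the sample metadata above.)
file_stats = {
    "file_part01": {"user_id": {"min": 1,   "max": 100}},
    "file_part02": {"user_id": {"min": 101, "max": 200}},
    "file_part03": {"user_id": {"min": 201, "max": 300}},
}

def files_to_read(predicate_min, stats):
    # For a predicate `user_id > predicate_min`, only files whose max
    # exceeds the bound can possibly contain matching rows.
    return [name for name, s in stats.items()
            if s["user_id"]["max"] > predicate_min]

selected = files_to_read(200, file_stats)  # only file_part03 survives
```

Under these made-up stats, a predicate like `user_id > 200` leaves exactly one file to pull.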

And you know what’s cooler?
So my friend… this magic does not only happen at the file level. WAIT WHAATT?? Yes! It also works at the row_group level inside each parquet file. So you might end up pulling only a half or a quarter of a file, just the parts that satisfy your predicate. Ain’t Predicate Pushdown awesome, huh? 😉
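The same min/max comparison, sketched one level down: a parquet file is split into row groups, each with its own footer statistics, so a reader can skip whole chunks of a single file. Again, the stats here are invented for illustration:

```python
# Toy model of row-group-level pruning inside ONE parquet file.
# Each row group has its own min/max stats in the footer.
# (These numbers are made up for illustration.)
row_groups = [
    {"user_id": {"min": 1,   "max": 150}},   # row group 0
    {"user_id": {"min": 151, "max": 300}},   # row group 1
    {"user_id": {"min": 301, "max": 450}},   # row group 2
    {"user_id": {"min": 451, "max": 600}},   # row group 3
]

def row_groups_to_read(predicate_min, groups):
    # For `user_id > predicate_min`, only groups whose max exceeds the
    # bound can contain matching rows; the rest are never read.
    return [i for i, g in enumerate(groups)
            if g["user_id"]["max"] > predicate_min]

kept = row_groups_to_read(300, row_groups)  # only half the file survives
```

With `user_id > 300`, row groups 0 and 1 are skipped entirely, so only half of the file’s data is pulled.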

But but but... despite these heavenly promises of Predicate Pushdown, I myself have never seen it in action with my own eyes, let alone seen the proven optimization that makes my queries blazing fast.

So I wonder… does it really exist, tho? Or is it merely a conspiracy theory to brainwash innocent young Data Engineers into believing they can wipe out world hunger and poverty with Big Data, Deep Learning and Blockchain (?) hmmm 🤔

If you can’t measure it — it doesn’t exist.

That quote, which I read somewhere I can’t remember, echoes in my ears, burning my inner empirical spirit, thirsty for empirical evidence 🔥🔥

So I jumped in and surfed the web for hours, searching for hints on how to measure this Predicate Pushdown witchery, only to return empty-handed 🙁☹️

I sighed… took a deep breath, and soldiered on! Then I remembered that I had joined the Presto Slack channel, so I asked a question there. A couple of minutes later, there came a light of hope: an answer by an angel named Tom Nats…

“…each orc/parquet file’s footer is read and those stats are used to determine if the file needs to continue being read.”

So I rushed back from my hopelessness to download my parquet files and check their metadata, looking for a sign of hope… 🏃‍♂️💨

my_table parquet metadata

Based on that metadata, we can see that user_id has stats! So I typed my query and pressed Enter as hard as I could, hoping it would reach Presto faster!

And the Mystery of Predicate Pushdown unraveled...

As I queried my_table with a user_id beyond the max, it pulled 0 rows / 0 bytes out of 132,300,443 rows / 630.84 MB from my parquet 🥺🥺
it pulled NOTHING!

query with a predicate of user_id beyond the max
query with a predicate that should match all the rows of my_table

And as I decreased the value of my predicate below the max, it pulled more and more rows and bytes, little by little.

query with a predicate slightly below the max pulls the data little by little

Now that I have seen the evidence of Predicate Pushdown existence with my own eyes, my spirit is at peace, my body is at ease and I can now rest in peace… ⚰️

So follow me on Medium for more of my writings! If I (ever) write again tho 🤗 Feel free to drop your comments, suggestions or corrections; I’ll be the luckiest boy learning from you guys 🥰
