Cool Sh*t I Just Learned: Parquet’s Predicate Pushdown
Does it really exist? A pursuit of empirical evidence for Parquet’s Predicate Pushdown 🕵🏻♂️
I have worked with the duo of Presto and Parquet for quite some time now. As you may know, thanks to columnar storage there is this feature called Predicate Pushdown that can make your query execution time blazing fast ⚡️⚡️⚡️
What is this Predicate Pushdown?
TL;DR: with Predicate Pushdown, you only pull data that satisfies your predicate!
less data → fewer I/O operations → less network traffic → faster query execution time → less time wasted → less life wasted!
So let’s say you want to query my_table with the following query, and your parquet files have the following metadata.
Then, instead of pulling all 3 files and filtering them inside your query engine, be it Presto or Spark or whatever, it will pull only a single file, file_part03, because the parquet reader knows that in both file_part01 and file_part02 the user_id values are less than or equal to 200, so them parquet readerz be like “why bother, Dude?” 😎
And you know what’s cooler?
So my friend.. this magic does not only happen at the file level. WAIT WHAATT?? Yes! This magic also works at the row_group level inside each parquet file. So you might end up pulling only a half or a quarter of a file, just the row groups that satisfy your predicate. Ain’t Predicate Pushdown awesome, huh? 😉
But but but... regarding this heavenly promise of Predicate Pushdown, I myself have never seen it in action with my own eyes, let alone seen the proven optimization that makes my queries blazing fast.
So I wonder.. does it really exist, or is it merely a conspiracy theory to brainwash innocent young Data Engineers into believing they can wipe out world hunger and poverty with Big Data, Deep Learning and Blockchain (?) hmmm 🤔
If you can’t measure it — it doesn’t exist.
That quote, which I read I don’t remember where, rebounds in my ears, burning my inner empirical spirit, thirsty for empirical evidence 🔥🔥
So I jumped in and surfed the web for hours, searching for hints on how to measure this Predicate Pushdown witchery, only to return empty-handed 🙁☹️
I sighed.. took a deep breath, and soldiered on! Then I remembered that I had joined the Presto Slack channel, so I asked a question there. A couple of minutes later, there came a light of hope, an answer by an angel named Tom Nats..
“…each orc/parquet file’s footer is read and those stats are used to determine if the file needs to continue being read.”
So I rushed back from my hopelessness to download and check my parquet files’ metadata, looking for a sign of hope… 🏃♂️💨
Based on that metadata, we can see that user_id has stats! So I typed my query and pushed Enter as hard as I could, hoping my query would reach Presto faster!
And the Mystery of Predicate Pushdown unraveled...
As I queried my_table with a user_id beyond the max, it pulled 0 rows, 0 bytes out of 132,300,443 rows (630.84 MB) from my parquet files 🥺🥺
it pulled NOTHING!
And as I decreased the value of my predicate little by little below the max, it pulled more and more rows and bytes, little by little too.
Now that I have seen the evidence of Predicate Pushdown’s existence with my own eyes, my spirit is at peace, my body is at ease, and I can now rest in peace… ⚰️
So follow me on Medium for more of my writings! If I (ever) write again tho 🤗 Feel free to drop your comments, suggestions or corrections; I’ll be the luckiest boy learning from you guys 🥰