Data at Scale: Learn How Predicate Pushdown Will Save You Money

Adi Polak
Microsoft Azure
Published in
3 min readDec 18, 2018

--

Even if you spend a lot of time working with data at scale, you might not be aware of Predicate Pushdown and its importance when building products. Wonder what is it good for and why? read this.

What is Predicate Pushdown?

Predicate Pushdown gets its name from the fact that portions of SQL statements, ones that filter data, are referred to as predicates. They earn that name because predicates in mathematical logic and clauses in SQL are the same kind of thing — statements that, upon evaluation, can be TRUE or FALSE for different values of variables or data.

It can improve query performance by reducing the amount of data read (I/O) from Storage files. The Database process evaluates filter predicates in the query against metadata stored in the Storage files.

How does Predicate Pushdown helps process Data at Scale?

The metadata helps the storage process to decide which files are relevant for reading before that happens.

  • If you issue a query in one place to run against a lot of data that’s in another place, you could spawn a lot of network traffic, which could be slow and costly. But …
  • … if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.
  • given the storage metadata, “push down” help us decided which files are relevant and which are not.

A few prerequisites for successful pushdown:

  • Storage is balanced :

Means that the metadata files size is smaller than reading the actual data files.

metadataFiles.size() << actualFiles.size()
  • Storage accurate statistics, metadata and indices.

Why should we care:

When processing data at scale we look at the next bullet points that might hurt our pocket:

  • Network traffic (sending data over the network)
  • I/O ( reading disk files are bounded to disk reading speed)
  • Time (product SLA goals — processing time agreed upon)

When we use the full power of push-down predicate, we save money by reading only the data we need . Which results in faster query processing. Hence, pay only for what is used.

On top of that, we save in network traffic, I/O and precious time.

Examples of how to use it:

Reference:

Follow me on Medium for more posts about Scala, Kotlin,Big data, clean code and software engineers nonsense. Cheers !

--

--

Adi Polak
Microsoft Azure

👩‍💻 Software Engineer 📚 Author of Scaling Machine Learning with Spark (O'Reilly) 🗣️ Keynote Speaker 💫 Databricks ambassador