Predicate Pushdown for Apache Spark with Google BigQuery

Does predicate pushdown for Databricks on Google Cloud with BigQuery work? It does! And here is how to verify it.

Frank Munz
Geek Culture

--

When I tested the features of the recently released Databricks on the Google Cloud platform, I checked out the BigQuery integration. Databricks is using a fork of the open-source Google Spark Connector for BigQuery. So I wondered how to check if a certain predicate of a query is indeed pushed down to BigQuery (or not). It turns out it is easy!

Let’s take the natality public data set in Google BigQuery. The code from the notebook cell below uses the Spark storage API to load a table from BigQuery to a dataframe, pushing down the filter() predicate to BigQuery.

You can get the execution plan of the query including its optimizations with explain() method. For a more verbose output use explain(“extended”)

df.explain()

The mini example above is available as part of a notebook I used for further Databricks on GCP BigQuery tests. The full output of explain() contains the execution plan and lists all the optimizations applied. Look for the part describing the predicates pushed to BigQuery containing the keyword PushedFilters:

PushedFilters: [*IsNotNull(state), *IsNotNull(weight_pounds), *EqualTo(state,CA), *GreaterThan(weight_pounds,11.0)]

The output above shows that the predicates pushed down to BigQuery are exactly the conditions of the Spark query.

Databricks Spark on GCP optimizes for

  • nested filter pushdown and nested column pruning
  • array pushdown
  • expression pushdown

A notebook containing all the Spark query optimization examples above can be found in the Databricks documentation.

Where to go from here?

[1] Databricks Integration with BigQuery blog post
[2] Databricks on Google Cloud free trial
[3] BigQuery Sample Notebook

Please clap for this article if you enjoyed reading it as much as I enjoyed writing it. I spend way too much time on Twitter — feel free to connect: @frankmunz.

--

--

Frank Munz
Geek Culture

Cloudy things, large-scale data & compute. Twitter @frankmunz. Former Tech Evangelist @awscloud, Principal @Databricks now. personal opinions here. #devrel ❤️.