Big Query : Partition and Clustering

Padmanav Agarwal
Nov 5 · 3 min read

Big Query has created some buzz in recent past. The ability to query gigabytes of data in seconds is sure to turn heads. It did turn mine as well.
“This is a no-brainer!! All my large datasets are going here”. However, my beaming eyes were slowly topped with a frown. This power came at a cost — Big Query charges pretty high based on query execution size ( rightly, so ), which means good egress traffic will shoot up my bills.

Let’s discuss some techniques to save cost while enjoying the power of Big Query, using a readily available public dataset in BigQuery- the stack overflow dataset and see how things turn up.

Looking at all posts from 2019 onwards :

SELECT * FROM `bigquery-public-data.samples.stackoverflow_post_questions` WHERE creation_date > “2019–01–01”

A simple query, processing 27.3 GB in around 32 seconds, which is reasonably fast. However, getting billed for 27.3 GB for every fetch might get really expensive! Let’s try to reduce that cost. One answer is to partition the table to ensure faster retrieval of data.

Partitioning basically refers to grouping similar data while storing, based on a particular field(s). Big Query allows partitioning only on date or timestamp fields — either an existing date/time field or a pseudo date/time field, which is created automatically based on ingestion time. This grouping or bucketing ensures faster retrieval of data when filtered on the partitioned field.

Let’s try this on our problem. We will partition the table based on ‘creation_date’ field.

CREATE TABLE `SampleDS.stackoverflow_post_questions_partitioned`
PARTITION BY DATE(creation_date) AS
SELECT * FROM `bigquery-public-data.samples.stackoverflow_post_questions`

The same query on the new partitioned table runs through only 1.6 GB taking only 24 secs. 27 GB to 1.6 GB — Not Bad! Due to partitioning, similar ‘creation_date’ values are stored together, so that the entire table need not be scanned to match the filtering criteria.

So far so good. But, what if we want to filter on fields other than date/time fields ?

The answer is — Clustering. Clustering works with non date/time fields as well, grouping similar data into ‘clusters’ after partitioning.

It is important to note that this is no secret sauce

Let us walk-through another example, clustering our existing table based on ‘tags’ field to compare clustering performance :

CREATE TABLE `SampleDS.stackoverflow_post_questions_part_clus`
PARTITION BY DATE(creation_date)
CLUSTER BY
tags
AS
SELECT * FROM `bigquery-public-data.samples.stackoverflow_post_questions`

To get the sum of scores for posts after 1st January 2019 with tags containing ‘sql’ :

SELECT SUM(score) FROM `SampleProject.SampleDS.stackoverflow_post_questions_part_clus` WHERE creation_date > “2019–01–01” AND tags like ‘%sql%’

This scans through 37.1 MB of data for both clustered and the initial non-clustered tables. So, what went wrong ? Does clustering is a farce with no real benefit ? The problem is actually with our expectation. Since we are checking for tags containing ‘sql’, it needs to scan through all rows looking for that string, irrespective of storage order/grouping. However, if tweak the query to check for only tags beginning with ‘sql’ :

SELECT SUM(score) FROM `SampleProject.SampleDS.stackoverflow_post_questions_part_clus` WHERE creation_date > “2019–01–01” AND tags like ‘sql%’

The non-clustered table still scans through 37.1 MB data ( 938,280 rows in 1st stage), however, the clustered table scans through 36.7 MB ( 930,195 rows in 1st stage). The improvement is achieved as Big Query can skip checking some data rows due to the grouping, and need not scan each data row.

One important thing to note here is that, partitioning & clustering is advisable for only considerably large datasets. In fact, it may become an expensive bottleneck if used in smaller datasets. Also, in case of partitioning/clustering on multiple fields, the same order needs to be maintained during querying as well, to reap the benefits.

That’s all for now. I will be soon posting other insights on Big Query. Please write to me for improvements/suggestions/queries. Keep playing around with Big Query and share your findings here.

Happy Exploring! :)