Optimize BQ SQL: Cost & Performance with Clustering

Published in

Google Cloud - Community

6 min readSep 4, 2024

Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

In the realm of Big Data analytics, where massive datasets are the norm, query efficiency is paramount. Every millisecond of processing time and byte of data scanned directly impacts query costs.

Therefore, optimizing queries to minimize data processing and expedite results is crucial. Inefficient queries against large datasets can lead to substantial financial burdens, causing project delays or even outright cancellation.

Google BigQuery’s clustering feature is one of the potential solutions to this challenge that can be explored to significantly enhance query performance.

Let’s explore the intricacies of BigQuery clustering and the advantages it confers.

What is Clustering ?

Imagine you have a big box of toys. It’s so big that it takes a long time to find the toy you want to play with!

Clustering in BigQuery is like organizing your toys into smaller boxes. You put all the cars in one box, all the dolls in another, and all the building blocks in a third box. Now, when you want to play with cars, you only have to look in the car box!

BigQuery does the same thing with your data. It groups similar pieces of information together, so when you ask it a question, it doesn’t have to look through everything. It can go straight to the “box” where the answer is, making it much faster and less costly!

Why to apply clustering ?

Now that we understand what clustering is, why do we actually need to apply it? What are the benefits of clustering?

Improved Query Performance: Clustering speeds up queries that filter or aggregate data based on the clustered columns. Since BigQuery can directly access the relevant data blocks, it avoids scanning the entire table.
Cost Reduction: Faster queries translate to lower costs. BigQuery’s pricing is based on the amount of data processed, so efficient queries save you money.

When to apply clustering ?

Given the demonstrated benefits of clustering in terms of cost reduction and improved query performance, the logical subsequent inquiry is, In which scenarios or use cases is the application of clustering can be applicable or advantageous?

Do you have frequently accessed and long running queries on bigquery table with filter conditions applied ?
Is your big query table size more than 10 GB and any of the columns have distinct values?
Do you have queries that have filter conditions on columns with distinct values ?
Is the table being queried frequently with filters applied ?

If the answers to any of the above questions is yes, then you may consider clustering these tables on the specific columns

Let’s see this via an example

For an example, we will consider the bigquery public dataset chicago_taxi_trips table taxi_trips

The table has ~211.6 Million records and is 76 GB in size

The taxi_trips table schema is as below

Consider the column payment_type. It has distinct values

Considering the table size and the column payment_type having distinct values, this can be a good candidate for clustering in scenarios where you want to query the table for a specific or set of specific payment types.

To demonstrate the difference in clustering, we created another table as a copy of this table with clustering applied on the field payment_type

Let’s compare the query performance in both the scenarios

Query on non clustered table

SELECT * FROM bigquery-public-data.chicago_taxi_trips.taxi_trips WHERE payment_type = “Cash”

The above query on a non clustered table scans the entire table i.e. 76 GB’s of data.

Apply the query on clustered table

The same query when applied on a clustered table scans less bytes

Since less bytes are scanned, there is less consumption of resources and hence less query costs.

How to apply clustering

Having demonstrated the practical impact of clustering, let’s explore its implementation. While clustering can be defined during table creation, outcome of performance optimization exercises may necessitate adding clustering to existing tables. Thankfully, BigQuery allows clustering on both new and existing tables. We will now outline the methods and considerations for applying clustering.

Identify Clustering Candidates:

Begin by identifying the columns frequently used in your query filters or aggregations. These are the prime candidates for clustering.

Add Clustering During Table Creation:

One approach to incorporate clustering into a table during its creation is to include the CLUSTER BY clause within the CREATE TABLE DDL statement. The syntax is as follows:

CREATE TABLE tablename (fieldname datatype) CLUSTER BY fieldname;

The addition of CLUSTER BY fieldname instructs BigQuery to cluster the table based on the specified field. To cluster on multiple fields, simply separate the field names with commas within the CLUSTER BY clause.

Add clustering to an existing table

To apply clustering to an existing table, the bq update command can be executed in cloud shell

bq update — clustering_fields=fieldname dataset.table_name

While technically we can apply clustering to an existing non clustered table, however , it will only be applied to new records inserted or updated in the table. In order to make sure that all your table records are considered for clustering, you can follow any of the below approaches

Approach 1 : Truncate -> Cluster -> Reload

Take a backup of your table data in another bq table
Truncate the table
Apply clustering
Reload the table from the backup table
Delete the backup table

Approach 2 : Cluster -> Update All Records

Apply clustering to existing table
Update all records in the table using a command as below

UPDATE dataset.table SET fieldname=fieldname WHERE true

The above techniques will make sure that all table records are considered for clustering

Wait for clustering

BigQuery performs clustering in the background. The duration depends on the table size and the number of clustered columns.

Important Considerations:

Number of Clustered Columns: BigQuery allows up to four clustering columns per table. Choose the most impactful columns based on your query patterns.
Data Distribution: Clustering is most effective when the clustered columns have a reasonable number of distinct values. Avoid clustering on columns with very high cardinality (too many unique values) or very low cardinality (too few unique values).

Conclusion

BigQuery clustering is a valuable technique for optimizing query performance and cost-efficiency. By organizing your data strategically, you can unlock faster insights and streamline your Big Data workflows. Remember to carefully select your clustering columns, monitor query patterns, and re-cluster when necessary to keep your data in top shape.

Pro-Tip: Analyze your query logs to identify the most frequently used filter and aggregation columns. These insights will help you choose the optimal clustering strategy for your tables.

With the power of clustering at your disposal, you can harness the full potential of BigQuery and embark on a journey of seamless data exploration.

Optimize BQ SQL: Cost & Performance with Clustering

How to apply clustering

Conclusion

Written by Naitik Gada