Long Running Query Optimization Guide for Bigquery

Anas Aslam
Google Cloud - Community
5 min readNov 22, 2022

Introduction

Sometimes, queries run longer than you would like. In general, queries that do less work perform better. They run faster and consume fewer resources, which can create lower costs and fewer failures.

This document provides an overview of optimization techniques that can improve query performance in BigQuery.

Optimization Techniques:

  1. Filter Data before Join

Example: Finding the aggregate sales for a product on a particular region

select sum(sales),region_name, product_name from sales_tbl sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(region_name)=’APAC’ and upper(product_name) =’KEYBOARD’
group by region_name, product_name;

The above query can be optimized by applying the filter to the region and product table first using a sub-query , followed by joining with the sales table.

The optimized query is as follows:

select sum(sales),region_name, product_name from sales_tbl sales
inner join (select * from region_tbl region where upper(region_name)=’APAC’ ) region
on sales.region_id = region.region_id
inner join (select * from product_tbl product where upper(product_name) =’KEYBOARD’ ) product
on sales.product_id = product.product_id
group by region_name, product_name

Another scenario is to remove duplicates before join

Example: region_name from the sales table contains the duplicate values, resulting in the cross join.

select product_name from product_tbl product
inner join sales_tbl sales
on product.region_name = sales.region_name;

In order to avoid this , the query can be optimized by removing the duplicates from the sales table

The optimized query is as follows:

select product_name from product_tbl product
inner join (select region_name from sales_tbl group by region_name) sales
on product.region_name = sales.region_name;

2. Select Only Needed Columns

From the above example (Point 1) ,the query can be further optimized by replacing “ select * “ with “ select <column_list> “

select sum(sales),region_name, product_name from sales_tbl sales
inner join (select region_name,region_id from region_tbl region where upper(region_name)=’APAC’ ) region
on sales.region_id = region.region_id
inner join (select product_name,product_id from product_tbl product where upper(product_name) =’KEYBOARD’ ) product
on sales.product_id = product.product_id
group by region_name,product_name

3. Replace Distinct with Group_by

Example: Finding the distinct products from APAC region

select distinct product.product_name from region_tbl region
inner join product_tbl product
on region.region_id = product.region_id
where region_name = ‘APAC’

The above query can be optimized by replacing distinct with group by

select product.product_name from region_tbl region
inner join product_tbl product
on region.region_id = product.region_id
where region_name = ‘APAC’
group by product_name

4. Put the large table as the first table in the join clause

From the above example (Point 3), the Product table is the largest table compared to the Region table. Hence the query can be further optimized by changing the order of tables in the join clause as well in the join condition.

select product.product_name from product_tbl product
inner join region_tbl region
on product.region_id = region.region_id
where region_name = ‘APAC’
group by product_name

5. Use Partition, Clustering Columns first in Filter/Join Clause

Assume that the sales table is partitioned by sales_date and clustered by product_id.

select sum(sales), product_name
from sales
inner join product_tbl product
on sales.region_id =’APAC’
and sales.product_id = product.product_id
and on sales.sales_date = (select date from current_date_tbl )
group by region_name, product_name

The above query can be optimized as below

select sum(sales), product_name
from sales
inner join product_tbl product
on sales.sales_date = (select date from current_date_tbl )
and sales.product_id = product.product_id
and sales.region_id =’APAC’
group by region_name, product_name

As we can see, partition_column (sales_date) filter is applied first, followed by clustering column (product_id) filter, followed by normal column filter (region_id)

6. Replace multiple cte and view references with temporary table

In the below example, the CTE temp_sales is being referenced two times. Hence during the execution, the CTE is executed twice. Inorder to avoid this, convert the CTE into a temp table and use the temp table instead of CTE.

with temp_sales as
( select * from sales_tbl where region_id = ‘R100’ )

select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name,product_name

Union all

select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name;

The above query can be optimized as below
Here, the cte “temp_sales” is converted to temporary table “temp_sales”

Create temporary table temp_sales as
( select * from sales_tbl where region_id = ‘R100’ )

select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name, product_name

Union all

select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name;

7. Split Large queries into temp tables

If the query is too complex or resource intensive, the query can be split into multiple parts using temporary tables.

select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name,product_name

Union all

select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name;

The above query can be optimized by creating two temp tables and then doing a union of those two temp tables. Also the output of the temp table can be used multiple times in the multi-line statement without any re-computation.

The optimized query is as below

create temporary table keyboard_temp as
(
select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name, product_name

);

create temporary table laptop_temp as
(

select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name

);

select * from keyboard_temp
union all
select * from laptop_temp

I frequently write about Google Cloud Technologies and Optimization Technique. Feel free to follow me for future articles.

Remarks:

This document compliments the google cloud documentation (https://cloud.google.com/bigquery/docs/best-practices-performance-overview )

Appendix:

My other articles from Bigquery for your reference:

--

--