Long Running Query Optimization Guide for Bigquery
Introduction
Sometimes, queries run longer than you would like. In general, queries that do less work perform better. They run faster and consume fewer resources, which can create lower costs and fewer failures.
This document provides an overview of optimization techniques that can improve query performance in BigQuery.
Optimization Techniques:
- Filter Data before Join
Example: Finding the aggregate sales for a product on a particular region
select sum(sales),region_name, product_name from sales_tbl sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(region_name)=’APAC’ and upper(product_name) =’KEYBOARD’
group by region_name, product_name;
The above query can be optimized by applying the filter to the region and product table first using a sub-query , followed by joining with the sales table.
The optimized query is as follows:
select sum(sales),region_name, product_name from sales_tbl sales
inner join (select * from region_tbl region where upper(region_name)=’APAC’ ) region
on sales.region_id = region.region_id
inner join (select * from product_tbl product where upper(product_name) =’KEYBOARD’ ) product
on sales.product_id = product.product_id
group by region_name, product_name
Another scenario is to remove duplicates before join
Example: region_name from the sales table contains the duplicate values, resulting in the cross join.
select product_name from product_tbl product
inner join sales_tbl sales
on product.region_name = sales.region_name;
In order to avoid this , the query can be optimized by removing the duplicates from the sales table
The optimized query is as follows:
select product_name from product_tbl product
inner join (select region_name from sales_tbl group by region_name) sales
on product.region_name = sales.region_name;
2. Select Only Needed Columns
From the above example (Point 1) ,the query can be further optimized by replacing “ select * “ with “ select <column_list> “
select sum(sales),region_name, product_name from sales_tbl sales
inner join (select region_name,region_id from region_tbl region where upper(region_name)=’APAC’ ) region
on sales.region_id = region.region_id
inner join (select product_name,product_id from product_tbl product where upper(product_name) =’KEYBOARD’ ) product
on sales.product_id = product.product_id
group by region_name,product_name
3. Replace Distinct with Group_by
Example: Finding the distinct products from APAC region
select distinct product.product_name from region_tbl region
inner join product_tbl product
on region.region_id = product.region_id
where region_name = ‘APAC’
The above query can be optimized by replacing distinct with group by
select product.product_name from region_tbl region
inner join product_tbl product
on region.region_id = product.region_id
where region_name = ‘APAC’
group by product_name
4. Put the large table as the first table in the join clause
From the above example (Point 3), the Product table is the largest table compared to the Region table. Hence the query can be further optimized by changing the order of tables in the join clause as well in the join condition.
select product.product_name from product_tbl product
inner join region_tbl region
on product.region_id = region.region_id
where region_name = ‘APAC’
group by product_name
5. Use Partition, Clustering Columns first in Filter/Join Clause
Assume that the sales table is partitioned by sales_date and clustered by product_id.
select sum(sales), product_name
from sales
inner join product_tbl product
on sales.region_id =’APAC’
and sales.product_id = product.product_id
and on sales.sales_date = (select date from current_date_tbl )
group by region_name, product_name
The above query can be optimized as below
select sum(sales), product_name
from sales
inner join product_tbl product
on sales.sales_date = (select date from current_date_tbl )
and sales.product_id = product.product_id
and sales.region_id =’APAC’
group by region_name, product_name
As we can see, partition_column (sales_date) filter is applied first, followed by clustering column (product_id) filter, followed by normal column filter (region_id)
6. Replace multiple cte and view references with temporary table
In the below example, the CTE temp_sales is being referenced two times. Hence during the execution, the CTE is executed twice. Inorder to avoid this, convert the CTE into a temp table and use the temp table instead of CTE.
with temp_sales as
( select * from sales_tbl where region_id = ‘R100’ )select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name,product_nameUnion all
select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name;
The above query can be optimized as below
Here, the cte “temp_sales” is converted to temporary table “temp_sales”
Create temporary table temp_sales as
( select * from sales_tbl where region_id = ‘R100’ )select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name, product_nameUnion all
select sum(sales),region_name, product_name
from temp_sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name;
7. Split Large queries into temp tables
If the query is too complex or resource intensive, the query can be split into multiple parts using temporary tables.
select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name,product_nameUnion all
select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name;
The above query can be optimized by creating two temp tables and then doing a union of those two temp tables. Also the output of the temp table can be used multiple times in the multi-line statement without any re-computation.
The optimized query is as below
create temporary table keyboard_temp as
(
select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’KEYBOARD’
group by region_name, product_name);
create temporary table laptop_temp as
(select sum(sales),region_name, product_name
from sales
inner join region_tbl region
on sales.region_id = region.region_id
inner join product_tbl product
on sales.product_id = product.product_id
where upper(product_name) =’LAPTOP’
group by region_name, product_name);
select * from keyboard_temp
union all
select * from laptop_temp
I frequently write about Google Cloud Technologies and Optimization Technique. Feel free to follow me for future articles.
Remarks:
This document compliments the google cloud documentation (https://cloud.google.com/bigquery/docs/best-practices-performance-overview )
Appendix:
My other articles from Bigquery for your reference: