Partitioning and Clustering in Google BigQuery
Google BigQuery is a great tool from Google Cloud Platform.
BigQuery is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data. Use built-in ML/AI and BI for insights at scale.
One thing to consider when using BigQuery is to optimize our table storage, it can make our query faster and cheaper.
*This also apply for any other data warehouse as well.
In this article, I would like to share about the comparison of bytes processed between normal table (base), partitioned table, clustered table and combination of partition/cluster table.
What is partitioned and clustered table?
Partitioned table is like divide our large table into smaller tables by a partition column.
Read more about partitioned table in this LINK.
Clustered table is like sorting data in the storage by cluster columns’ value, all records are still in the same table.
Read more about clustered table in this LINK.
Combination of partitioned and clustered table is like storage data in many smaller tables while sorted data inside those smaller tables.
Base vs Partition vs Cluster
In order to compare bytes processed. I will use avocado dataset that I got from Google Business Intelligence Professional Certificate’s second course “The Path to Insights: Data Models and Pipelines”.
First, this is a base table with data from 2015 until 2021 as picture below.
Then I created partitioned and clustered tables with these statements.
-- For Partition Table
CREATE TABLE
`project-id.dataset-id.partition`
PARTITION BY
RANGE_BUCKET(year, GENERATE_ARRAY(2015,2022,1))
AS
(SELECT * FROM `project-id.dataset-id.base`)
-- For Cluster Table
CREATE TABLE
`project-id.dataset-id.cluster`
CLUSTER BY
year
AS
(SELECT * FROM `project-id.dataset-id.base`)
After created tables, I will use this query to see how much byte processed.
SELECT
*
FROM
`project-id.dataset-id.table-name`
WHERE
year > 2015
AND
type = 'Organic'
AND
region = 'West'
And this is the result of the queries.
In this case, the partitioned table reduced bytes processed a lot, by scan only the relevant partition, while clustered table could not reduce.
So, I tried query data of year 2022 which doesn’t exist in tables.
As you can see, when queried data that doesn’t exist in tables, the base table scanned the whole table while the clustered table didn’t scan.
One thing to keep in mind is that the clustered table cannot tell us exactly how much byte will really process before query.
To summarize.
Combination of Partition and Cluster
From the previous section, we found that partitioned and clustered table can help reduce bytes processed in our query, in one way or another...
What about combining both partition and cluster in one table?
So, in this section, I created tables with combination of partitioned and various cluster columns with statements below.
-- Partition by year, Cluster by region
CREATE TABLE
`project-id.dataset-id.partition_1cluster_region`
PARTITION BY
RANGE_BUCKET(year, GENERATE_ARRAY(2015,2022,1))
CLUSTER BY
region
AS
(SELECT * FROM `project-id.dataset-id.base`)
-- Partition by year, Cluster by type
CREATE TABLE
`project-id.dataset-id.partition_1cluster_type`
PARTITION BY
RANGE_BUCKET(year, GENERATE_ARRAY(2015,2022,1))
CLUSTER BY
type
AS
(SELECT * FROM `project-id.dataset-id.base`)
-- Partition by year, Cluster by type and region
CREATE TABLE
`project-id.dataset-id.partition_2cluster`
PARTITION BY
RANGE_BUCKET(year, GENERATE_ARRAY(2015,2022,1))
CLUSTER BY
type, region
AS
(SELECT * FROM `project-id.dataset-id.base`)
And this is the result of bytes processed using the same query.
As you can see, every combination tables reduced bytes processed into 2.XX MB, while the partitioned-only table which consume 3.77 MB.
And the funny thing is that the table with 2 cluster columns didn’t have the lowest bytes processed, but the highest instead.
To summarize everything.
As we can see from the chart, combining of partitioning and clustering can reduce the bytes processed almost 40-50%.
Hope this article help you see a better picture of the benefits of partitioned and clustered table.