BigQuery — Making segments and clusters
Introduction
Google BigQuery is a powerful, serverless analytics tool offered by Google Cloud Platform (GCP) for analysing both structured and unstructured data. Because it requires no server management and can process very large volumes of data quickly and efficiently, it is well suited to working with big datasets.
BigQuery table
A BigQuery table contains records organised in rows, with columns (or fields) composing each record. Tables are defined by a schema that describes column names and data types. You can specify schemas either when creating a table or during data population.
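For example, a table with an explicit schema could be created with DDL along these lines (this is only an illustrative sketch; the dataset name mydataset and the columns are placeholders, loosely based on the employee table used later in this article):

CREATE TABLE mydataset.employee_table (
  emp_id INT64,        -- employee identifier
  name STRING,
  department STRING,
  salary NUMERIC,
  hire_date DATE
);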
Query performance in BigQuery
We retrieve data from a BigQuery table using SQL queries, but these queries often need to be optimised to meet performance and cost requirements. BigQuery offers several ways to achieve this; two widely used techniques are:
- Partitioning
- Clustering
Partitioning Tables
Partitioned tables split data into smaller parts called partitions. This makes it easier to manage and search through data. It also helps make queries faster and cheaper by only looking at the data you need. You create partitions based on a specific column in your table.
When you run a query, you can use the partition column to filter your results. This lets BigQuery skip partitions it doesn't need to read, which is called "pruning". Each partition is stored separately, and BigQuery can use partition metadata to estimate how much data a query will scan, and therefore how much it will cost, before running it.
Types of partitioning
1. Integer range based partitioning
- Integer-range partitioning divides a table based on an INTEGER column. To create this, specify:
1. Partitioning column
2. Start value (inclusive)
3. End value (exclusive)
4. Interval for each range
- Here's an example of integer range partition specifications:
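A possible DDL sketch for such a table is shown below (the dataset name is a placeholder, and the end boundary passed to GENERATE_ARRAY is 101 because the end of each range is exclusive):

CREATE TABLE mydataset.employee_table (
  emp_id INT64,
  name STRING,
  department STRING
)
-- 10-value ranges covering 1-10, 11-20, ... up to 91-100
PARTITION BY RANGE_BUCKET(emp_id, GENERATE_ARRAY(1, 101, 10));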
The table is partitioned on the emp_id column with 10-value intervals. Partitions contain values 1–10, 11–20, and so on up to 100. Values outside this range go into the __UNPARTITIONED__ partition, while NULL values go into __NULL__.
2. Time unit based partitioning
- Tables can be partitioned on DATE, TIMESTAMP, or DATETIME columns. BigQuery automatically assigns data to partitions based on these column values.
- Partition granularity options:
- TIMESTAMP / DATETIME: hourly, daily, monthly, or yearly
- DATE: daily, monthly, or yearly
- All partitions use UTC time.
- Example: Monthly partitioning on a DATETIME column:
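A possible DDL sketch (the table and column names are assumed for illustration):

CREATE TABLE mydataset.transactions (
  transaction_id INT64,
  transaction_ts DATETIME
)
-- one partition per calendar month of transaction_ts
PARTITION BY DATETIME_TRUNC(transaction_ts, MONTH);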
3. Ingestion time based partitioning
- Ingestion time partitioning automatically assigns rows to partitions based on data ingestion time. Choose hourly, daily, monthly, or yearly granularity, with partition boundaries in UTC.
- Ingestion-time partitioned tables include a _PARTITIONTIME pseudo column containing the ingestion time truncated to the partition boundary. For example, with hourly partitioning:
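A sketch of an hourly ingestion-time partitioned table and a query against it (the table and column names are assumed):

CREATE TABLE mydataset.events (
  event_name STRING,
  payload STRING
)
-- rows are assigned to hourly partitions based on when they are loaded
PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);

-- A row ingested at 2024-05-01 14:35:20 UTC falls in the 14:00 partition,
-- so its _PARTITIONTIME value is 2024-05-01 14:00:00 UTC
SELECT event_name
FROM mydataset.events
WHERE _PARTITIONTIME = TIMESTAMP '2024-05-01 14:00:00';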
Select partitioning type
Partition tables by time-unit column or ingestion time with daily, hourly, monthly, or yearly granularity.
- Daily partitioning (default): Ideal for data spread over wide date ranges or continuous additions.
- Hourly partitioning: For high-volume data spanning short periods (< 6 months).
- Monthly/yearly partitioning: For sparse daily data over wide ranges or frequent updates across many dates.
When to use Partitioning? 🤔
- Improve query performance by scanning only part of a table.
- Exceed standard table quotas by limiting operations to specific partition values.
- Estimate query costs before execution using partitioned table pruning.
- Access partition-level management features:
- Set expiration times for automatic partition deletion.
- Write to specific partitions without affecting others.
- Delete partitions without full table scans.
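For instance, assuming the monthly-partitioned transactions table sketched earlier, setting a partition expiration and deleting a single partition could look like this:

-- Automatically delete partitions older than roughly three months
ALTER TABLE mydataset.transactions
SET OPTIONS (partition_expiration_days = 90);

-- Drop one monthly partition; the filter matches whole partitions,
-- so BigQuery does not need a full table scan
DELETE FROM mydataset.transactions
WHERE DATETIME_TRUNC(transaction_ts, MONTH) = DATETIME '2024-01-01 00:00:00';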
Clustering
Clustered tables in BigQuery use clustered columns to sort data, improving query performance and reducing costs. Data is sorted into storage blocks based on the values of these columns, with block sizes adapting to the table size. Colocation occurs at the storage block level, not at the level of individual rows. Clustering also lets us take advantage of block pruning (covered below).
Clustered tables maintain sort properties during modifications. Queries scan only relevant blocks based on clustered columns, significantly reducing processed data. However, this may affect BigQuery’s ability to accurately estimate query costs beforehand.
Cluster Ordering
- The order of clustered columns significantly impacts query performance. To fully leverage clustering benefits, the query filter order must align with the clustered column order and include at least the first clustered column.
- For example: if a table is clustered by department and then salary, a query that filters on department, or on both department and salary, is optimised, but a query that filters only on salary is not (see the sketch below).
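A minimal sketch, assuming a hypothetical employee table clustered by department and then salary:

CREATE TABLE mydataset.employee_clustered (
  emp_id INT64,
  department STRING,
  salary NUMERIC
)
CLUSTER BY department, salary;

-- Optimised: the filter includes the first clustered column
SELECT emp_id
FROM mydataset.employee_clustered
WHERE department = 'Sales' AND salary > 50000;

-- Not optimised: the filter skips the first clustered column
SELECT emp_id
FROM mydataset.employee_clustered
WHERE salary > 50000;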
When to use clustering? 🤔
Clustering optimises table storage, making it an excellent first choice for enhancing query performance. Consider clustering for these key benefits:
- Tables or partitions larger than 64 MB typically see significant performance gains from clustering. Smaller datasets can be clustered, but the improvement may be minimal.
- Queries that frequently filter on specific columns benefit from clustering, as it allows the query to scan only relevant data blocks.
- For columns with high cardinality (many unique values), clustering provides BigQuery with detailed metadata, accelerating filtered queries.
- Clustering adjusts storage block sizes based on table size: larger clusters use bigger blocks, while smaller clusters use smaller ones. This prevents wasted space, as large blocks store more data efficiently and small clusters avoid unnecessarily large blocks.
Block Size!?
- In BigQuery, “block size” refers to the amount of data stored in a single “block” on disk. When using clustering, BigQuery adapts these block sizes based on the data volume in each cluster, optimising both storage and query performance.
- Imagine a table clustered by the department column, where the 'Sales' department has substantial data while 'R&D' has less.
- BigQuery might create larger blocks for 'Sales' to store its numerous rows together, reducing the total block count and improving efficiency, while for 'R&D' it might use smaller blocks due to fewer rows. This approach prevents 'R&D' blocks from occupying excess space, avoiding storage waste.
Block pruning
Clustered tables reduce query costs through block pruning. BigQuery sorts data based on clustering columns and organises it into blocks. When querying a clustered table with a filter on clustered columns, BigQuery prunes unnecessary blocks, scanning only relevant ones.
Combining partitioning and clustering
Combining table clustering with table partitioning enables finer sorting for optimised queries. Partitioned tables store data in blocks, each containing one partition, and maintain sort metadata. This improves cost estimation but increases metadata overhead as partitions grow.
Tables that are both clustered and partitioned offer the most granular sorting, as shown below:
- Here we have created integer range based partitioning on the emp_id column (start 1, end 30, interval 10) and clustering on the department column, resulting in three partitions, P0 (1–10), P1 (11–20), and P2 (21–30), each of which is also clustered, as sketched below.
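A DDL sketch for such a table (the dataset prefix is a placeholder, and the end boundary passed to GENERATE_ARRAY is 31 because the end of each range is exclusive):

CREATE TABLE mydataset.employee_table (
  emp_id INT64,
  department STRING
)
-- three 10-value partitions: P0 (1-10), P1 (11-20), P2 (21-30)
PARTITION BY RANGE_BUCKET(emp_id, GENERATE_ARRAY(1, 31, 10))
CLUSTER BY department;

The query below then benefits from both the partitioning and the clustering: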
SELECT department, COUNT(*) AS employee_count
FROM employee_table
WHERE emp_id BETWEEN 21 AND 30
GROUP BY department;
- In the above query, the condition emp_id BETWEEN 21 AND 30 limits the scan to P2 (partition 21–30) only, bypassing partitions P0 and P1. This significantly reduces the data scanned, to just one-third of the table's content.
- The query also uses GROUP BY department to tally employees in each department. Since P2 is already organised by department, BigQuery can efficiently group the rows without needing to scan them all randomly.
As a result, this query is both cost-effective and highly performant due to reduced data scanning and efficient organisation of rows.
Note: Combining partitioning and clustering can offer major performance gains, but requires careful consideration. For large, predictable datasets, benefits often outweigh drawbacks. However, smaller or unpredictable datasets may benefit from using one technique or neither, due to added complexity, cost, and potential performance issues.