BigQuery — Making segments and clusters

Aakash Rajput
TechVerito
Published in
6 min readDec 3, 2024
Photo by Ryan Quintal on Unsplash

Introduction

Google BigQuery is a powerful tool offered by Google Cloud Platform (GCP). It helps users analyse both structured and unstructured data. BigQuery doesn’t require server management, which makes it easy to use. It can also handle very large amounts of data quickly and efficiently. This makes BigQuery great for working with big datasets.

BigQuery table

A BigQuery table contains records organised in rows, with columns (or fields) composing each record. Tables are defined by a schema that describes column names and data types. You can specify schemas either when creating a table or during data population.

Query performance in BigQuery

When we want to retrieve data from a BigQuery table, we can use SQL queries. However, sometimes we need to optimise these queries to meet specific requirements. BigQuery offers various ways to achieve this optimisation. Some widely used techniques are

  • Partitioning
  • Clustering

Partitioning Tables

Partitioned tables split data into smaller parts called partitions. This makes it easier to manage and search through data. It also helps make queries faster and cheaper by only looking at the data you need. You create partitions based on a specific column in your table.

When you run a query, you can use the partition column to filter your results. This lets BigQuery skip over data it doesn’t need to look at, which is called “pruning”. Each partition is stored separately. BigQuery uses information about how the data is sorted to figure out how much a query will cost before running it.

Partitions made based on some column criteria

Types of partitioning

  1. Integer based partitioning
  • Integer-range partitioning divides a table based on an INTEGER column. To create this, specify:
    1. Partitioning column
    2. Start value (inclusive)
    3. End value (exclusive)
    4. Interval for each range
  • Here’s an example of integer range partition specifications:
Partitions made on Employee table on emp_id column

The table is partitioned on the emp_id column with 10-value intervals. Partitions contain values 1–10, 11–20, and so on up to 100. Values outside this range go into __UNPARTITIONED__, while NULL values go into __NULL__.

2. Time unit based partitioning

  • Tables can be partitioned on DATE, TIMESTAMP, or DATETIME columns. BigQuery automatically assigns data to partitions based on these column values.
  • Partition granularity options:
    - TIMESTAMP/DATETIME: hourly, daily, monthly, or yearly
    - DATE: daily, monthly, or yearly
  • All partitions use UTC time.
  • Example: Monthly partitioning on DATETIME column:
Partitions created on monthly basis as per datetime

3. Ingestion time based partitioning

  • Ingestion time partitioning automatically assigns rows to partitions based on data ingestion time. Choose hourly, daily, monthly, or yearly granularity, with partition boundaries in UTC.
  • Ingestion-time partitioned tables include a _PARTITIONTIME pseudo column, containing ingestion time truncated to the partition boundary. For example, with hourly partitioning:

Select partitioning type

Partition tables by time-unit column or ingestion time with daily, hourly, monthly, or yearly granularity.

  • Daily partitioning (default): Ideal for data spread over wide date ranges or continuous additions.
  • Hourly partitioning: For high-volume data spanning short periods (< 6 months).
  • Monthly/yearly partitioning: For sparse daily data over wide ranges or frequent updates across many dates.

When to use Partitioning? 🤔

  • Improve query performance by scanning only part of a table.
  • Exceed standard table quotas by limiting operations to specific partition values.
  • Estimate query costs before execution using partitioned table pruning.
  • Access partition-level management features:
    - Set expiration times for automatic partition deletion.
    - Write to specific partitions without affecting others.
    - Delete partitions without full table scans.

Clustering

Clustered tables in BigQuery use clustered columns to sort data, improving query performance and reducing costs. These columns sort storage blocks based on their values, with block sizes adapting to the table size. Colocation occurs at the storage block level, not individual rows. By using clustering we can also leverage ‘Block pruning’ (mentioned below)

Clustered tables maintain sort properties during modifications. Queries scan only relevant blocks based on clustered columns, significantly reducing processed data. However, this may affect BigQuery’s ability to accurately estimate query costs beforehand.

Cluster Ordering

  • The order of clustered columns significantly impacts query performance. To fully leverage clustering benefits, the query filter order must align with the clustered column order and include at least the first clustered column.
  • For example: If we have clustered a table by department and salary, a query that filters with department or department and salary is optimised but only with salary is not optimised

When to use clustering? 🤔

Clustering optimises table storage, making it an excellent first choice for enhancing query performance. Consider clustering for these key benefits:

  • Tables or partitions larger than 64 MB typically see significant performance gains from clustering. Smaller datasets can be clustered, but the improvement may be minimal.
  • Queries that frequently filter on specific columns benefit from clustering, as it allows the query to scan only relevant data blocks.
  • For columns with high cardinality (many unique values), clustering provides BigQuery with detailed metadata, accelerating filtered queries.
  • Clustering adjusts storage block sizes based on table size. Larger clusters use bigger blocks, smaller ones use smaller blocks. This prevents wasted space, as large blocks store more data efficiently and small clusters avoid unnecessary large blocks.

Block Size!?

  • In BigQuery, “block size” refers to the amount of data stored in a single “block” on disk. When using clustering, BigQuery adapts these block sizes based on the data volume in each cluster, optimising both storage and query performance.
  • Imagine a table clustered by the department column, where the ‘Sales’ department has substantial data while ‘R&D’ has less.
  • BigQuery might create larger blocks for ‘Sales’ to store its numerous rows together, reducing the total block count and improving efficiency while for ‘R&D’ it might use smaller blocks due to fewer rows. This approach prevents ‘R&D’ blocks from occupying excess space, avoiding storage waste.

Block pruning

Clustered tables reduce query costs through block pruning. BigQuery sorts data based on clustering columns and organises it into blocks. When querying a clustered table with a filter on clustered columns, BigQuery prunes unnecessary blocks, scanning only relevant ones.

Combining partitioning and clustering

Combining table clustering with table partitioning enables finer sorting for optimised queries. Partitioned tables store data in blocks, each containing one partition, and maintain sort metadata. This improves cost estimation but increases metadata overhead as partitions grow.

Tables that are both clustered and partitioned offer the most granular sorting, as shown below:

Partitioning (emp_id) and clustering (department) used together
  • Here we have created integer based partitioning on emp_id column (start — 1, end — 30, interval — 10) and clustering on department column resulting in three partitions which are also clustered.
SELECT department, COUNT(*) AS employee_count
FROM employee_table
WHERE emp_id BETWEEN 21 AND 30
GROUP BY department;
  • In the above query, the condition emp_id BETWEEN 21 AND 30, which limits the scan to P2 (Partition 21-30) only, bypassing partitions P0 and P1. This approach significantly reduces the data scanned to just one-third of the table's content.
  • Also query uses GROUP BY department to tally employees in each department. Since P2 is already organised by department, BigQuery can efficiently group the rows without needing to scan them all randomly.

As a result, this query is both cost-effective and highly performant due to reduced data scanning and efficient organisation of rows.

Note: Combining partitioning and clustering can offer major performance gains, but requires careful consideration. For large, predictable datasets, benefits often outweigh drawbacks. However, smaller or unpredictable datasets may benefit from using one technique or neither, due to added complexity, cost, and potential performance issues.

References:

--

--

No responses yet