GOOGLE BIGQUERY CONSTRAINTS FOR JOIN OPTIMIZATION STRATEGY

PRABHUKKARTHI STB
Google Cloud - Community
7 min readOct 10, 2023

Google BigQuery recently launched the constraints namely Primary Key (PK) and Foreign Key (FK) as a un-enforced constraints as of now. This blog is about unlocking the true potential of these BigQuery constraints additions in Google Cloud and how to effectively utilize them to gain the real benefits in your data pipelines!

DEPICTION OF BIGQUERY OPTIMIZED EXECUTION GRAPH | Source: Self

To begin, I’ll provide a brief overview of table constraints. Constraints are basically applied to the columns of a table which are required in a database to technically maintain the integrity, correctness, and completeness of the data, apart from domain-level constraints applied at the business level. Table Constraints can be classified into Unique, Not Null, Default, Check, Primary Key and Foreign Key constraints.

BigQuery is an analytical data warehouse that adheres to denormalized schemas. It is primarily designed to handle large volumes of data for analytical insights that may not require 100% accuracy compared to OLTP databases, therefore it does not require these constraints. But BigQuery’s recent launch of two non-enforced constraints: Primary Key (unique not null columns of the table) and Foreign Key (column that references the Unique or Primary Key column of another table as a referential constraint) left me with a sense of surprise!

When I dive deep into it, I discovered that the real reason behind this is not to introduce traditional constraint concepts into BigQuery. Instead, Google is mainly leverage these familiar traditional SQL concepts and extends their BigQuery optimization approach. In this context, the BigQuery Optimizer is prompted by these constraints to create an optimized query execution graphs/plans, particularly when CONSTRAINTS (PK/FK) guarantee unique rows, which helps optimizing JOINS and reduces the compute cost where JOIN is one of the most costly operations in the SQL world! These constraint features support both new tables and existing tables as well.

Enforced constraints will throw an error whenever a constraint violation occurs based on new data.

Un-enforced or non-enforced constraints mean that the table will not throw any errors during constraints violations. Instead, the user should take the necessary steps to ensure that the constraints are maintained while loading the data.

3 BigQuery Optimization Techniques for Join Operations using Constraints Strategy

  1. INNER JOIN ELIMINATION
  2. OUTER JOIN (LEFT, RIGHT) ELIMINATION
  3. JOIN REORDERING

We will take a detailed look at each of them. BigQuery Public domain data will be utilized in this blog to explain each of the techniques mentioned above.

--Using these tables directly to show the results before applying constraints
bigquery-public-data.baseball.games_wide - FACT Table
bigquery-public-data.baseball.schedules - DIMENSION Table
bigquery-public-data.baseball.games_post_wide - FACT Table

--copied above tables to my local project as well to show the results after applying constraints
bq_poc.games_wide_cons - FACT Table (bigquery-public-data.baseball.games_wide)
bq_poc.schedules_cons - DIMENSION Table (bigquery-public-data.baseball.schedules)
bq_poc.games_post_wide_cons - FACT Table (bigquery-public-data.baseball.games_post_wide)

--Alter the local tables to create Primary Key & Foregin Key constraints
ALTER TABLE bq_poc.schedules_cons ADD PRIMARY KEY (gameid) NOT ENFORCED;

ALTER TABLE bq_poc.games_wide_cons ADD PRIMARY KEY (gameid) NOT ENFORCED,
ADD FOREIGN KEY (gameid) REFERENCES bq_poc.schedules_cons(gameid) NOT ENFORCED;

ALTER TABLE bq_poc.games_post_wide_cons ADD PRIMARY KEY (gameid) NOT ENFORCED,
ADD FOREIGN KEY (gameid) REFERENCES bq_poc.schedules_cons(gameid) NOT ENFORCED;

1. INNER JOIN ELIMINATION TECHNIQUE

This technique is mainly for the INNER JOIN operation performing queries, which can be fine tuned by prompting the optimizer to eliminate the INNER JOIN Step itself in the Execution Graph/Plan if it meets any of the following conditions:

  1. You can select all the columns or only specific columns from your first table if you follow the Broadcast join principle, where a big table is first followed by a smaller table as the second.
  2. You can select all the columns or only specific columns from your second table if you follow the reverse broadcast join principle, where a small table is first followed by a bigger table as the second.

Example:

Before constraints applied:

SELECT
FACT.*
FROM
`bigquery-public-data.baseball.games_wide` FACT
INNER JOIN
`bigquery-public-data.baseball.schedules` DIM
ON
FACT.gameId = DIM.gameiD;
EXECUTION GRAPH WITH JOIN STEP BEFORE CONSTRAINTS | Source: Self

After constraints applied:

SELECT
FACT.*
FROM
`bq_poc.games_wide_cons` FACT
INNER JOIN
`bq_poc.schedules_cons` DIM
ON
FACT.gameId = DIM.gameiD;
EXECUTION GRAPH WITH JOIN STEP ELIMINATION AFTER CONSTRAINTS | Source: Self

2. OUTER JOIN (LEFT, RIGHT) ELIMINATION TECHNIQUE

This technique is mainly for the OUTER JOIN (Left, Right) operation performing queries which can be fine tuned by prompting the optimizer to eliminate the OUTER JOIN Step itself in the Execution Graph/Plan if it meets the following condition:

  1. For a LEFT OUTER JOIN, it’s possible to select all or specific columns from your LEFT table, and the join keys for the RIGHT table must be unique.
  2. For a RIGHT OUTER JOIN, it’s possible to select all or specific columns from your RIGHT table, and the join keys for the LEFT table must be unique.

Before constraints applied:

SELECT
DIM.*
FROM
`bigquery-public-data.baseball.games_wide` FACT
RIGHT OUTER JOIN
`bigquery-public-data.baseball.schedules` DIM
ON
FACT.gameId = DIM.gameiD;
EXECUTION GRAPH WITH JOIN STEP BEFORE CONSTRAINTS | Source: Self

After constraints applied:

SELECT
DIM.*
FROM
`bq_poc.games_wide_cons` FACT
RIGHT OUTER JOIN
`bq_poc.schedules_cons` DIM
ON
FACT.gameId = DIM.gameiD;
EXECUTION GRAPH WITH JOIN STEP ELIMINATION AFTER CONSTRAINTS | Source: Self

Note: The number of output records in the before and after constraints are different as before constraints result have duplicate records whereas after constraints result have only unique records. Because constraints are in place which helps BigQuery Optimizer to automatically removes duplicate records created due to FACT table and gives only the unique gameid from DIM table as its an RIGHT OUTER JOIN.

For FULL OUTER JOIN, there is currently no optimization technique available through constraints strategy.

3. JOIN REORDERING TECHNIQUE

This optimized technique will be able to achieve by having constraints in place even if you couldn’t eliminate the JOIN step as per the above two techniques as your requirements does not satisfy those conditions. By inferring join cardinality, these constraints enable the optimizer to have better optimized join reordering, which improves other parameters such as Elapsed Time, Slot time consumed, and minimum Bytes Shuffled.

For the example below, I require all the fields from both dimension and fact tables in a RIGHT OUTER JOIN operations. Despite not meeting the RIGHT OUTER JOIN elimination condition, Join Reordering helps optimization of other parameters of the query which reduces the cost a lot.

Before constraints applied:

SELECT
DIM.*, FACT.*
FROM
`bigquery-public-data.baseball.games_wide` FACT
RIGHT OUTER JOIN
`bigquery-public-data.baseball.schedules` DIM
ON
FACT.gameId = DIM.gameiD;
ELAPSED TIME, SLOT TIME, BYTES SHUFFLED — BEFORE CONSTRAINTS | Source: Self

After constraints applied:

SELECT
DIM.*, FACT.*
FROM
`bq_poc.games_wide_cons` FACT
RIGHT OUTER JOIN
`bq_poc.schedules_cons` DIM
ON
FACT.gameId = DIM.gameiD;
OPTIMIZED ELAPSED TIME, SLOT TIME, BYTES SHUFFLED — AFTER CONSTRAINTS |Source: Self

Note: There may be only a minor difference observed in this example as it depends on the volume and cardinality of the data involved.

Practical usage of these optimized techniques

In Dimensional modeling data warehouses, we create a single denormalized large view by joining many fact and dimensional tables in most scenarios to simplify usage. These views are frequently utilized in downstream applications like web APIs, dashboards and ML models where each query requests selects only the required columns based on the business requirements. These constraints help to prioritize the required tables based on column selection and avoid all other joins in those scenarios.

In the below example, though the query has three tables, downstream is only interested in the first fact table with specific columns. As a result, the constraints played a significant role and eliminated all the join conditions as shown below:

SELECT
FACT.gameiD, FACT.seasonid, FACT.seasontype
FROM
`bq_poc.games_wide_cons` FACT
LEFT OUTER JOIN
`bq_poc.schedules_cons` DIM
ON
FACT.gameId = DIM.gameiD
LEFT OUTER JOIN
`bq_poc.games_post_wide_cons` DIM2
ON
FACT.gameId = DIM2.gameiD ;
EXECUTION GRAPH WITH JOIN STEP ELIMINATION AFTER CONSTRAINTS APPLIED IN A MULTI JOIN QUERY Source: Self

As BigQuery constraint is currently un-enforced by default, if your source table columns violate the Primary key and Foreign key constraints, then optimization will not work properly, leading to incorrect results and execution graph/plan. Therefore, ensuring that you maintain the constraints while loading data to your tables by handling it in your ETL workloads is crucial.

Conclusion

Join Optimized queries can be achieved by simply applying PK and FK constraints to the data tables and maintaining the constraints without violations. I believe that these advanced optimization strategies are quite beneficial for composing fine-tune queries by prompting the optimizer to simplify join operations in BigQuery Execution Graph/Plan. This is a significant advancement to reduce the majority of compute cost as joins are always costly and cannot be avoided in any SQL!

By introducing these constraints as un-enforced, Google BigQuery the NEXTGEN denormalized analytical data warehouse retains its significance and expanding itself for new innovation in join optimization strategies rather than adopting traditional database concepts as an enforced constraints, which is a thoughtful approach from Google!

Note: I manually wrote all of these content! It is not produced by any AI tool.

References

PFB official Google Cloud blog on this topic:

--

--

PRABHUKKARTHI STB
Google Cloud - Community

Senior AVP - GCP Business Consulting at HSBC || Google Cloud Certified Professional Data Engineer || Master of Technology — Data Analytics From BITS Pilani