Constraining the centipede

Snowflake join elimination can save you time and money. Learn how to use it to your advantage.

Constraining the centipede by Carole Boulware

Last year, Snowflake quietly released a killer feature that helps organizations dramatically improve query performance — driving down processing times and compute costs. Given that this feature is available for all Snowflake editions, including Standard, it’s shocking that many organizations have not yet tapped into its potential.

The feature in question is Join Elimination, or, as Snowflake refers to it in its documentation: Redundant/Unnecessary Join Elimination through the RELY constraint property. And that’s where we start to run into trouble. Because Snowflake does not enforce constraints (aside from NOT NULL), many users have not taken the time to declare constraints or familiarize themselves with their utility. Who creates redundant joins anyway and why should I bother with RELY?

Let’s understand how join elimination works before answering that question.

How join elimination works

Let’s take a simple query that joins table A, on the left, to table B, on the right, but only selects columns from A:

SELECT A.COL_1
FROM A
LEFT JOIN B
USING (common_column_id)

The join in this query is redundant because it costs time and warehouse credits but does not affect the result. With join elimination, Snowflake can anticipate this behavior and return the same result while avoiding the join.

But the question remains, who would write such a silly join and why bother taking the time to mitigate the inefficiency?

Enter the centipede.

The centipede, a carnivorous credit-eating insect

In the The Data Warehouse Toolkit, Ralph Kimball and Margy Ross introduce readers to the concept of a centipede fact table. The centipede is a fact table surrounded by many dimension tables that have been (overly) normalized to avoid “snowflaking” to the point where it looks like the eponymous insect.

However, the phenomenon is not limited to normalized tables. The same issue occurs in multidimensional views. In large organizations with many business areas and departments, the central fact table can have dozens of business area-specific dimensions linking to it. This is especially true when the company’s business model includes diverse product lines that are often sold together. For example, take a travel operator that offers flights, hotels, and destination activities as part of a combined reservation. Such a company would include many different business units, each specializing in a different product line and using separate hierarchies and dimensions, who all use the same fact table to track overall sales.

Due to the number of dimensions and organizational business logic required to analyze a centipede, views are created to give everyone an easy and uniform way to access their details.

However, without join elimination, selecting any column from the view results in every join in the view being performed.

Let’s find out how to tame this hundred-legged monstrosity.

Constraining the centipede

Thanks to join elimination, Snowflake users now have a RELYable way to constrain the centipede: by ensuring that constraints are set on the source tables and the RELY property is set on the constraints. The RELY property is a keyword included in the constraint definition and acts as a hint for the Snowflake optimizer to rely — or trust — the integrity of the constraint.

The RELY property must be set on the join column in both tables to ensure that the Snowflake query optimizer takes the hint and performs the join elimination. But don’t forget that PKs and FKs are not the only constraints that benefit from RELY — unique constraints (or alternate keys) can also be used. As only one PK can be specified per table, UNIQUE constraints ensure that join elimination can also be performed using non-PK/FK columns.

To demonstrate how this works, we can look at three scenarios that use this basic model as a reference:

For brevity, only the PURCHASE table is used in the examples below, but the same rules apply to CUSTOMER.

Scenario 1 — Creating a new table with constraints

Here, we create a table that contains all four Snowflake-supported constraints: NOT NULL, PK, FK, and UNIQUE. (Reminder, NOT NULL is the only constraint that Snowflake enforces.)

CREATE OR REPLACE TABLE purchase
(
purchase_id number(38,0) NOT NULL,
purchase_skey varchar(32) NOT NULL,
customer_id number(38,0) NOT NULL,
total number(38,2) NOT NULL,
CONSTRAINT pk_purchase PRIMARY KEY ( purchase_id ) RELY,
CONSTRAINT fk_purchase_by_customer FOREIGN KEY ( customer_id ) REFERENCES customer ( customer_id ) RELY,
CONSTRAINT ak_purchase UNIQUE ( purchase_skey ) RELY
);

Scenario 2 — Add a constraint to an existing table

If the tables already existed but did not have constraints defined, the following ALTERS would set things right:

ALTER TABLE purchase ADD CONSTRAINT pk_purchase PRIMARY KEY ( purchase_id ) RELY;
ALTER TABLE purchase ADD CONSTRAINT fk_purchase_by_customer FOREIGN KEY ( customer_id ) REFERENCES customer ( customer_id ) RELY;
ALTER TABLE purchase ADD CONSTRAINT ak_purchase UNIQUE ( purchase_skey ) RELY;

Scenario 3 — Alter an existing constraint to set RELY

If you’ve been following modeling best practices and declaring constraints on your tables, the following ALTER statements will enable the RELY property.

ALTER TABLE purchase ALTER CONSTRAINT pk_purchase RELY;
ALTER TABLE purchase ALTER CONSTRAINT fk_purchase_by_customer RELY;
ALTER TABLE purchase ALTER CONSTRAINT ak_purchase RELY;

Now that we know how to declare constraints with RELY, let’s see how they perform.

Flexing the performance gainz

To demonstrate the effect of RELYable constraints on a centipede reporting view, I have replicated tables from Snowflake’s TPCH 10 sample data in two schemas — one, using standard constraints and the other, using RELY. However, before highlighting the performance improvements using the RELY property, I want to highlight the fundamental business value of the constraints themselves.

In this example, we will be using only eight tables. By real-world standards, this is tiny — to extend the metaphor, more fruit fly than centipede. But, even here, creating a multidimensional view requires understanding the entities involved and their relationships when joining the dimensions to the fact table and getting meaningful results instead of nulls or cartesian products. By defining those relationships as constraints, we can instantly identify details like what represents a unique record in this table (PK, AK), what is the common link between two entities (FK), or what columns are mandatory (NOT NULL). Furthermore, constraints also allow us to generate (and share) relational diagrams that provide this information in an easily digestible format:

Now that we understand the underlying schema, we can create the structure and populate it with TPCH sample data. For those that wish to follow along, the code used to test the RELY and NORELY version of the schema is available in the GitHub link at the end of the article.

With the tables created and loaded, we construct the centipede view, which encapsulates all the join conditions and the business logic used by the various teams in our organization. A peek into the view logic reveals the typical structure of a business-facing centipede:

A centipede wrapped in a view

Now, we run a SELECT from the centipede view, which only references columns from the fact table to see what happens (current_date() is used to avoid cached results).

use schema ops_norely;

SELECT current_date(), ship_mode, sum(quantity) quantity FROM main_reporting
WHERE TRUE
group by 1,2;

If you don’t know, you know now — centipedes bite!

A SELECT from an unconstrained multidimensional view

Now, we run the same query in a schema with RELY constraints:

use schema ops_rely;

SELECT current_date, ship_mode, sum(quantity) quantity FROM main_reporting
WHERE TRUE
group by 1,2;

The result — a constrained centipede:

Execution using join elimination

The RELY schema returns the result in less than half the time. Felipe Hoffa recently reported similar findings in his article (link below). Of course, depending on the complexity of the business logic, the performance gain could have been much higher.

There has to be a catch.

RELY but verify

Not all joins that reference columns from only one table are redundant. INNER JOINS return matching records between two tables and are often used as a filter instead of returning values from two tables. Snowflake uses the following example of a redundant join in its documentation (link below):

SELECT p.product_id, f.units_sold
FROM fact_sales f, dim_products p
WHERE f.product_id = p.product_id;

In this example, it seems logical that PRODUCT_IDs in FACT_SALES must also exist in DIM_PRODUCTS. But what happens when master data exists as a subset of itself (e.g., VIP_CUSTOMERS as a subset of CUSTOMERS)? To test this scenario, I created a MAIN_REGION table as a subset of REGION (excluding REGION_ID = 4) and set the RELY property on its PK.

I then attempted a seemingly redundant inner join on LOCATION to see if the records were filtered.

The Snowflake optimizer is smart enough to recognize the situation — it performs join elimination on a LEFT JOIN but not on INNER.

But I was determined to break something and couldn’t let the matter rest. So I created a bad LOCATION table that specifically RELY-ed on REGION_MAIN as an FK.

Here, I was able to force join elimination and obtain an incorrect result:

The takeaway is that the RELY property is an effortless and nearly risk-free way to optimize query performance on Snowflake views. Unless you’re trying to trick the optimizer, as I was, you have everything to gain and nothing to lose by declaring your constraints and setting the RELY property.

Conclusion

Using the RELY property, the Snowflake optimizer can leverage relational constraints to perform join elimination to improve query cost and performance. However, to advertise this functionality as redundant join elimination would be to miss its true potential.

Even in a normalized schema, business users depend on multidimensional views to find data in a format that they understand. In large organizations with diverse product lines and business units, views provide a cohesive wrapper to encapsulate complex business rules that provide a single source of truth for company-wide reporting. Master data that is redundant to one business unit may be essential to the operation of others.

Prior to the RELY option for join elimination, there were limited options for constraining the centipede: to materialize the view or take the performance hit. But now, Snowflake users can take the (performance) bite out of this previously intractable scenario.

While this article took an extreme example of the centipede fact table, any view is guaranteed to constantly see queries that don’t select columns from the underlying tables. Through join elimination, all such queries would benefit from faster runtime and reduced credit consumption.

Declaring relational constraints has always been a modeling best practice for the operational insight they provide. Thanks to the join elimination made possible through the RELY property, defining constraints now pays literal dividends in time and warehouse credits saved.

If you enjoyed constraining the centipede and want to learn more about data modeling and discover Snowflake-native recipes to help you save more warehouse credits, my book Data Modeling for Snowflake is due out in May and is available for pre-order on Amazon.

--

--

Serge Gershkovich
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

I am the author of "Data Modeling with Snowflake" and Product Success Lead at SqlDBM. I write about data modeling and cloud cost optimization.