Turbo-charge your Data Model with Snowflake’s Join Elimination

It's an age-old question that arises when data modeling for analytic environments: How do we strike the proper balance between flexibility and performance?

Data warehouses and other analytic data platforms often have a strategic goal of consolidating applications and workloads, thereby eliminating data silos. To achieve this, we must implement a data model that is flexible enough to support the goals of multiple applications and workloads. If we bias a model toward the access requirements of one piece of the workload, we will ultimately compromise the performance goals of other workloads. That can lead to the creation of duplicate models for those workloads, which gets us back into the data silo business. Not a good move.

The flip side is to build a model that is application-neutral and provides maximum flexibility to support the functional requirements of all applications. 3NF or Data Vault approaches are good choices here. But we still need to solve for performance. Flexible, enterprise-wide data models are often highly complex, containing hundreds of entities and relationships. Queries executing against such a model tend to require many joins and can be difficult for the typical user to navigate, even when equipped with great BI tools. And lots of joins can result in, at best, variable performance. Even with a database like Snowflake, which has a great cost-based optimizer, not doing a join is typically more efficient than doing one.

So, if we're operating in a Snowflake environment, how do we answer this question and strike the proper balance between performance and flexibility? One way is to implement a data model that is relatively application-neutral. This doesn't need to be pure 3NF, but something that can support multiple applications and multiple access paths, with some reasonable compromises to achieve performance. These compromises can include tactical or limited concessions on flexibility, such as denormalizations. Some industry data models are highly normalized and have large entities (bridge or associative entities) that exist only to satisfy the rules of 3NF but have limited business value to the consumer. We can determine whether these are necessary and remove them to simplify the model if it makes sense to do so.

We can also employ some of the internal Snowflake mechanisms to optimize query performance such as:

  • Warehouse size
  • Cluster Keys
  • Search Optimization
  • Query Acceleration
  • Materialized Views
  • Join Elimination ← NEW!

The purpose of this post is to illustrate how Join Elimination can help us take a complex data model and present a more simplified, semantic view to consumers, giving us the performance we need without compromising flexibility in our data foundation.

So, what exactly is Join Elimination? If you have a query with one or more joins, the Snowflake query optimizer can drop tables (joins) from the query plan if certain conditions are met:

  • No columns from the joined table(s) are referenced in the query except in the join condition.
  • Constraints are defined between the joined tables using the RELY property (more on this later).

Let’s take a look at some simple examples of how this works to eliminate joins and improve query performance.

Consider that we have a SALES table and an EMPLOYEE table. SALES is a fact table containing metrics for a sales event, and EMPLOYEE is a list of employee information, inclusive of the sales organization. EMPLOYEE has a primary key of EMP_ID. SALES includes EMP_ID as a foreign key.

So, we have a PK-FK relationship between EMPLOYEE and SALES, and a simple query like this will execute a join between the two tables:

SELECT Sales.*, Employee.* 
FROM Sales, Employee
WHERE Sales.EMP_ID = Employee.EMP_ID;

Now let’s look at a simple variation of the same query that only projects values from SALES.

SELECT Sales.*
FROM Sales, Employee
WHERE Sales.EMP_ID = Employee.EMP_ID;

In this case, it appears the join is unnecessary since we are not projecting any columns from EMPLOYEE or referencing its columns anywhere except the join condition. If we removed EMPLOYEE from the query, we might get the same result. Or we might not. Since this is an inner join, every SALES row must have a corresponding row in EMPLOYEE with the same EMP_ID value; any SALES rows without a match will drop out of the result.

The key to Join Elimination is to essentially give the optimizer permission to drop unnecessary tables from the query plan if we can guarantee that the integrity of the join is being maintained by other means. In this example, if we have validated via our ETL or Data Pipeline processes that we have the proper join integrity between SALES and EMPLOYEE, we can safely allow the optimizer to drop EMPLOYEE from the query. In Snowflake, we do this with constraints.

ALTER TABLE Employee ADD CONSTRAINT <constraint name> PRIMARY KEY (EMP_ID) RELY; 
ALTER TABLE Sales ADD CONSTRAINT <constraint name> FOREIGN KEY (EMP_ID)
REFERENCES Employee(EMP_ID) RELY;

The first statement defines a PRIMARY KEY for the EMPLOYEE table.

The second statement defines a FOREIGN KEY relationship between SALES and EMPLOYEE.

In Snowflake, these constraints are not enforced and have no impact on query operations. But when we add the RELY property at the end, we tell the optimizer it can safely apply Join Elimination to queries where the table referenced by the FK has no columns participating outside the join condition. Depending on table sizes, data demographics, and other factors, dropping a join can significantly improve the performance of the query.
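As an alternative to ALTER TABLE, the same constraints can be declared when the tables are created. Here is a minimal sketch; the column lists and constraint names are hypothetical, but the RELY constraint syntax follows the same pattern as above:

CREATE OR REPLACE TABLE Employee (
    EMP_ID    NUMBER       NOT NULL,
    EMP_NAME  VARCHAR(100),              -- hypothetical attribute
    CONSTRAINT pk_employee PRIMARY KEY (EMP_ID) RELY
);

CREATE OR REPLACE TABLE Sales (
    SALE_ID   NUMBER       NOT NULL,     -- hypothetical attribute
    EMP_ID    NUMBER       NOT NULL,
    AMOUNT    NUMBER(12,2),              -- hypothetical attribute
    CONSTRAINT fk_sales_employee FOREIGN KEY (EMP_ID)
        REFERENCES Employee (EMP_ID) RELY
);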

Consider the execution of the simple 2-table example provided above:

[Query profile: without Join Elimination]
[Query profile: with Join Elimination]

With Join Elimination (JE) in place, we can see that we avoid doing the join and save 5.1% of the query cost compared with the original query. While this may not seem very dramatic, the effect of Join Elimination becomes more significant as more tables are dropped from the query plan. In this example, almost 90% of the query cost was spent scanning the SALES table, which has to happen even with JE in place. As more tables are added to the query and then eliminated by JE, the scan becomes a smaller percentage of the overall query cost, and the impact of JE is magnified.
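If you want to confirm that Join Elimination actually fired for a given query, one lightweight check (a sketch, assuming the RELY constraints above are in place) is to compile the query with EXPLAIN and verify that EMPLOYEE no longer shows up in the plan:

-- Compile the query without executing it.
-- With the RELY constraints defined, EMPLOYEE should not
-- appear as a TableScan operation in the resulting plan.
EXPLAIN
SELECT Sales.*
FROM Sales, Employee
WHERE Sales.EMP_ID = Employee.EMP_ID;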

What we've illustrated in this example is the PK-FK variation of Join Elimination. Two more use cases are covered in this initial release: the self join and the left outer join.

In the self join case, Snowflake can apply transitive closure to the query so that the join can be dropped. The query

SELECT A.emp_id, B.emp_id
FROM employee_je A, employee_je B
WHERE A.emp_id = B.emp_id;

can be re-written by the Snowflake optimizer as:

SELECT A.emp_id, A.emp_id
FROM employee_je A;

The join condition A.emp_id = B.emp_id implies that we can substitute B.emp_id with A.emp_id in the projection thereby allowing Join Elimination to drop the join to table B.
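Note that this rewrite is only safe if emp_id is unique (and not null) in employee_je; otherwise the self join could multiply rows. Consistent with the PK-FK case above, a PRIMARY KEY constraint with RELY provides that guarantee:

-- pk_employee_je is a hypothetical constraint name
ALTER TABLE employee_je ADD CONSTRAINT pk_employee_je PRIMARY KEY (emp_id) RELY;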

Similarly, for a left outer-join query like:

SELECT A.*
FROM Sales A
LEFT OUTER JOIN Employee_je B
ON A.emp_id = B.emp_id;

The join is unnecessary when we have the PK constraint (with RELY) in place on Employee_je; no FK constraint on the Sales table is needed for this example. Since a left outer join preserves every row of Sales whether or not a match exists, the only risk is row duplication if Employee_je contained multiple rows per emp_id, and the PK constraint rules that out.
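For contrast, here is a sketch of a variation where the join cannot be eliminated, because a column from Employee_je is referenced outside the join condition (emp_name is a hypothetical column):

SELECT A.*, B.emp_name    -- B is now referenced in the projection
FROM Sales A
LEFT OUTER JOIN Employee_je B
ON A.emp_id = B.emp_id;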

Now that we understand how Join Elimination works, let's apply it to the data model problem discussed earlier. Where some will take a complex data model and build materialized structures on top of it for performance (to avoid joins) or to present a more simplified view of the data to consumers, Join Elimination allows us to keep the base model flexible and application-neutral without incurring significant performance penalties.

Consider a more complex (but still relatively simple) data model for retail:

[Data model diagram: a Sales fact table surrounded by six dimension tables]

Here we have a Sales fact table surrounded by six dimensions. To avoid exposing seven tables to consumers and making them worry about how to construct the joins, we could materialize a single, very wide, denormalized table. While this makes sense from a consumer point of view, there is cost and complexity associated with this approach. We are storing data twice: once in the base model, and again in the denormalized table. We also need to build and maintain the process that populates the table, which introduces latency between the denormalized table and the base model. Change management becomes more difficult as well, since any change to rows in the dimensions produces new rows in the denormalized table, which can explode storage costs even further if the change rate is high.
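For reference, the materialized approach being compared here might look something like the following sketch; all table and column names beyond Sales are hypothetical:

CREATE OR REPLACE TABLE sales_denorm AS
SELECT s.*,
       e.emp_name,                       -- hypothetical dimension columns
       p.product_name,
       st.store_name
FROM sales s
JOIN employee e ON s.emp_id = e.emp_id
JOIN product p ON s.prod_id = p.prod_id
JOIN store st ON s.store_id = st.store_id;
-- ...plus the remaining dimensions, and a pipeline to keep this table in sync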

With Join Elimination, we can replace the denormalized table with a view that has the same structure and performs all the joins, but with constraints and the RELY property specified for each relationship between Sales and the dimensions. Users can then query the view, assured that only the dimensions actually required by their query will be included as joins. For queries that do touch most of the tables in the view definition, the denormalized table may still perform better. But for queries that need only a small percentage of the tables, Join Elimination reduces the number of joins, reducing query complexity and providing better overall performance. This effectively achieves our goal of balancing a flexible, workload-neutral data model against the need to present a simplified semantic perspective to its consumers.
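A minimal sketch of this pattern, assuming hypothetical table and column names, and assuming PK and FK constraints with RELY are already defined for every Sales-to-dimension relationship:

-- All table and column names here are hypothetical
CREATE OR REPLACE VIEW sales_semantic AS
SELECT s.sale_id, s.amount,
       e.emp_name,
       p.product_name,
       st.store_name,
       c.customer_name,
       d.calendar_date,
       pr.promo_name
FROM sales s
JOIN employee e ON s.emp_id = e.emp_id
JOIN product p ON s.prod_id = p.prod_id
JOIN store st ON s.store_id = st.store_id
JOIN customer c ON s.customer_id = c.customer_id
JOIN date_dim d ON s.date_id = d.date_id
JOIN promotion pr ON s.promo_id = pr.promo_id;

-- A consumer query that touches only the fact and one dimension:
-- with the RELY constraints in place, the optimizer can drop the
-- other five joins from the plan.
SELECT emp_name, SUM(amount) AS total_sales
FROM sales_semantic
GROUP BY emp_name;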

Conclusion

Snowflake's Join Elimination is a feature that is very easy to understand and implement. When applied along with other Snowflake performance features and best practices, it can help strike that balance between flexibility and performance in our data model. This is important in creating and managing an effective data foundation that meets performance goals across a wide range of workloads and use cases while providing consumers with a simplified and intuitive view of their data, without creating additional silos. Best of all, it's now officially GA and ready to use!

You can find more information on this feature in the Snowflake documentation.

I hope you found this article useful. If you enjoyed it and would like additional info, please follow my profile or connect with me on LinkedIn.

All information and opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.


Tom Manfredi
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Scalable Data Warehouse/Lake Architectures, Snowflake, Teradata, Performance Optimization, Snowflake SnowPro Core Certification, Teradata Certified Master