How to help Snowflake eliminate your redundant joins with constraints

Redundant join elimination is a new feature that helps some SQL queries run way faster. To get these benefits, you will need to give some hints to Snowflake. Let’s figure out these constraints here.

Image generated by AI

A quick review of redundant join elimination

Sometimes people (and tools) write joins in SQL that end up being redundant. If eliminated these queries could run significantly faster, saving time and complexity. Snowflake is ready to do this automatically, but to do this safely Snowflake might need some extra information from you.

For example, let’s look at this query based on the tables of TPC-H-SF10:

select l_orderkey, sum(l_extendedprice*l_quantity), current_timestamp()
from lineitem_10
join orders_10 on l_orderkey = o_orderkey
group by 1
order by 2 desc
limit 10

You can see that there’s a join there to ensure that every row in lineitem belongs to an order in orders. If you could guarantee that’s already the case, then Snowflake could eliminate that join automatically, and run the query in half the time.

So how can you tell Snowflake that the constraints within your tables and data can be trusted?

That’s the experiment we’ll run on the rest of this post.

Image generated by AI

Setting up data

First let’s copy data out of TPC-H-SF10 into our own schema — so we can play with its constraints. Here’s how to copy the tables lineitem and orders. I added a order by as a usual practice to make sure the data is well micro partitioned.

create table orders_10
as select *
from snowflake_sample_data.tpch_sf10.orders
order by o_orderdate
-- 8.9s S
;
create table lineitem_10
as select *
from snowflake_sample_data.tpch_sf10.lineitem
order by l_shipdate
-- 24s S
;

Now let’s run our baseline query:

select l_orderkey, sum(l_extendedprice*l_quantity)
, current_timestamp() -- avoids using the query results cache
from lineitem_10
join orders_10 on l_orderkey = o_orderkey
group by 1
order by 2 desc
limit 10
-- 5.2s
;
The execution graph shows a join between tables

5.2 seconds on a Small warehouse — now we can help Snowflake optimize it by adding constraints.

Adding constraints

Constraints like primary and foreign keys are common in the OLTP database world. You can set them in Snowflake, but most people don’t bother as Snowflake doesn’t enforce them nor uses them to optimize. That’s changing, as with the upcoming hybrid tables you’ll get enforcement of primary keys — and with redundant join elimination, Snowflake can use your constraints to optimize.

This is how we can add a primary key to the table orders, and then we can add a foreign key constraint to lineitem to show Snowflake the relationship between these tables:

alter table orders_10 
add primary key (o_orderkey)
;
alter table lineitem_10
add constraint lineitem_fkey_orders foreign key (l_orderkey)
references orders_10 (o_orderkey)
;

And that could be enough, but it turns out it’s not. Run the query again, and you’ll find the same execution graph:

The execution graph still shows a join between tables

Making sure Snowflake can rely on the constraints

So why did the constraints between these tables didn’t help Snowflake optimize this join? Turns out you also need to tell Snowflake it can rely on these constraints. To do so:

alter table lineitem_10 alter constraint lineitem_fkey_orders rely;
alter table orders_10 alter primary key rely;

That’s it: Now our query runs in less than half of the time, with a much simpler execution graph:

The execution graph shows the join has been automatically eliminated for faster results

If you want to check the state of your constraints — you can find that information on your information_schema:

select constraint_name, table_name, constraint_type, enforced, rely
from information_schema.table_constraints
where table_name IN ('LINEITEM_10', 'ORDERS_10')
;
Existing constrains in these tables, including being “reliable”

Additional notes

By the way, our base query:

select l_orderkey, sum(l_extendedprice*l_quantity)
, current_timestamp()
from lineitem_10
join orders_10 on l_orderkey = o_orderkey
group by 1
order by 2 desc
limit 10

Is very similar in intentions to one like:

select l_orderkey, sum(l_extendedprice*l_quantity)
, current_timestamp()
from lineitem_10
where l_orderkey in (select o_orderkey from orders_10)
group by 1
order by 2 desc
limit 10

But Snowflake did not eliminate that redundant semi-join in my experiments. Don’t worry too much about it: The team is aware and working on it.

Next steps

  • Introduce data quality testing into your project, to make sure unique and primary keys restrictions are reliable for the query optimizer.
  • If you are using dbt, the package constraints can help with setting these constraints.

although Snowflake doesn’t enforce most constraints, the query optimizer can consider primary key, unique key, and foreign key constraints during query rewrite if the constraint is set to RELY. Since dbt can test that the data in the table complies with the constraints, this package creates constraints on Snowflake with the RELY property to improve query performance

Image generated by AI

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

--

--

Felipe Hoffa
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.