How to help Snowflake eliminate your redundant joins with constraints
Redundant join elimination is a new feature that helps some SQL queries run way faster. To get these benefits, you will need to give some hints to Snowflake. Let’s figure out these constraints here.
--
A quick review of redundant join elimination
Sometimes people (and tools) write joins in SQL that end up being redundant. If eliminated these queries could run significantly faster, saving time and complexity. Snowflake is ready to do this automatically, but to do this safely Snowflake might need some extra information from you.
For example, let’s look at this query based on the tables of TPC-H-SF10:
select l_orderkey, sum(l_extendedprice*l_quantity), current_timestamp()
from lineitem_10
join orders_10 on l_orderkey = o_orderkey
group by 1
order by 2 desc
limit 10
You can see that there’s a join
there to ensure that every row in lineitem
belongs to an order in orders
. If you could guarantee that’s already the case, then Snowflake could eliminate that join automatically, and run the query in half the time.
So how can you tell Snowflake that the constraints within your tables and data can be trusted?
That’s the experiment we’ll run on the rest of this post.
Setting up data
First let’s copy data out of TPC-H-SF10 into our own schema — so we can play with its constraints. Here’s how to copy the tables lineitem
and orders
. I added a order by
as a usual practice to make sure the data is well micro partitioned.
create table orders_10
as select *
from snowflake_sample_data.tpch_sf10.orders
order by o_orderdate
-- 8.9s S
;
create table lineitem_10
as select *
from snowflake_sample_data.tpch_sf10.lineitem
order by l_shipdate
-- 24s S
;
Now let’s run our baseline query:
select l_orderkey, sum(l_extendedprice*l_quantity)
, current_timestamp() -- avoids using the query results cache
from lineitem_10
join orders_10 on l_orderkey = o_orderkey
group by 1
order by 2 desc
limit 10
-- 5.2s
;
5.2 seconds on a Small warehouse — now we can help Snowflake optimize it by adding constraints.
Adding constraints
Constraints like primary and foreign keys are common in the OLTP database world. You can set them in Snowflake, but most people don’t bother as Snowflake doesn’t enforce them nor uses them to optimize. That’s changing, as with the upcoming hybrid tables you’ll get enforcement of primary keys — and with redundant join elimination, Snowflake can use your constraints to optimize.
This is how we can add a primary key to the table orders
, and then we can add a foreign key constraint to lineitem
to show Snowflake the relationship between these tables:
alter table orders_10
add primary key (o_orderkey)
;
alter table lineitem_10
add constraint lineitem_fkey_orders foreign key (l_orderkey)
references orders_10 (o_orderkey)
;
And that could be enough, but it turns out it’s not. Run the query again, and you’ll find the same execution graph:
Making sure Snowflake can rely on the constraints
So why did the constraints between these tables didn’t help Snowflake optimize this join? Turns out you also need to tell Snowflake it can rely
on these constraints. To do so:
alter table lineitem_10 alter constraint lineitem_fkey_orders rely;
alter table orders_10 alter primary key rely;
That’s it: Now our query runs in less than half of the time, with a much simpler execution graph:
If you want to check the state of your constraints — you can find that information on your information_schema
:
select constraint_name, table_name, constraint_type, enforced, rely
from information_schema.table_constraints
where table_name IN ('LINEITEM_10', 'ORDERS_10')
;
Additional notes
By the way, our base query:
select l_orderkey, sum(l_extendedprice*l_quantity)
, current_timestamp()
from lineitem_10
join orders_10 on l_orderkey = o_orderkey
group by 1
order by 2 desc
limit 10
Is very similar in intentions to one like:
select l_orderkey, sum(l_extendedprice*l_quantity)
, current_timestamp()
from lineitem_10
where l_orderkey in (select o_orderkey from orders_10)
group by 1
order by 2 desc
limit 10
But Snowflake did not eliminate that redundant semi-join in my experiments. Don’t worry too much about it: The team is aware and working on it.
Next steps
- Investigate how constraints can help in your queries — and share your results with.
- Read “Constraining the centipede” by Serge Gershkovich.
- Introduce data quality testing into your project, to make sure
unique
andprimary keys
restrictions are reliable for the query optimizer. - If you are using dbt, the package constraints can help with setting these constraints.
although Snowflake doesn’t enforce most constraints, the query optimizer can consider primary key, unique key, and foreign key constraints during query rewrite if the constraint is set to RELY. Since dbt can test that the data in the table complies with the constraints, this package creates constraints on Snowflake with the RELY property to improve query performance
- Check out the docs for join elimination to learn more — which include examples for “Eliminating an Unnecessary Left Outer Join”, “Eliminating an Unnecessary Self-Join”, and “Eliminating an Unnecessary Join on a Primary Key and Foreign Key”.
Want more?
- Try this out with a Snowflake free trial account — you only need an email address to get started.
I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.