Inside Snowflake — Optimizing the Query Optimizer: Redundant Semi-Joins and Anti-Joins

Shruti Sekaran is one of the amazing interns I met at Snowflake during the summer of 2022. I was immediately impressed by her goal: Improving the performance of some specific SQL patterns in Snowflake by “eliminating redundant semi-joins and anti-joins from queries”. Check this post for some SQL optimization tricks, and why you might not need them.

Felipe Hoffa
Published in Snowflake
7 min read · Aug 23, 2022


Shruti and Felipe discussing query optimization at the Snowflake office

SQL quiz: Optimize this

Before we go deeper, check the following query. It scans 6 billion rows, while performing two semi-joins with sub-queries that go over 1.4 billion rows each. How would you optimize it?

select sum(l_discount), current_time()
from snowflake_sample_data.tpch_sf1000.lineitem -- 6b rows
where l_quantity in (
    select cr_return_quantity
    from snowflake_sample_data.tpcds_sf10tcl.catalog_returns -- 1.4b rows
    where cr_return_quantity between 4 and 10
    group by 1
)
and l_quantity in (
    select cr_return_quantity
    from snowflake_sample_data.tpcds_sf10tcl.catalog_returns -- 1.4b rows
    where cr_return_quantity between 2 and 100
    group by 1
)

We’ll come back to this query later, but first let me introduce Shruti Sekaran, whose goal throughout her Summer 2022 internship at Snowflake was to change the answer to this question:

Shruti: Optimizing the Snowflake optimizer

Hi Shruti, before we talk about SQL details — How did you get this internship? What are you studying?

I’m a grad student studying Computer Science at Arizona State University. A friend from undergrad works here as a software engineer. He loves it here and suggested that I apply, giving me a referral. Having been really interested in distributed systems and database internals, this seemed to be the perfect start. So, I jumped at this opportunity, and as they say, the rest is history!

Let’s talk about SQL optimization — what was the goal of your internship?

I’ve been working on the SQL team, particularly its core optimizer. My project was to add query-plan optimizations that eliminate redundant semi-joins and anti-joins from queries, improving execution time and performance for such queries. To detect the redundancy, I analyzed subsumption of general filters, join filters, and group-by aggregates.

Can you show me some example queries?

Sure, a straightforward case would be something like:

select *
from t1
where a in (
    select c
    from t2
    where b < 5
    and t1.a > 3
)
and a in (
    select c
    from t2
    where b < 3
    and t1.a > 5
)

Here, the first subquery, with filters b < 5 and t1.a > 3, is redundant: b < 5 subsumes b < 3 (any row that passes b < 3 also passes b < 5), and likewise t1.a > 3 subsumes t1.a > 5. So after my optimization, the query executed is effectively:

select *
from t1
where a in (
    select c
    from t2
    where b < 3
    and t1.a > 5
)
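The intuition is easy to check on toy data: an IN () semi-join whose filter is implied by another one adds nothing to the result. Here's a minimal sketch using Python's sqlite3 with hypothetical single-column tables (not the Snowflake optimizer itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1(a int);
    create table t2(b int);
    insert into t1 values (1), (2), (4), (7);
    insert into t2 values (1), (2), (4), (6);
""")

# Two semi-joins: the b < 5 filter is implied by b < 3,
# so its semi-join is redundant.
both = con.execute("""
    select a from t1
    where a in (select b from t2 where b < 5)
      and a in (select b from t2 where b < 3)
    order by a
""").fetchall()

# Keeping only the stronger filter returns the same rows.
reduced = con.execute("""
    select a from t1
    where a in (select b from t2 where b < 3)
    order by a
""").fetchall()

print(both == reduced)  # True
```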

Seems pretty simple, right? But the optimizations can get tricky in cases like:

with cte as (
    select *
    from t2
    where b < 100
)
select count(*)
from t1
where a in (
    select b
    from cte
    where b < 5
)
and a in (
    select b
    from cte
    where b < 50
)
and b not in (
    select b
    from cte
    where b < 30
)

Here, a in (select b from cte where b < 50) is eliminated, since b < 5 already implies b < 50.
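With some assumed toy data, sqlite3 can confirm that dropping the subsumed b < 50 semi-join leaves the count unchanged (a sketch of the equivalence, not of how Snowflake implements it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1(a int, b int);
    create table t2(b int);
    insert into t1 values (1, 10), (2, 35), (4, 40), (60, 2);
    insert into t2 values (1), (2), (4), (35), (60), (99);
""")

# Original query: two semi-joins on a, plus an anti-join on b.
full = con.execute("""
    with cte as (select * from t2 where b < 100)
    select count(*) from t1
    where a in (select b from cte where b < 5)
      and a in (select b from cte where b < 50)
      and b not in (select b from cte where b < 30)
""").fetchone()[0]

# Same query with the subsumed b < 50 semi-join removed.
reduced = con.execute("""
    with cte as (select * from t2 where b < 100)
    select count(*) from t1
    where a in (select b from cte where b < 5)
      and b not in (select b from cte where b < 30)
""").fetchone()[0]

print(full == reduced)  # True: same count either way
```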

Then in:

select count(*)
from t2
where a not in (
    select min(a)
    from t1
    group by b
)
and a not in (
    select min(a)
    from t1
    group by b, c
)

Here, a not in (select min(a) from t1 group by b) is eliminated: each per-b minimum is also the minimum of one of its (b, c) sub-groups, so the second anti-join already excludes those values.
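Why can the group by b anti-join go? Each per-b minimum is the smallest of its (b, c) sub-group minima, so it already appears in the group by b, c result. A quick randomized check in plain Python (hypothetical toy rows, just to illustrate the containment):

```python
import random
from collections import defaultdict

random.seed(0)
rows = [(random.randint(0, 9), random.randint(0, 2), random.randint(0, 2))
        for _ in range(200)]  # hypothetical (a, b, c) rows of t1

min_by_b = defaultdict(lambda: float("inf"))   # min(a) ... group by b
min_by_bc = defaultdict(lambda: float("inf"))  # min(a) ... group by b, c
for a, b, c in rows:
    min_by_b[b] = min(min_by_b[b], a)
    min_by_bc[(b, c)] = min(min_by_bc[(b, c)], a)

# Every per-b minimum is also a per-(b, c) minimum, so the first
# NOT IN anti-join excludes nothing that the second doesn't.
subset = set(min_by_b.values()) <= set(min_by_bc.values())
print(subset)  # True
```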

Shruti explaining query optimization

I love it — how much impact can these optimizations have?

I haven’t gotten a chance to test this extensively on customer queries yet, but in some of my sample queries, I’ve seen execution performance improve by 2x.

That’s because the optimization drops a redundant fragment from the plan. Eliminating one of two similar fragments gives at most a 2x improvement; eliminating 2 redundant fragments out of 3 can give a 3x gain, 3 out of 4 a 4x gain, and so on.

So these are very specific queries that will now run faster thanks to your work — were you part of a bigger team making Snowflake faster for everyone?

Yes, like I said, mine is the Compiler team, part of the DB Engineering org. Everyone on it is amazing! Extremely smart (lots of PhDs in databases), from very diverse backgrounds, all of them super warm and helpful. In the beginning, I was slightly intimidated, because my mentor and most others are very senior, but they are so supportive and patient that I became comfortable very soon and will be taking away many lessons from my time with them.

Overall, how was your experience as a Snowflake intern?

It was an exciting, delightful learning experience. I got to work on the very core of Snowflake, its compiler built from scratch, have my work go into production, and see for myself the performance improvement it brings! So now I not only have an in-depth, practical understanding of how a database compiler and optimizer work, but also of how features are rolled out to production in such rapid, yet robust release cycles.

At the same time, I’ve had a lot of fun playing board games and mixing cocktails, on Cheesy Thursdays and Waffle Wednesdays, and much more.

What was your favorite part of this internship?

That’s a tough one… I would say my favorite part was how well rounded this whole internship has been. The project has been a great learning opportunity, as well as impactful. Similarly, meeting and working with such diverse people! From interns to senior engineers, their different cultures and experiences have broadened my exposure too. Additionally, with various office and intern events, from ice cream trucks at the office to an SF Giants game, this whole journey has been extremely rewarding and memorable!

A flock of Snowflake interns at the Giants game

Any last words for anyone considering joining Snowflake as an intern, or full time Software Engineer?

I would wholeheartedly advise them to go for it! As a fresher, I think Snowflake had everything I was hoping for in an internship, and much more. If you’re someone who wants to work on interesting, impactful challenges while learning and having fun with super cool people every day, then this is the place for you!

Query optimized

Let’s go back to the opening query, and how to optimize it.

First, once the optimizations Shruti developed are in production across Snowflake, there will be no manual optimization for you to perform: the SQL compiler will take care of it. But in the meantime the answer is “remove the redundant semi-join”.

By removing the redundant semi-join, I was able to reduce the query runtime from 14s to 11s on a Small warehouse, with this equivalent query:

select sum(l_discount), current_time()
from snowflake_sample_data.tpch_sf1000.lineitem -- 6b rows
where l_quantity in (
    select cr_return_quantity
    from snowflake_sample_data.tpcds_sf10tcl.catalog_returns -- 1.4b rows
    where cr_return_quantity between 4 and 10
    group by 1
)

And these are screenshots of the query profiles, so you can compare the execution trees:

SQL execution tree simplified by removing a redundant semi-join

If you try the same query expecting the same results, you might soon be surprised by Snowflake optimizing these redundant semi-joins by itself. No human intervention required.

In fact, some of these optimizations are already at work. You might have noticed that the queries have a group by 1 within the sub-queries: this is a traditional optimization for sub-queries like these. Instead of feeding potentially thousands of identical rows into the following IN() join, you ask the SQL optimizer to output only the set of distinct values. But that optimization is no longer needed in Snowflake. In fact, when I removed the group by from all the above queries, both versions returned in 11s and 10s:
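Here's a small illustration with sqlite3 (hypothetical toy tables, with group by qty standing in for group by 1): since IN () is a set-membership test, duplicates in the sub-query never change the result, so the dedup was only ever about performance.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table lineitem(qty int);
    create table returns_(qty int);
    insert into lineitem values (4), (5), (9), (20);
    insert into returns_ values (4), (4), (4), (9), (9), (50);
""")

# IN () against the raw sub-query, duplicates and all.
plain = con.execute("""
    select count(*) from lineitem
    where qty in (select qty from returns_)
""").fetchone()[0]

# IN () against the deduplicated sub-query (same effect as group by 1).
dedup = con.execute("""
    select count(*) from lineitem
    where qty in (select qty from returns_ group by qty)
""").fetchone()[0]

print(plain == dedup)  # True: identical results either way
```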

SQL execution tree simplified by removing a redundant semi-join — execution time almost equal without the sub-queries `group by 1`. Both trees look almost identical, as Snowflake is already taking care of optimizing redundant semi-joins.

This shows that my old optimization trick, “reduce semi-joins to only distinct values”, can now make queries slower: the Snowflake compiler takes care of optimizations that previously had to be passed down through generations of communal knowledge.

As Shruti experienced, the compiler team at Snowflake is working hard to make these optimizations happen magically to all Snowflake users — just keep expecting more, faster, and better.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.


Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.