Published in Snowflake

Inside Snowflake — Optimizing the Query Optimizer: Redundant Semi-Joins and Anti-Joins

Shruti Sekaran is one of the amazing interns I met at Snowflake during the summer of 2022. I was immediately impressed by her goal: Improving the performance of some specific SQL patterns in Snowflake by “eliminating redundant semi-joins and anti-joins from queries”. Check this post for some SQL optimization tricks, and why you might not need them.

Shruti and Felipe discussing query optimization at the Snowflake office

SQL quiz: Optimize this

Before we go deeper, take a look at the following query. It scans 6 billion rows while performing two semi-joins against sub-queries that each scan 1.4 billion rows. How would you optimize it?

select sum(l_discount), current_time()
from snowflake_sample_data.tpch_sf1000.lineitem
-- 6b rows
where l_quantity in (
select cr_return_quantity
from snowflake_sample_data.tpcds_sf10tcl.catalog_returns
-- 1.4b rows
where cr_return_quantity between 4 and 10
group by 1
)
and l_quantity in (
select cr_return_quantity
from snowflake_sample_data.tpcds_sf10tcl.catalog_returns
-- 1.4b rows
where cr_return_quantity between 2 and 100
group by 1
)

We’ll come back to this query later, but first let me introduce you to Shruti Sekaran, whose goal throughout her Summer 2022 internship at Snowflake was to change the answer to this question:

Shruti: Optimizing the Snowflake optimizer

I’m a grad student studying Computer Science at Arizona State University. A friend from undergrad works here as a software engineer. He loves it here and suggested that I apply, giving me a referral. Since I’ve always been really interested in distributed systems and database internals, this seemed like the perfect start. So I jumped at the opportunity, and as they say, the rest is history!

I have been working on the SQL team, particularly its core Optimizer. My project was to add query-plan optimizations that eliminate redundant semi-joins and anti-joins from queries, thereby improving execution time and performance for such queries. For this elimination, I worked on analyzing subsumption of general filters, join filters, and group-by aggregates.

Sure, a straightforward case would be something like:

select *
from t1
where a in (
select b
from t2
where b < 5
and t1.a > 5
)
and a in (
select b
from t2
where b < 3
and t1.a > 3
)

Here, the first subquery with filters b<5 and t1.a>5 is redundant, as b<5 subsumes b<3 and similarly t1.a>5 subsumes t1.a>3. So after my optimization, the query executed is effectively:

select *
from t1
where a in (
select b
from t2
where b < 3
and t1.a > 3
)
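If you want to convince yourself of the equivalence, here’s a minimal sketch using Python’s sqlite3 module, with invented single-column data for t1(a) and t2(b) (my own toy tables, not from the post):

```python
import sqlite3

# Hypothetical tables t1(a) and t2(b) with small invented data.
con = sqlite3.connect(":memory:")
con.execute("create table t1 (a integer)")
con.execute("create table t2 (b integer)")
con.executemany("insert into t1 values (?)", [(i,) for i in range(-5, 15)])
con.executemany("insert into t2 values (?)", [(i,) for i in range(-5, 15)])

# Original form: two semi-joins; the (b < 5, t1.a > 5) one is redundant.
original = con.execute("""
    select a from t1
    where a in (select b from t2 where b < 5 and t1.a > 5)
      and a in (select b from t2 where b < 3 and t1.a > 3)
    order by a
""").fetchall()

# Simplified form: the redundant semi-join removed.
simplified = con.execute("""
    select a from t1
    where a in (select b from t2 where b < 3 and t1.a > 3)
    order by a
""").fetchall()

assert original == simplified  # same result, one semi-join fewer
```

Note that with these particular filters both result sets happen to be empty, since a would need to equal a b below 3 while also being greater than 3; the point is that both forms always agree, with one semi-join fewer.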

Seems pretty simple, right? But the optimization can get tricky in cases like:

with cte as (
select *
from t2
where b < 100
)
select count(*)
from t1
where a in (
select b
from cte
where b < 5
)
and a in (
select b
from cte
where b < 50
)
and b not in (
select b
from cte
where b < 30
)

Here, the semi-join a in (select b from cte where b < 50) is eliminated: any value satisfying b < 5 also satisfies b < 50, so that second semi-join adds nothing.
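The same kind of sqlite3 sketch, again with invented data (hypothetical tables t1(a, b) and t2(b) of my own making), confirms the count is unchanged when the subsumed semi-join is dropped:

```python
import sqlite3

# Invented data: t1(a, b) and t2(b), sized so the count is non-trivial.
con = sqlite3.connect(":memory:")
con.execute("create table t1 (a integer, b integer)")
con.execute("create table t2 (b integer)")
con.executemany("insert into t1 values (?, ?)", [(i, i + 30) for i in range(120)])
con.executemany("insert into t2 values (?)", [(i,) for i in range(120)])

# Full query: two semi-joins and one anti-join against the CTE.
full = con.execute("""
    with cte as (select * from t2 where b < 100)
    select count(*) from t1
    where a in (select b from cte where b < 5)
      and a in (select b from cte where b < 50)
      and b not in (select b from cte where b < 30)
""").fetchone()[0]

# Same query with the subsumed semi-join (b < 50) removed.
reduced = con.execute("""
    with cte as (select * from t2 where b < 100)
    select count(*) from t1
    where a in (select b from cte where b < 5)
      and b not in (select b from cte where b < 30)
""").fetchone()[0]

assert full == reduced  # both queries count the same rows
```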

Then in:

select count(*) 
from t2
where a not in (
select min(a)
from t1
group by b
)
and a not in (
select min(a)
from t1
group by b, c
)

Here, the anti-join a not in (select min(a) from t1 group by b) is eliminated: the minimum over each b group is also the minimum of one of its finer (b, c) sub-groups, so the coarser grouping produces a subset of the values returned by group by b, c, and excluding the larger set already excludes the smaller one.
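That subset relationship between the two groupings can be checked in plain Python over a handful of invented rows (hypothetical columns a, b, c standing in for t1):

```python
from collections import defaultdict

# Invented rows (a, b, c) standing in for table t1.
rows = [(5, 1, "x"), (3, 1, "y"), (9, 2, "x"), (7, 2, "x"), (1, 2, "y")]

min_by_b = defaultdict(lambda: float("inf"))
min_by_bc = defaultdict(lambda: float("inf"))
for a, b, c in rows:
    min_by_b[b] = min(min_by_b[b], a)
    min_by_bc[(b, c)] = min(min_by_bc[(b, c)], a)

# Every per-b minimum also appears among the per-(b, c) minimums,
# so "a not in (mins per b)" is implied by "a not in (mins per (b, c))".
assert set(min_by_b.values()) <= set(min_by_bc.values())
```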

Shruti explaining query optimization

I haven’t gotten a chance to test this extensively on customer queries yet, but on some of my sample queries, I’ve seen execution performance improve by 2x.

That’s because the optimization drops a redundant fragment from the query plan. Eliminating one redundant fragment yields at most a 2x performance improvement; eliminating two redundant fragments yields up to a 3x gain, three fragments up to 4x, and so on.

Yes, like I said, mine is the Compiler team, part of the DB Engineering org. Everyone on it is amazing! Extremely smart (lots of PhDs in databases), from very diverse backgrounds, all of them super warm and helpful. In the beginning, I was slightly intimidated, because my mentor and most others are very senior, but they are so supportive and patient that I became comfortable very soon and will be taking away many lessons from my time with them.

It was an exciting, delightful learning experience. I got to work in the very core of Snowflake: its compiler, which was built from scratch. I had my work go into production as well, and saw for myself the performance improvement it brings! So now I not only have an in-depth, practical understanding of how a database compiler and optimizer work, but also of how features are rolled out to production in such rapid, yet robust, release cycles.

At the same time, I’ve had a lot of fun: playing board games and mixing cocktails, Cheesy Thursdays and Waffle Wednesdays, and much more.

That’s a tough one… I would say my favorite part was how well-rounded this whole internship has been. The project has been a great learning opportunity, as well as impactful. The same goes for meeting and working with such diverse people! From interns to senior engineers, their different cultures and experiences have broadened my horizons too. And with the various office and intern events, from ice cream trucks at the office to an SF Giants game, this whole journey has been extremely rewarding and memorable!

A flock of Snowflake interns at the Giants game

I would wholeheartedly advise them to go for it! As a fresher, I think Snowflake had everything I was hoping for in an internship, and much more. If you’re someone who wants to work on interesting, impactful challenges while learning and having fun with super cool people every day, then this is the place for you!

Query optimized

Let’s go back to the opening query, and how to optimize it.

First, a note on the optimizations that Shruti developed, which will soon be in production across Snowflake: there’s no manual optimization that you need to perform, as the SQL compiler will take care of it. But in the meantime, the answer is to remove the redundant semi-join.

By removing the redundant semi-join, I was able to reduce the query runtime from 14s to 11s on a Small warehouse, with this equivalent query:

select sum(l_discount), current_time()
from snowflake_sample_data.tpch_sf1000.lineitem
-- 6b rows
where l_quantity in (
select cr_return_quantity
from snowflake_sample_data.tpcds_sf10tcl.catalog_returns
-- 1.4b rows
where cr_return_quantity between 4 and 10
group by 1
)

And these are screenshots of the query profiles, so you can compare the execution trees:

SQL execution tree simplified by removing a redundant semi-join

If you try the same query expecting the same results, you might soon be surprised to find Snowflake optimizing these redundant semi-joins by itself. No human intervention required.

In fact, some of these optimizations are already at work. You might have noticed that the queries have a group by 1 within the sub-queries: this is a traditional optimization for sub-queries like these. Instead of producing potentially thousands of identical rows for the subsequent IN() join, you can ask the SQL optimizer to output only the set of distinct values. But that optimization is not needed in Snowflake anymore. When I removed the group by from all the above queries, the two versions returned in 11s and 10s:

SQL execution tree simplified by removing a redundant semi-join — execution time almost equal without the sub-queries `group by 1`. Both trees look almost identical, as Snowflake is already taking care of optimizing redundant semi-joins.

This shows that my old optimization trick of reducing semi-joins to only distinct values was now making the queries slower: the Snowflake compiler can take care of optimizations that previously had to be passed down through generations of communal knowledge.

As Shruti experienced, the compiler team at Snowflake is working hard to make these optimizations happen magically to all Snowflake users — just keep expecting more, faster, and better.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
