Published in Snowflake

# SQL quiz: Optimize this

Before we go deeper, check the following query. It scans 6 billion rows, while performing two semi-joins with sub-queries that go over 1.4 billion rows each. How would you optimize it?

```sql
select sum(l_discount), current_time()
from snowflake_sample_data.tpch_sf1000.lineitem -- 6b rows
where l_quantity in (
    select cr_return_quantity
    from snowflake_sample_data.tpcds_sf10tcl.catalog_returns -- 1.4b rows
    where cr_return_quantity between 4 and 10
    group by 1
)
and l_quantity in (
    select cr_return_quantity
    from snowflake_sample_data.tpcds_sf10tcl.catalog_returns -- 1.4b rows
    where cr_return_quantity between 2 and 100
    group by 1
)
```

We’ll come back to this query later, but first let me introduce you to Shruti Sekaran, whose goal during her Summer 2022 internship at Snowflake was to change the answer to this question:

# Shruti: Optimizing the Snowflake optimizer

I’m a grad student studying Computer Science at Arizona State University. A friend from undergrad works here as a software engineer. He loves it here and suggested that I apply, giving me a referral. Having always been really interested in distributed systems and database internals, I thought this would be the perfect start. So, I jumped at the opportunity, and as they say, the rest is history!

I have been working on the SQL team, particularly its core Optimizer. My project was to add query-plan optimizations that eliminate redundant semi-joins and anti-joins, thereby improving execution time and performance for such queries. For this elimination, I worked on analyzing subsumption of general filters, join filters, and group-by aggregates.

Sure, a straightforward case would be something like:

```sql
select * from t1
where a in (
  select *
  from t2
  where b < 5
  and t1.a > 5
) and a in (
  select *
  from t2
  where b < 3
  and t1.a > 3
)
```

Here, the first subquery, with filters `b < 5` and `t1.a > 5`, is redundant, as `b < 5` subsumes `b < 3` and similarly `t1.a > 5` subsumes `t1.a > 3`. So after my optimization, the query executed is effectively:

```sql
select * from t1
where a in (
  select *
  from t2
  where b < 3
  and t1.a > 3
)
```

Seems pretty simple, right? But the optimizations can get tricky in cases like:

```sql
with cte as (
  select *
  from t2
  where b < 100
)
select count(*) from t1
where a in (
  select b
  from cte
  where b < 5
) and a in (
  select b
  from cte
  where b < 50
) and b not in (
  select b
  from cte
  where b < 30
)
```

where `a in (select b from cte where b < 50)` is eliminated.
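To see this kind of subsumption for yourself outside Snowflake, here's a small sketch using SQLite with made-up toy data (the tables and values are hypothetical, for illustration only): the original query and the one with the subsumed semi-join removed return the same count, since every value in the `b < 5` set is also in the `b < 50` set.

```python
import sqlite3

# Hypothetical toy data: t1(a, b) and t2(b), mirroring the example's shape.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (a int, b int);
    create table t2 (b int);
    insert into t1 values (1, 10), (3, 40), (4, 20), (7, 60), (2, 35);
    insert into t2 values (1), (2), (3), (4), (6), (20), (35), (45), (99);
""")

full = """
    with cte as (select * from t2 where b < 100)
    select count(*) from t1
    where a in (select b from cte where b < 5)
      and a in (select b from cte where b < 50)  -- subsumed by b < 5
      and b not in (select b from cte where b < 30)
"""
# The same query after eliminating the subsumed semi-join.
reduced = """
    with cte as (select * from t2 where b < 100)
    select count(*) from t1
    where a in (select b from cte where b < 5)
      and b not in (select b from cte where b < 30)
"""
# Both queries count the same rows.
assert con.execute(full).fetchone() == con.execute(reduced).fetchone()
```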

Then in:

```sql
select count(*) from t2
where a not in (
  select min(a)
  from t1
  group by b
) and a not in (
  select min(a)
  from t1
  group by b, c
)
```

where `a not in (select min(a) from t1 group by b)` is eliminated.
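A quick way to convince yourself of the group-by case: every per-`b` minimum is also the minimum of one of its `(b, c)` subgroups, so the first `NOT IN` set is a subset of the second, and the first anti-join filters nothing extra. A toy SQLite sketch (hypothetical tables and data, chosen to avoid NULLs, which change `NOT IN` semantics):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (a int, b text, c text);
    create table t2 (a int);
    insert into t1 values (1,'x','p'), (2,'x','q'), (3,'y','p'), (5,'y','q');
    insert into t2 values (1), (2), (3), (4), (5), (6);
""")

full = """
    select count(*) from t2
    where a not in (select min(a) from t1 group by b)     -- yields {1, 3}
      and a not in (select min(a) from t1 group by b, c)  -- yields {1, 2, 3, 5}
"""
# The coarser group-by produces a subset, so its anti-join can be dropped.
reduced = """
    select count(*) from t2
    where a not in (select min(a) from t1 group by b, c)
"""
assert con.execute(full).fetchone() == con.execute(reduced).fetchone()
```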

I haven’t gotten a chance to test this extensively on customer queries yet, but in some of my sample queries, I’ve seen the execution performance improve by about 2x.

This is because the optimization drops a redundant plan fragment from the plan. If it identifies one redundant fragment, that’s at most a 2x improvement. If it ends up eliminating 2 redundant fragments, it’ll be up to a 3x gain; if it’s 3 fragments, up to 4x, and so on.
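The arithmetic behind those numbers can be sketched with a deliberately simplistic cost model (an assumption for illustration, not Snowflake's actual cost model): every fragment costs about the same, and execution time is just the number of fragments.

```python
# Simplistic model: a plan with one needed fragment plus k redundant
# fragments of equal cost runs in (k + 1) cost units; eliminating the
# redundant ones leaves 1 unit, so the best-case speedup is (k + 1)x.
def max_speedup(redundant_fragments: int) -> int:
    total_before = redundant_fragments + 1  # needed + redundant fragments
    total_after = 1                         # only the needed fragment remains
    return total_before // total_after

assert max_speedup(1) == 2  # one redundant fragment: at most 2x
assert max_speedup(2) == 3  # two: up to 3x
assert max_speedup(3) == 4  # three: up to 4x
```

In practice the fragments rarely cost exactly the same, which is why the observed gains are "at most" these ratios.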

Yes, like I said, mine is the Compiler team, part of the DB Engineering org. Everyone on it is amazing! Extremely smart (lots of PhDs in databases), from very diverse backgrounds, all of them super warm and helpful. In the beginning, I was slightly intimidated, because my mentor and most others are very senior, but they are so supportive and patient that I became comfortable very soon and will be taking away many lessons from my time with them.

It was an exciting, delightful learning experience. I got to work at the very core of Snowflake: its compiler, which was built from scratch. I had my work go into production and saw for myself the performance improvement it brings! So now I have not only an in-depth, practical understanding of how a database compiler and optimizer work, but also of how features are rolled out in such rapid, yet robust release cycles.

At the same time, I’ve had a lot of fun playing board games and mixing cocktails, on Cheesy Thursdays and Waffle Wednesdays, and much more.

That’s a tough one… I would say my favorite part was how well rounded this whole internship has been. I mean, the project has been a great learning opportunity, as well as impactful. Similarly, meeting and working with such diverse people! From interns to senior engineers, their different cultures and experiences have broadened my exposure too. Additionally, with various office and intern events, from ice cream trucks at the office to an SF Giants game, this whole journey has been extremely rewarding and memorable!

I would wholeheartedly advise them to go for it! As a fresher, I think Snowflake had everything I was hoping for in an internship, and much more. If you’re someone who wants to work on interesting impactful challenges, while learning and having fun with super cool people everyday, then this is the place for you!

# Query optimized

Let’s go back to the opening query, and how to optimize it.

First, once the optimizations that Shruti developed are in production across Snowflake, there will be no manual optimization for you to perform; the SQL compiler will take care of it. In the meantime, the answer is "remove the redundant semi-join".

I was able to reduce the query runtime from 14s to 11s on a Small warehouse by removing the redundant semi-join, with this equivalent query:

```sql
select sum(l_discount), current_time()
from snowflake_sample_data.tpch_sf1000.lineitem -- 6b rows
where l_quantity in (
    select cr_return_quantity
    from snowflake_sample_data.tpcds_sf10tcl.catalog_returns -- 1.4b rows
    where cr_return_quantity between 4 and 10
    group by 1
)
```

And these are screenshots of the query profiles, so you can compare the execution trees:

If you try the same query expecting the same results, you might soon be surprised by Snowflake optimizing these redundant semi-joins by itself. No human intervention required.

In fact, some of these optimizations are already at work. You might have noticed that the queries have a `group by 1` within the sub-queries. This is a traditional optimization for sub-queries like these: instead of outputting potentially thousands of identical rows into the following `IN()` join, you ask the SQL optimizer to produce only the set of distinct values. But that optimization is no longer needed in Snowflake. In fact, when I removed the `group by` from all the above queries, the two versions returned in 11s and 10s:
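The semantic point here is that deduplicating inside an `IN()` subquery never changes the result, only (potentially) the execution cost, which is why the compiler is free to add or drop that step. A small SQLite sketch with made-up data illustrates this (I use `select distinct` rather than `group by 1`, since ordinal group-by is a convenience that not every engine resolves the same way):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (a int);
    create table t2 (b int);
    insert into t1 values (1), (2), (3), (4);
    insert into t2 values (2), (2), (2), (3), (3), (9);
""")

plain = "select count(*) from t1 where a in (select b from t2)"
deduped = "select count(*) from t1 where a in (select distinct b from t2)"

# IN() only tests membership, so duplicates in the subquery's
# output cannot change the answer.
assert con.execute(plain).fetchone() == con.execute(deduped).fetchone()
```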

This shows that my old optimization trick of "reduce semi-joins to only distinct values" was now making the queries slower: the Snowflake compiler can take care of optimizations that previously you had to pass down through generations of communal knowledge.

As Shruti experienced, the compiler team at Snowflake is working hard to make these optimizations happen magically to all Snowflake users — just keep expecting more, faster, and better.

# Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

