Deep performance analysis with the new “query operator stats” in Snowflake

Snowflake users can finally dig deep into query stats with this new function. This data was only available in the query profile visualization, and now you can get it in your SQL world with GET_QUERY_OPERATOR_STATS(). Let’s review the new function, and how to use it to get deeper insights while benchmarking.

Felipe Hoffa · Published in Snowflake · Feb 7 · 10 min read

The old ways

Previously you could find deep stats only on the query profile visualization UI:

Snowflake classic query profile interactive visualization

This is a pretty cool view that can help you understand the performance details of each query: how many steps it took, how many rows were the input and output of each step, which step took the most time, how much data was scanned, and so on.

What’s new: GET_QUERY_OPERATOR_STATS()

But people wanted more: many users were not happy getting everything in a cool interactive visualization; they wanted the same data in a table full of columns and numbers.

Snowflake listened. And Snowflake now has the new get_query_operator_stats() function. Check out the docs and the announcement blog post.

To get the stats for one query, you just need to run:

select *
from table(get_query_operator_stats(last_query_id()));

That gives you a detailed table with all the stats of the previous query:

Query stats for a random query
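For example, to see which operators dominate execution time, you can pull out just a couple of columns (a minimal sketch, using the same JSON paths we'll rely on later in this post):

select operator_id, operator_type
, execution_time_breakdown:overall_percentage::float time_percent
from table(get_query_operator_stats(last_query_id()))
order by time_percent desc;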

What’s missing: Going beyond one query

While that’s pretty cool, people are now asking how to go beyond one query. They don’t want to go one by one; they want the stats for all of their queries. So let’s use the rest of this post to work out a way to do that.

Using the new query stats to analyze a whole run of TPCH benchmarks

Setup

Our friend Mimoune Djouallah has been running benchmarks of several databases, and I took his code to run TPC-H SF10 over the sample data that Snowflake already provides in your account. Just make sure to use the schema snowflake_sample_data.tpch_sf10 when running these queries.

Then I ran these benchmark queries 4 times: twice on a Small warehouse, and twice on an XL warehouse.
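Each of those queries carries an identifying comment that we can later extract from the query history. A hypothetical illustration (the real queries come from Mimoune's benchmark code):

select count(*) --Query01
from snowflake_sample_data.tpch_sf10.lineitem;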

After running these queries, I used the following query to gather all the basic stats:

create or replace table query_sessions2
as
select *, sum(is_session_start) over(order by start_time) session_id
from (
select query_id, query_text, query_type, warehouse_size, start_time, total_elapsed_time
, bytes_scanned, rows_produced, compilation_time, execution_time
, queued_provisioning_time, query_tag
, regexp_substr(query_text, '--Query(..)', 1,1,'',1)::int qid
, iff(qid < lag(qid) over(order by start_time) ,1,0) is_session_start
from table(information_schema.query_history(result_limit=>1000))
where bytes_scanned > 1000
and query_text like '%--Query%'
and start_time > '2023-02-06 18:00'
order by start_time desc
);

The above query does this:

  • It looks at all queries in the recent query history, via the classic information_schema.query_history table function.
  • It keeps only the queries annotated with a comment like --Query11. This is how we find the TPC-H related queries, and it lets us extract the id of each query for our convenience.
  • We are only interested in queries that scanned more than 1000 bytes (make sure to turn off the result cache so the queries actually run every time; see the snippet after this list).
  • Then it uses a combination of each query id and lag() to identify each distinct benchmark run, and gives each run its own session_id.
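Turning off the result cache takes one session parameter:

-- disable the result cache so repeated benchmark queries actually execute
alter session set use_cached_result = false;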

None of this is new, but it’s pretty convenient to identify each of the benchmarking runs and how long it took:

select warehouse_size, session_id, query_tag, sum(total_elapsed_time), sum(bytes_scanned)
, count(*), min(start_time) start_time
from query_sessions2
group by 1, 2, 3
order by session_id;
The numbers for each run of the TPCH benchmarks

Note that I also used the handy alter session set query_tag = 'small-cold'; to annotate each query with a manual tag.
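A plausible tagging sequence for the four runs looks like this (tag names other than 'small-cold' are my assumption):

alter session set query_tag = 'small-cold';
-- run the 22 benchmark queries, then re-tag before the warm run
alter session set query_tag = 'small-warm';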

The table shows that running the 22 queries on a Small warehouse took 32s the 1st time, and 18s the 2nd time. Then on an XL warehouse it took 25s the 1st time, and 13s the 2nd time. We also got to use the session_id we created earlier, which makes our stats easier to compute.

You probably already know this: The XL warehouse is faster as it offers more parallelism, and the second run of each is faster as it uses the Snowflake local disk cache. This means that your queries on Snowflake are faster on warm data — which is the typical scenario when you are running a series of queries.

Let’s go deeper — get all the query stats

So now comes the trick to go over these 88 queries (22*4) and get their query stats: a little SQL script.

First let’s create a table that will hold the query stats we are going to collect, plus the convenient TPCH query id and session id we produced earlier:

create or replace table query_operator_stats_cache2
as
select 1::integer qid, 1::integer session_id, ''::string query_tag, *
from table(get_query_operator_stats(last_query_id()))
-- just for the schema
limit 0;

The trick in the above query is to run get_query_operator_stats over an arbitrary query just to re-use its schema, with limit 0 since in this case I only want the schema, not any rows.
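A quick way to sanity-check that the empty table picked up the expected schema:

describe table query_operator_stats_cache2;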

Then the following script can go over the 88 queries we ran during our benchmarks, and store the stats for each in the table we just created — plus our convenient query id and session id:

declare
  query_id string;
  query_tag string;
  qid int;
  session_id int;
  c1 cursor for
    select query_id, qid, session_id, query_tag
    from query_sessions2
    where query_id not in (select query_id from query_operator_stats_cache2);
begin
  for record in c1 do
    -- copy the cursor columns into variables we can bind below
    query_id := record.query_id;
    qid := record.qid;
    session_id := record.session_id;
    query_tag := record.query_tag;
    insert into query_operator_stats_cache2
    select :qid, :session_id, :query_tag, *
    from table(get_query_operator_stats(:query_id));
  end for;
  return query_id;
end;

Note that I had to run this script many times, as get_query_operator_stats() can time out. The query_id not in filter makes sure that each run of the script only adds the missing data.
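Between runs, a quick count shows how far along the collection is:

-- how many of the 88 queries have their stats stored so far
select count(distinct query_id) from query_operator_stats_cache2;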

Analyzing the table scans

Now that we have stored the detailed query stats for 88 queries, we can have fun analyzing them. For example, this view will isolate the stats of only the table scan steps:

create or replace view analyze_table_scans2
as
select qid, query_id, session_id, query_tag
, operator_attributes:table_name::string tablename
, array_size(operator_attributes:columns) table_ncols
, execution_time_breakdown:overall_percentage::float time_percent
, execution_time_breakdown:local_disk_io::float local_disk_io
, execution_time_breakdown:processing::float processing
, execution_time_breakdown:synchronization::float synchronization
, operator_statistics:io:bytes_scanned::int bytes_scanned
, operator_statistics:io:percentage_scanned_from_cache::float percentage_scanned_from_cache
, operator_statistics:network:network_bytes::int network_bytes
, operator_statistics:pruning:partitions_scanned::int scanned
, operator_statistics:pruning:partitions_total::int total
, scanned/total ratio
from query_operator_stats_cache2
where operator_type = 'TableScan';
Detailed stats for each table scanned within each query

With easy access to these stats, now we can focus on any aspect — for example on what percentage of data was scanned from the local disk cache on each session:

select session_id, query_tag, avg(percentage_scanned_from_cache)
from analyze_table_scans2
group by 1, 2
order by session_id;
Average percentage scanned from cache for each benchmark run

Looking at these results we learn something new: the second run was faster for both the S and the XL sessions, because in both second runs the data was “warm” and available to be scanned from cache. But only the 2nd run on the XL warehouse got 100% of the data out of the local disk cache; the S warehouse still had to go fetch some of it on the warm run.

This shows that the XL warehouse is not only faster because it offers more parallelism to run each query; it also has a larger local disk cache that, in this case, is able to hold all the scanned data warm between rounds.

Then, with the detailed stats, you can find out which tables had to be re-read from cold storage, and in which queries:

select qid, tablename, percentage_scanned_from_cache, bytes_scanned
from analyze_table_scans2
where session_id=1
order by percentage_scanned_from_cache;
Tables that had to bring data back from cold storage in the 2nd run on the Small warehouse

Here you can see that it’s the LINEITEM table that had to be re-read from cold storage during the 2nd run on the Small warehouse, within queries #19, #12, #1, and #3, though in all cases it was only a small percentage of that table. Meanwhile, on the XL warehouse all the warm data stayed cached on the local disk.

Improving performance: Partition pruning

There’s something interesting happening in all these queries. We can see that almost no partition pruning happened:

select sum(scanned) partitions_scanned, sum(total) partitions_total, partitions_scanned/partitions_total ratio
from analyze_table_scans2;
Almost no partition pruning

It turns out the tables shared by Snowflake in the snowflake_sample_data.tpch_sf10 schema have not been optimized for querying.

To fix this, I re-materialized the 2 biggest tables in my own schema — this time ordering the rows by date:

create table lineitem as
select *
from snowflake_sample_data.tpch_sf10.lineitem
order by l_shipdate;

create table orders as
select *
from snowflake_sample_data.tpch_sf10.orders
order by o_orderdate;
Sorting and re-materializing on a Small cold warehouse: 34.3s
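To verify how well a re-materialized table lines up with the new ordering, Snowflake can report clustering information for a given set of columns:

-- inspect how well lineitem is now clustered on the ship date
select system$clustering_information('lineitem', '(l_shipdate)');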

Then I created views so the original queries could find the existing smaller tables when run in the new schema:

create view part as select *
from snowflake_sample_data.tpch_sf10.part;
create view supplier as select *
from snowflake_sample_data.tpch_sf10.supplier;
create view partsupp as select *
from snowflake_sample_data.tpch_sf10.partsupp;
create view nation as select *
from snowflake_sample_data.tpch_sf10.nation;
create view region as select *
from snowflake_sample_data.tpch_sf10.region;
create view customer as select *
from snowflake_sample_data.tpch_sf10.customer;

With this done, I was able to get faster results, thanks to some effective partition pruning:

select session_id
, sum(scanned) partitions_scanned
, sum(total) partitions_total
, partitions_scanned/partitions_total ratio
, count(distinct qid) queries
from analyze_table_scans2
group by 1
order by 1;
Sessions 4 to 7 had 13% fewer total partitions, and scanned 36% fewer partitions

Even better, the time for a cold run on a Small warehouse went from 32.4s down to 28.6s — an 11% performance improvement.

select warehouse_size, session_id, query_tag, sum(total_elapsed_time), sum(bytes_scanned)
, count(*), min(start_time) start_time
from query_sessions2
group by 1, 2, 3
order by session_id;
Session 4 was 18% faster than session 0: Pruning wins.

Lessons learned

  • You can use the new get_query_operator_stats() to get deep stats for your queries.
  • Within these deep stats you can find information like the # of partitions scanned by each query.
  • Storing tables ordered by date can help a lot with pruning — as in this specific case.
  • Snowflake should re-materialize the shared sample tpch_sf10 tables, so anyone benchmarking them can get faster results with no effort.
  • Even though Snowflake had not optimized these tables for the benchmark, Mimoune Djouallah found that Snowflake was the fastest of the candidates he benchmarked. Snowflake is fast — and with small changes we can make it even faster.
  • Note that any benchmarks shown here reflect an arbitrary setup at an arbitrary point in time. Your numbers might vary.

Now it’s time to run your own experiments, and please report back your results.

Next steps

  • Check the docs for an example on Identifying “Exploding” Join Operators (a rough sketch follows this list).
  • Run your own experiments with get_query_operator_stats.
  • Please share the results with us.
  • Who is going to use all these stats to write a cool new way to visualize the query profiles? (Bonus points if you use Streamlit).
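A rough sketch of what that exploding-join check might look like (the 10x threshold here is an arbitrary assumption):

-- flag join operators that produce many more rows than they consume
select operator_id
, operator_statistics:input_rows::int input_rows
, operator_statistics:output_rows::int output_rows
from table(get_query_operator_stats(last_query_id()))
where operator_type = 'Join'
and operator_statistics:output_rows::int > 10 * operator_statistics:input_rows::int;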

Upcoming visualizations

Teaser: I know of some pretty cool visualizations from Snowflake partners that will soon be ready to share. Stay tuned!

Some ideas

TJ Murphy has a lot of great ideas:

  1. Detect and alert when operators in a query are bad: table scans that aren’t pruning, exploding joins, spillage, etc. (see the sketch after this list).
  2. Cross-query cost monitoring. You can use this to classify query costs into more semantically meaningful groups: read, write, transform, aggregate, etc. How you optimize and manage each of those groups is different, so split them out.
  3. Adaptive query batching. Use historic profiles of the same query (like a dbt model) and use that to determine if it should be batched. Batching a transform costs more in IO but less in compute in total when the query is memory constrained or explodes rows.
  4. Column-level cost contribution. There are some advanced optimization techniques that require column-level metadata. A query could start to spill because you included 1 extra column that’s too wide: expensive to hold in memory. You can almost infer operator-level memory usage.
  5. Join condition detection. In simple cases, you can check all the Join operators, use sqlglot to parse the join conditions, and check for things like range joins, UDF usage, regex usage, high-memory columns, etc. Then alert, or get fancy and mitigate by auto-rewriting the query.
  6. CTE inlining failure. 95% of the time, CTEs are safe because the optimizer will take care of it. When that doesn’t happen, you can detect it and alert or rewrite.
  7. Better reclustering. Aggregate the predicates used on your tables to determine a more optimal clustering layout. I even think you could probably leverage no-op materialized views to cleanly get 2+ clustering layouts on the same table with minimal overhead.
  8. Detect unnecessary filters. Most filtering operations are standalone ops. Just check them all to see if they even do anything. Eliminate the ones that don’t.

(thx @teej_m)
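For idea #1, here’s a minimal sketch that flags table scans that barely prune, using the stats table we built earlier (the 90% threshold is an arbitrary assumption):

-- flag table scans that read almost every partition
select query_id, operator_id
, operator_attributes:table_name::string tablename
, operator_statistics:pruning:partitions_scanned::int scanned
, operator_statistics:pruning:partitions_total::int total
from query_operator_stats_cache2
where operator_type = 'TableScan'
and operator_statistics:pruning:partitions_scanned::int >= 0.9 * operator_statistics:pruning:partitions_total::int;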

Performance notes

Consider Mimoune Djouallah impressed.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
