Testing the Snowflake Query Acceleration Service with a 17 TB table

Snowflake has an easy way to make queries faster while using the smallest warehouses: The Query Acceleration Service has been ready for production (GA) since February 2023. Let’s test it right now by scanning 17 terabytes of GitHub events.


Amping up Snowflake’s compute

Traditionally, Snowflake has offered 2 easy ways of increasing compute power when dealing with larger queries and concurrency:

  • Scaling up: resizing a warehouse to a larger size, so individual heavy queries finish faster.
  • Scaling out: adding clusters to a multi-cluster warehouse, so more concurrent queries can run without queuing.

With these 2 basic elements, users are able to set up policies to control costs and divide resources between different teams and workloads.
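
Both knobs are one statement away. Here's a minimal sketch, assuming a hypothetical warehouse named my_wh (multi-cluster warehouses require Enterprise Edition or higher):

-- Scale up: resize so a single heavy query finishes faster
alter warehouse my_wh set warehouse_size = xlarge
;
-- Scale out: let the warehouse add clusters when many queries arrive at once
alter warehouse my_wh set min_cluster_count = 1 max_cluster_count = 4
;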

For example — for my Snowflake experiments I usually do everything on my own “Small Warehouse”. This keeps costs low, and it’s usually pretty fast and predictable. I only need to scale to larger warehouses when dealing with huge transformations and extracts, like the example we are going to play with today.

Scaling a warehouse up to get faster results on slower queries is super easy, barely an inconvenience. I can jump at any moment from a “Small WH” to an “Extra Large WH”, to a “Medium WH”, to a “4X-Large WH”, etc. This is cool, but then the question becomes: “How can I tell exactly what’s the best WH size for my upcoming queries?”

Instead of resizing warehouses, it would be really cool if I could run my whole session on a “Small WH” (or an “Extra Small WH”) and then I could have Snowflake automatically intercept my larger queries, and run them with way more resources in a “magic serverless” way.

And that’s exactly what the new Query Acceleration Service does. Let’s test it out here (with a Snowflake Enterprise Edition account).

Conversation and live demo featuring Query Acceleration Service

Extracting data from GitHub’s 17 Terabyte Archive

For this experiment we are going to look into GH Archive — a collection of all GitHub events. I did a lot of experiments with it in my past life at Google, and now Cybersyn has made a copy of the GH Archive on the Snowflake Marketplace.

To bring this dataset into your Snowflake account, just ask your Account Admin to import it at no cost:

Importing the GH Archive into your Snowflake account

For more tips, check my post “Querying GitHub Archive with Snowflake: The Essentials”. In the meantime let’s continue with a straightforward example.

Note in the above screenshot that I renamed the incoming database to GHARCHIVE for cleaner querying.

Once we have GHARCHIVE in our account, we can see 3 tables, with the main one being github_events:

select count(*)
from gharchive.cybersyn.github_events;
-- 1.4s
-- 6,966,010,260

That’s 7 billion rows of rich history, and a lot of data to deal with. The first step when exploring a dataset this large should be to extract a subset with just the rows we are interested in.

For example, this is the whole history of Apache Iceberg on GitHub:

create or replace table gharchive_iceberg
as
select *
from gharchive.cybersyn.github_events
where repo_id = 158256479
order by created_at
-- 19m 52s small
-- 96s xxlarge
;

This table extraction took only 96 seconds on a “2X Large WH”, but a long ~20 minutes on my usual “Small WH”.

Can the Query Acceleration Service (QAS) help here? There’s a very easy way to tell:

select system$estimate_query_acceleration('01b191db-0603-f84f-002f-a0030023f256');

And the response is “yes” (when using the query id from the ~20m run):

{
"queryUUID":"01b191db-0603-f84f-002f-a0030023f256",
"status":"eligible",
"originalQueryTime":1191.759,
"estimatedQueryTimes":{"1":608,"2":411,"4":254,"8":149,"31":55},
"upperLimitScaleFactor":31
}

Snowflake is telling us that the query took 1191s, and that if we had let QAS help, it could have taken between 608s and 55s, depending on the max scale factor we allowed (in this case, up to 31).
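
Since the function returns a JSON string, we can also unpack the per-scale-factor estimates with plain SQL. A quick sketch, using the same query id:

select f.key::int as scale_factor,
       f.value::number as estimated_seconds
from table(flatten(
    input => parse_json(
      system$estimate_query_acceleration('01b191db-0603-f84f-002f-a0030023f256')
    ):estimatedQueryTimes
  )) f
order by scale_factor
;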

To test QAS, I created a new WH. To make this test more dramatic, I made it an “Extra Small” with unlimited scaling power:

use role sysadmin
;
create warehouse xs_acc
warehouse_size = xsmall
enable_query_acceleration = true
-- 0 means no upper bound on the acceleration scale factor
query_acceleration_max_scale_factor = 0
;
grant usage on warehouse xs_acc to public
;
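
There’s no need to create a brand-new warehouse just for this; QAS can also be enabled on an existing one. A minimal sketch, assuming a hypothetical warehouse my_wh and a more conservative cap of 8x:

alter warehouse my_wh set
enable_query_acceleration = true
query_acceleration_max_scale_factor = 8
;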

If I use the “Extra Small WH with unlimited QAS”, Snowflake now automatically accelerates this query:

use warehouse xs_acc
;
create or replace table gharchive_iceberg
as
select *
from gharchive.cybersyn.github_events
where repo_id = 158256479
order by created_at
-- 19m 52s small
-- 96s xxlarge
-- 77s xs_acc
;
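
To confirm that QAS actually kicked in for a specific query, the account usage query history exposes a few acceleration columns (this needs a role with access to the SNOWFLAKE database, and the view can lag by up to ~45 minutes). A sketch:

select query_id,
       total_elapsed_time / 1000 as elapsed_s,
       query_acceleration_bytes_scanned,
       query_acceleration_upper_limit_scale_factor
from snowflake.account_usage.query_history
where warehouse_name = 'XS_ACC'
order by start_time desc
limit 10
;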

To check the cost of this QAS query that ran in 77s while I was working within an “Extra Small WH” session, we can look at the logs:

use role accountadmin
;
select * from
table(information_schema.query_acceleration_history())
;

-- CREDITS_USED WAREHOUSE_NAME NUM_FILES_SCANNED NUM_BYTES_SCANNED
-- 0.499199157 XS_ACC 296,389 5,025,053,295,104

We can see that the query scanned 5 terabytes of data, for a total cost of ~0.5 credits. Depending on the region, with an Enterprise Edition Snowflake account that should be around $1.50.
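
If the account has several warehouses using QAS, the same table function accepts filters, so we can narrow the history down. A sketch, looking at just the last day of XS_ACC activity:

select start_time, end_time, credits_used, num_bytes_scanned
from table(information_schema.query_acceleration_history(
  date_range_start => dateadd('day', -1, current_timestamp()),
  warehouse_name   => 'XS_ACC'))
;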

In comparison:

  • Small WH, Enterprise edition, 1192s: $3*2*1192/3600 = $1.99 (+ time between queries and auto-suspend)
  • 2XL WH, Enterprise edition, 96s: $3*32*96/3600 = $2.56 (+ time between queries and auto-suspend)
  • QAS 21x auto-acceleration, within an X-Small session, 77s: $3*0.5 = $1.50 (serverless model, no auto-suspend needed for the QAS queries; the XS session itself kept running during those 77s for an extra $3*1*77/3600 = $0.06)
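
For those who want to sanity-check that arithmetic in SQL, here’s a quick sketch (assuming $3 per credit, as in the Enterprise Edition numbers above):

select
  round(3 * 2  * 1192 / 3600, 2) as small_wh_usd,   -- Small WH: 2 credits/hour for 1192s
  round(3 * 32 * 96   / 3600, 2) as xxl_wh_usd,     -- 2X-Large WH: 32 credits/hour for 96s
  round(3 * 0.5,              2) as qas_usd,        -- QAS: ~0.5 serverless credits
  round(3 * 1  * 77   / 3600, 2) as xs_session_usd  -- XS session running during the 77s query
;
-- 1.99 | 2.56 | 1.5 | 0.06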

This is the power of the Query Acceleration Service: When it works, we don’t need to worry anymore about re-sizing warehouses, and we can let Snowflake take care of the huge queries that need extra power.


Query Acceleration Caveats

QAS is Generally Available (GA) in Snowflake and ready for you to use.

However you will notice that it’s picky on which queries it decides to accelerate — and I expect this set of supported queries to grow over time.

You can find a handy history of the queries in your account that could have been accelerated:

SELECT query_id, eligible_query_acceleration_time
FROM snowflake.account_usage.query_acceleration_eligible
ORDER BY eligible_query_acceleration_time DESC;
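
To see which warehouses would benefit the most, a hedged variant aggregates that eligible time per warehouse over the last week:

SELECT warehouse_name,
       COUNT(*) AS eligible_queries,
       ROUND(SUM(eligible_query_acceleration_time)) AS total_eligible_seconds
FROM snowflake.account_usage.query_acceleration_eligible
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_eligible_seconds DESC;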

The docs also list what kind of queries are not eligible for acceleration (for now):

Some queries are ineligible for query acceleration. The following are common reasons why a query cannot be accelerated:

- The query does not filter or aggregate.
- The filters are not selective enough. Alternatively, the GROUP BY expression has a high cardinality.
- There are not enough partitions. If there are not enough partitions to scan, the benefits of query acceleration are offset by the latency in acquiring resources for the query acceleration service.
- The query includes a LIMIT clause but does not have an ORDER BY clause.
- The query includes functions that return nondeterministic results (for example, SEQ or RANDOM).

For example this query could not get QAS during my tests:

select min(created_at), max(created_at), current_timestamp()
from gharchive.cybersyn.github_events
where repo_id = 158256479
limit 10

But this one, which produces the same results, does get accelerated, thanks to adding a group by and an order by:

select repo_id, min(created_at), max(created_at), current_timestamp()
from gharchive.cybersyn.github_events
where repo_id = 158256479
group by repo_id
order by repo_id
limit 10
-- 23s xs_acc

Next steps

Conversation and live demo featuring Query Acceleration Service

Want more?


Try this out with a Snowflake free trial account — you only need an email address to get started.

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Threads and LinkedIn (while increasingly less on Twitter). And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
