Snowflake’s Virtual Warehouse: Know the Unknowns!

Snowflake Virtual Warehouse

VIRTUAL WAREHOUSE & ITS COMPONENTS:

Many blogs have been written about the Snowflake virtual warehouse. In this one I will explain some key concepts and give a POV on the internals of a warehouse. It should help address questions about warehouses that we hear very often, such as:

What is a node? What is a thread? What is a machine? Can a single-node configuration address the problem of concurrency? And so on.

Snowflake virtual warehouse:

Snowflake deploys compute nodes from the cloud hyperscalers (AWS, Google Cloud, Azure) to build warehouses. An X-Small warehouse uses one compute node, a Small uses two, a Medium uses four, and so on. All nodes, independent of cloud provider, have 8 cores/threads.
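Putting the doubling pattern together with 8 cores per node, the size-to-compute mapping works out roughly as follows (indicative numbers only; the underlying hardware varies by region/hyperscaler and is subject to change):

Size      | Nodes | Cores/Threads
X-Small   | 1     | 8
Small     | 2     | 16
Medium    | 4     | 32
Large     | 8     | 64
X-Large   | 16    | 128
2X-Large  | 32    | 256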

Logically, then, if we set up an X-Small warehouse on AWS, Snowflake essentially spins up an EC2 instance at the backend:

X-Small Warehouse to EC2 mapping.

To elaborate further, a single-node warehouse can be thought of as a small supercomputer in its own right: it is spun up with 8 powerful CPUs (cores/threads), 16 GB of RAM, and SSD storage (commonly called local storage or the warehouse cache), all packaged in an AWS EC2 instance. So think of your virtual warehouse as an EC2 instance that is completely and transparently managed by Snowflake. Even an X-Small warehouse, which is always a single node, gives us considerable power to support data workloads. The configurations I am mentioning are continuously reviewed for each region/hyperscaler and are subject to change.

Below is an example and a POV on how we can run a simulation test of this behavior on a virtual warehouse.

--------------------------------------------
-- SESSION 1 (in one of your worksheets):
--------------------------------------------
CREATE OR REPLACE WAREHOUSE demo_xsmall_wh WITH WAREHOUSE_SIZE='X-SMALL'; -- Create the X-Small warehouse.
ALTER WAREHOUSE demo_xsmall_wh SET MAX_CONCURRENCY_LEVEL=1; -- Must be changed; the default is 8.

-- Below is a big table of 100 million records.
create or replace table DEMO_DB.DEMO_SCHEMA.CUSTOMER_TABLE
as select * from SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER;

-- Validate the parameter change.
show parameters in warehouse demo_xsmall_wh;

-- Execute the below query.
select * from DEMO_DB.DEMO_SCHEMA.CUSTOMER_TABLE;

--------------------------------------------
-- SESSION 2 (2nd worksheet):
--------------------------------------------
use warehouse demo_xsmall_wh;

-- Execute the below query.
select * from DEMO_DB.DEMO_SCHEMA.RANGE_TABLE; -- A small table with 5 records, created beforehand.


Please note that both queries should run in parallel: just after starting the first query, switch to the second worksheet and run the second one.


-- What would you observe? --

1st iteration:

The query run in the second session does not execute; it goes into a queued state because the query in the first session runs for longer. Only after a long wait does the second query succeed. This happens because MAX_CONCURRENCY_LEVEL=1 set at the warehouse level allows only one query to run at a time.

2nd iteration:

ALTER WAREHOUSE demo_xsmall_wh SET MAX_CONCURRENCY_LEVEL=2;

Then execute the same steps as in the 1st iteration. This time you will see the second query's results come back quickly while the first query keeps running.
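If you want to verify the queuing rather than just eyeball it, the QUERY_HISTORY table function reports how long each query waited because the warehouse was busy. A minimal sketch, assuming the demo queries ran recently enough to appear in the function's window:

-- QUEUED_OVERLOAD_TIME (in milliseconds) is non-zero for the query that waited.
SELECT query_text,
       warehouse_name,
       queued_overload_time,
       total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 20))
WHERE warehouse_name = 'DEMO_XSMALL_WH'
ORDER BY start_time DESC;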

Some key points to note here:

The MAX_CONCURRENCY_LEVEL parameter defaults to 8 whenever we configure a new warehouse, whatever its size (X-Small, Small, Medium, etc.). For example, a Medium warehouse, backed by 4 nodes/32 CPUs of EC2 compute, still starts with MAX_CONCURRENCY_LEVEL=8. If there were a need to reduce this warehouse-level parameter to 1 for a Medium warehouse, the entire power of the warehouse could be devoted to running one query at a time, as sketched below.
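A sketch of that scenario (demo_medium_wh is a hypothetical warehouse name, assumed to already exist):

-- Dedicate all 4 nodes / 32 cores of a Medium warehouse to one query at a time.
ALTER WAREHOUSE demo_medium_wh SET MAX_CONCURRENCY_LEVEL = 1;
-- Verify the change.
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE demo_medium_wh;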

The idea behind mentioning these configurations is to convey that even an X-Small warehouse is still very powerful.

A view on Warehouse node to thread mapping

The above image gives the node-to-core/CPU mapping for a single-cluster warehouse. I have shown sizes up to 2XL; the mapping extends to 6XL with the same ratio. The key commands to get these insights are as follows:

CREATE OR REPLACE WAREHOUSE DEMO_WH_XSMALL WITH WAREHOUSE_SIZE='X-SMALL';
CREATE OR REPLACE WAREHOUSE DEMO_WH_SMALL WITH WAREHOUSE_SIZE='SMALL';

SHOW WAREHOUSES LIKE '%SMALL%';
SHOW PARAMETERS IN WAREHOUSE DEMO_WH_SMALL; -- The output of this command is shown below.
The default MAX_CONCURRENCY_LEVEL value is 8, even for an X-Small warehouse.

As the above example showed, the default can also be lowered based on need. What does this example signify? It means that although our warehouse is configured as X-Small, and is a single-cluster warehouse, it still has the potential to run parallel processes. Consider another use case: if we run 8 parallel sessions, each processing a small volume of data, this configuration can be an option for meeting concurrent workload needs. Another example is around the processing of files:

Parallel Processing of files

Warehouses of different sizes can process different numbers of files in parallel. The number of nodes doubles as the warehouse size grows, and each node has eight threads, each able to process one file. This implies that even the smallest warehouse (XS), with just one node and 8 threads, can process up to 8 files concurrently!
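To make this concrete, here is a minimal sketch of a bulk load where the file count is what the warehouse parallelizes over (the stage, file, and table names are hypothetical):

-- A stage holding, say, 8 CSV files: an X-Small warehouse can work on all 8 at once.
CREATE OR REPLACE STAGE demo_stage;
-- Files would be uploaded beforehand, e.g. via SnowSQL:
-- PUT file:///tmp/customers_part_*.csv @demo_stage;
COPY INTO DEMO_DB.DEMO_SCHEMA.CUSTOMER_STAGING
FROM @demo_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

This is also why the common guidance is to split very large files into many smaller chunks before loading: more files keeps more threads busy.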

Processing Power of Warehouse:

Snowflake uses parallel processing to run a query across several cores. Larger warehouses often run queries faster because their higher core count gives them more processing power.
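A practical consequence: a long-running query on an undersized warehouse can often be sped up by resizing, since a resize takes effect for newly submitted queries without downtime. A minimal sketch, reusing the demo warehouse from earlier:

-- Scale up before a heavy query, then scale back down to save credits.
ALTER WAREHOUSE demo_xsmall_wh SET WAREHOUSE_SIZE = 'MEDIUM';
-- ... run the heavy query ...
ALTER WAREHOUSE demo_xsmall_wh SET WAREHOUSE_SIZE = 'X-SMALL';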

Data processing needs somewhere to keep intermediate data sets. RAM is the fastest practical place to store and retrieve that data, second only to the CPU caches. Once RAM is depleted, Snowflake will use SSD local storage to hold data between query execution steps. This behavior is called spilling to local storage; local storage is still fast to access, but not as quick as RAM. If local storage also becomes exhausted, Snowflake will spill to remote storage.

Remote storage is the cloud provider's object store, for example S3 on AWS (one of Snowflake's hyperscalers). Although accessing remote storage is significantly slower than local storage, remote storage is effectively unlimited, so Snowflake will never terminate a query with an out-of-memory error.
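You can see how much a query spilled, and where, from the account usage query history. A minimal sketch (the ACCOUNT_USAGE views require privileges on the SNOWFLAKE database and can lag by up to about 45 minutes):

-- Spilled bytes per query; remote spilling is the more expensive signal.
SELECT query_id,
       warehouse_name,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE bytes_spilled_to_local_storage > 0
   OR bytes_spilled_to_remote_storage > 0
ORDER BY start_time DESC
LIMIT 20;

If remote spilling shows up regularly, the usual fixes are a larger warehouse (more RAM available to the query) or rewriting the query to shrink its intermediate results.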

Snowflake Internal Compute Architectural Components:

Below is a list of the components of Snowflake's compute layer that provide all the muscle power for the data workloads running in the environment.

Virtual Warehouse Components

Because Snowflake requires almost no administration, we don't need to install or update any software. A warehouse we create has everything we need to manage the workload. We don't have to worry about the warehouse's underlying configuration and settings because, unlike with a virtual machine, we never get direct access to it. Below is the architecture showing how the three key architectural components talk to each other, with compute sitting in the middle.

Snowflake-Internal Architecture.
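The only "administration" left to us is declarative: everything about a warehouse is set through SQL parameters. A sketch with commonly used settings (the warehouse name and values are illustrative):

-- Create a self-managing warehouse: it suspends when idle and wakes on demand.
CREATE OR REPLACE WAREHOUSE demo_managed_wh WITH
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE         -- resume automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;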

SUMMARY:

A virtual warehouse can process multiple queries concurrently, so for optimal efficiency it is best to keep warehouses fully loaded with minimal queuing. This post is a POV to build a detailed understanding of the virtual warehouse; Snowflake revises these configurations from time to time, so treat the design view and numbers given here as indicative rather than fixed.

Please keep reading my blogs; it only encourages me to post more such content. You can find me on LinkedIn by clicking here and on Medium here. Happy Learning :)

Awarded consecutively as a "Data Superhero" by Snowflake for 2024 & 2023. Links: here

Disclaimer: The views expressed here are mine alone and do not necessarily reflect the view of my current, former, or future employers.

--

Somen Swain
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Snowflake Data Superhero 2024 & 2023 | 4X SnowPro Certified | AWS Solutions Architect Associate | Cloud Computing | Principal, Data Engineering at LTIMindtree