Optimized Hyperparameter Tuning with Snowflake

Chase Romano
4 min read · Jul 6, 2023

THIS STORY IS KIND OF USELESS NOW! Snowflake now has built-in distributed hyperparameter tuning with snowflake-ml-python! Check out how to master this here.

Hyperparameter tuning can be a long and tedious process with many options. Do you use a full grid search, a random search, Bayesian optimization, and how many cross-validation folds? If you are on Snowflake, you also have to decide on a warehouse size (x-small, medium, etc.), a warehouse type (Standard or Snowpark-optimized), and whether to train on a single node or parallelize across nodes with a UDTF. I have experimented with all of these scenarios to compare their speed and performance.

In my previous article, I showcased how to use a UDTF to parallelize model tuning. Since then, Snowflake has introduced Snowpark-optimized warehouses along with multi-threaded model training in stored procedures and UDTFs. Let us start with single-node training on a standard warehouse.

The Data

We will use make_classification to create a 40,000-record dataset with 6 features. I like make_classification because it lets us generate datasets of arbitrary shape, so we can replicate many combinations of feature and record counts.
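For context, a minimal sketch of the dataset setup looks like this; the random seed and column names are illustrative assumptions, not the project's exact code.

```python
from sklearn.datasets import make_classification
import pandas as pd

# Generate 40,000 records with 6 features, as used in this experiment.
# The seed and column names below are illustrative.
X, y = make_classification(n_samples=40_000, n_features=6, random_state=42)

df = pd.DataFrame(X, columns=[f"FEATURE_{i}" for i in range(6)])
df["TARGET"] = y
```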

Stored Procedure with Standard X-Small Warehouse

In this example we will use a Random Forest Classifier with grid-search cross-validation. There are 288 combinations of hyperparameters and 3 cross-validation folds, for a total of 864 model fits.
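The stored procedure body is essentially standard scikit-learn tuning code. Here is a hedged sketch of what it might look like; the original project's exact grid is not shown above, so the grid below is an illustrative one that also multiplies out to 288 combinations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# An illustrative grid: 4 * 6 * 2 * 6 = 288 combinations.
param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [4, 6, 8, 10, 12, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 4, 8, 16, 32, 64],
}

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 288 combinations * 3 folds = 864 fits. n_jobs=-1 uses every thread
# available on the single warehouse node running the stored procedure.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```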

We reach 95.17% accuracy, but this took 46 minutes 40 seconds on an x-small warehouse. Although this is fairly cheap (less than 1 credit, since an x-small bills 1 credit per hour), that is a long time to sit around and wait. Let us see if changing to a Snowpark-optimized warehouse helps.

Stored Procedure with Snowpark-optimized Warehouse

Snowpark-optimized warehouses start at a medium size, and although a medium is still a single node (node count increases with size), it offers 16x the memory per node of a standard warehouse.
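Creating and switching to one from an existing Snowpark session looks roughly like this; the warehouse name is hypothetical.

```python
# Assumes an existing snowflake.snowpark Session named `session`.
# The warehouse name TUNING_WH is hypothetical.
session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS TUNING_WH
        WAREHOUSE_SIZE = 'MEDIUM'
        WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
""").collect()
session.use_warehouse("TUNING_WH")
```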

We get similar accuracy, but the execution time is now only 11 minutes 50 seconds compared to the former 46 minutes 40 seconds, roughly a 4x improvement. Some users assume that increasing the warehouse size will always decrease the runtime. Since stored procedures run on a single node, scaling up to a large should not, in theory, decrease the runtime, but let us test it to see for ourselves.

As expected, the runtime is the same to within a second.

User Defined Table Function with Medium Standard Warehouse

UDTFs are a great way to parallelize model training across the nodes of a warehouse. A medium standard warehouse has 4x the nodes of an x-small, and a UDTF can utilize all of the available nodes and threads.
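The pattern, roughly: materialize a table of candidate hyperparameter sets, join it to the training data, and partition the UDTF call by candidate so each node fits its share in parallel. The sketch below is a simplified, hypothetical version of that idea (class and column names are illustrative), not the project's exact code.

```python
import json
from snowflake.snowpark.types import (
    ArrayType, FloatType, IntegerType, StringType, StructField, StructType,
)

class TuneRF:
    """Each partition receives all rows for one hyperparameter candidate."""

    def __init__(self):
        self.params = None
        self.features = []
        self.labels = []

    def process(self, params: str, features: list, label: int):
        # Every row in a partition carries the same JSON-encoded candidate.
        self.params = params
        self.features.append(features)
        self.labels.append(label)

    def end_partition(self):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Fit and score this partition's candidate with 3-fold CV.
        model = RandomForestClassifier(**json.loads(self.params))
        score = cross_val_score(model, self.features, self.labels, cv=3).mean()
        yield (self.params, float(score))

tune_rf = session.udtf.register(
    TuneRF,
    output_schema=StructType([
        StructField("PARAMS", StringType()),
        StructField("SCORE", FloatType()),
    ]),
    input_types=[StringType(), ArrayType(FloatType()), IntegerType()],
    packages=["scikit-learn"],
    name="tune_rf",
    replace=True,
)
```

Invoking it partitioned by the candidate column, e.g. `tune_rf(...).over(partition_by="PARAMS")` inside a `join_table_function` call, is what spreads the work across the nodes.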

These results show similar accuracy with roughly a 15% decrease in execution time, down to 10 minutes 11 seconds, and on a less expensive warehouse, since a standard warehouse was sufficient. To see the parallelization, we can look at the query profile. The profile overview shows the Python invocations that ran across the nodes and the Python sandbox's max memory usage, which tells us a standard warehouse is adequate here. If memory usage climbs on larger datasets, we can switch over to a Snowpark-optimized warehouse.

User Defined Table Function with 2XL Standard Warehouse for Increased Parallelization

Since UDTFs can utilize all nodes, scaling up speeds up execution. Running the same UDTF code on a 2XL standard warehouse gives us faster results.
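Because the UDTF code itself does not change, scaling up is just a warehouse resize before re-running the query (warehouse name again hypothetical):

```python
# A 2XL standard warehouse has 32 nodes versus 4 on a medium, so eight
# times as many partitions can run concurrently.
session.sql("ALTER WAREHOUSE TUNING_WH SET WAREHOUSE_SIZE = '2X-LARGE'").collect()
```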

We again get similar accuracy, but the run finishes in only 3 minutes 51 seconds and costs ~2 credits (a 2XL standard warehouse bills 32 credits per hour: 32 × 231 s / 3600 s ≈ 2.05).

Conclusion, Results & Code

There are many ways to optimize your model training and hyperparameter tuning, and whether you want to prioritize speed or cost, Snowflake gives you the ability to do either. In this example, the cheapest scenario was the UDTF on a medium standard warehouse, which cost only 0.68 credits (4 credits per hour × 611 seconds) and finished in 10 minutes 11 seconds. The fastest approach was the UDTF on a 2XL, finishing in 3 minutes 51 seconds at ~2 credits.

The entire project's code, presented as a Hex project, can be found below.

Link to full code
