End-to-end Snowflake unified analytics platform from Snowpipe to Snowpark ML

This blog explains why and how Skinny Mobile chose Snowflake to build its end-to-end marketing analytics solution, focusing primarily on the adoption of Snowpark ML for the development and deployment of its machine learning models.

This blog consists of four main parts:

1- The WHY

2- The ADVANTAGES this unified approach provides

3- The HOW giving a high-level walkthrough of the ML process using Snowpark ML

4- CONCLUSION

Why create machine learning models on Snowpark ML?

At Skinny Mobile, a division of Spark NZ, prioritising customer satisfaction is paramount. Consequently, we have undertaken the creation of machine learning models utilising Snowpark ML to better understand our customers’ needs and preferences. This empowers Skinny Mobile marketers to proactively present timely offers via our advanced decisioning engine, thereby enhancing customers’ satisfaction.

What are the advantages?

Forecasting likely behaviours to optimise service delivery is not a recent development at Spark NZ. What sets Skinny Mobile apart is the innovative technology employed to build and implement predictive models. We envisioned a simplified end-to-end data analytics solution powered within a unified environment on Snowflake.

This unified environment is illustrated in the high-level diagram below

End-to-end Snowflake unified analytics platform

Traditionally, Spark NZ follows a process of extracting data from Snowflake, with data scientists relying on virtual machines for processing. Model deployment is carried out using Docker containers, and additional data is sourced from Snowflake. While effective, this approach involves unnecessary data transfers, leading to increased costs and complexity.

The adoption of Snowflake as a unified analytics platform enables us to use Snowpipe for ingesting over a billion rows daily, coupled with Snowflake’s powerful compute for data engineering.

Finally, this data is consumed by Snowpark ML: model training and inference take place through its APIs, leveraging Snowflake’s high-performance, scalable computing capabilities. Queries are executed lazily, improving performance, and the absence of data movement enhances governance. This streamlined methodology eliminates the need for intricate deployment pipelines and additional computing resources, resulting in significant cost and time savings.

The following diagram illustrates the Snowpark ML engineering process mentioned above

Snowpark ML process high-level overview

This integrated approach optimises efficiency and enhances the overall analytics process for Skinny Mobile.

What does the end-to-end ML process look like using Snowpark ML?

Let’s now go through the end-to-end machine learning process using Snowpark ML. In the high-level diagram at the beginning of each section, the relevant stage of the process is circled in orange.

Connecting to Snowflake & sourcing data

Step 1 — Connecting to Snowflake & sourcing data

In the first step, we need to source the data from Snowflake. The setup is easy: we create a config file with connection parameters that point to the user’s role, database, and schema. Once the session is initiated, we can pull the data into a Snowpark DataFrame using familiar SQL syntax.

Note that the data stays in Snowpark DataFrames, and all the data wrangling is pushed down to Snowflake compute.
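To make this concrete, here is a minimal sketch of the setup; the connection parameters and the CUSTOMER_FEATURES table are placeholders standing in for our actual configuration:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters; in practice these come from a config file
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

# Initiate the session; all subsequent queries run on Snowflake compute
session = Session.builder.configs(connection_parameters).create()

# Pull the data into a Snowpark DataFrame using familiar SQL syntax.
# Execution is lazy: nothing is materialised locally until an action is called.
df = session.sql("SELECT * FROM CUSTOMER_FEATURES")
```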

Preparing & transforming the data

Step 2 — Preparing & transforming the data

If needed, we can proceed with data transformation on Snowpark DataFrames. In the code snippet below, we are looking for duplicates. Logging into Snowsight and looking through the query history, we can see the auto-generated SQL query and a visual display of its profile. This is the case with everything we run in Snowpark for Python.
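The original snippet was shared as a screenshot; the following is a minimal reconstruction of such a duplicate check, assuming a hypothetical CUSTOMER_ID key column:

```python
from snowflake.snowpark.functions import col

# Count rows per key and keep only keys appearing more than once.
# This compiles to a single GROUP BY ... HAVING query on Snowflake compute.
duplicates = (
    df.group_by("CUSTOMER_ID")
      .count()                     # produces a COUNT column per key
      .filter(col("COUNT") > 1)
)
duplicates.show()
```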

Snowpark ML also has its own libraries for data pre-processing. Data scientists often need to do one-hot encoding, which essentially means transforming categorical data such as occupation into binary variables (0, 1). With the snowflake.ml preprocessing library, this happens directly on Snowflake compute and is hence very fast.

Similarly, data scientists can standardise the numeric variables using Snowflake’s ML library. In this example, we are using a MinMax scaler, and the query execution is pushed down to Snowflake compute. A combined sketch of both preprocessing steps follows the captions below.

One hot encoding (dummy variables)
Normalisation
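The screenshots are not reproduced here, so below is a minimal sketch of both preprocessing steps. It assumes the snowflake.ml.modeling.preprocessing module path of recent snowflake-ml-python releases, and the OCCUPATION and MONTHLY_SPEND columns are hypothetical:

```python
from snowflake.ml.modeling.preprocessing import MinMaxScaler, OneHotEncoder

# One-hot encode the categorical occupation column into binary (0/1) columns
ohe = OneHotEncoder(
    input_cols=["OCCUPATION"],
    output_cols=["OCCUPATION_OHE"],
    drop_input_cols=True,
)
df = ohe.fit(df).transform(df)

# Standardise a numeric column into the [0, 1] range; both steps run on Snowflake compute
scaler = MinMaxScaler(
    input_cols=["MONTHLY_SPEND"],
    output_cols=["MONTHLY_SPEND_SCALED"],
)
df = scaler.fit(df).transform(df)
```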

Model Training

Step 3 — Model training

After the data has been prepared and transformed, we can move on to modelling. In this case, we decided to use XGBoost classification. Again, there is a snowflake.ml.modeling library available, which ensures all the heavy lifting is pushed down to Snowflake’s compute.
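As a minimal sketch of this step, assuming the snowflake.ml.modeling.xgboost module and hypothetical feature and label columns:

```python
from snowflake.ml.modeling.xgboost import XGBClassifier

# Hypothetical column names, for illustration only
FEATURE_COLS = ["FEATURE_1", "FEATURE_2", "FEATURE_3"]
LABEL_COL = "CHURNED"

clf = XGBClassifier(
    input_cols=FEATURE_COLS,
    label_cols=[LABEL_COL],
    output_cols=["PREDICTION"],
)

# Hold out a test set, then train; the heavy lifting happens on Snowflake compute
train_df, test_df = df.random_split(weights=[0.8, 0.2], seed=42)
clf.fit(train_df)
```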

Once the model has been trained, we can move on to scoring it. In our case, we looked at the model’s precision, recall, F1 score, and ROC AUC score to ensure the model performs well.

XGBoost via snowflake.ml
Model performance via snowflake.ml
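A sketch of the scoring step, assuming the snowflake.ml.modeling.metrics module and the column names introduced above:

```python
from snowflake.ml.modeling.metrics import f1_score, precision_score, recall_score

# Score the hold-out set; predictions are computed on Snowflake compute
predictions = clf.predict(test_df)

# Each metric call compiles to a query over the predictions DataFrame
precision = precision_score(df=predictions, y_true_col_names=LABEL_COL, y_pred_col_names="PREDICTION")
recall = recall_score(df=predictions, y_true_col_names=LABEL_COL, y_pred_col_names="PREDICTION")
f1 = f1_score(df=predictions, y_true_col_names=LABEL_COL, y_pred_col_names="PREDICTION")
# ROC AUC is computed analogously with roc_auc_score over predict_proba output
print(precision, recall, f1)
```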

Model Training using Gridsearch

Often, the first iteration of a model we build is not the best possible model. In this case, grid search comes to the rescue, providing a way to automatically search for the best set of parameters for the model. In our case, we tested 3 different learning rates and 3 different numbers of estimators, which are essentially the number of trees the model is built on. This is computationally intense, but because we are using the snowflake.ml.modeling library, the work is pushed down to Snowflake and can be parallelised, and we can even increase the compute to obtain the output in a reasonable time. The best model can then be selected for registering.

GridSearchCV is computationally expensive, but the work is pushed down to Snowflake compute, which can easily be scaled up by choosing larger virtual warehouses or utilising Snowpark-optimised warehouses.

Optimal model
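A minimal sketch of the search, assuming the snowflake.ml.modeling.model_selection.GridSearchCV wrapper; the 3 x 3 grid mirrors the search described above, though the values here are illustrative:

```python
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBClassifier

# 3 learning rates x 3 tree counts = 9 candidate models, trained in parallel on Snowflake
grid = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid={
        "learning_rate": [0.01, 0.1, 0.3],  # illustrative values
        "n_estimators": [100, 200, 300],    # number of trees in the model
    },
    input_cols=FEATURE_COLS,
    label_cols=[LABEL_COL],
    output_cols=["PREDICTION"],
)
grid.fit(train_df)

# Extract the best fitted estimator for registration
best_model = grid.to_sklearn().best_estimator_
```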

Model registry

Step 4 — Model registry

After the optimal model has been selected, it can be logged into the model registry. The registry entry can contain information such as the model name, version, description, sample data (in our case, the first 100 rows), and model metrics, which in this case is the ROC AUC score.

Once the model has been registered, it can be seen in the Snowflake database and schema that were selected in the code. We can see information about model artifacts, deployments, model metadata, and registered models.

Model registry in Snowflake
Model Metadata
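Our registration was done with the preview-era registry API, and its screenshots are not reproduced here. The sketch below uses the current snowflake.ml.registry.Registry API instead, so the calls differ in detail from what we ran at the time; database, schema, model name, and metric value are all placeholders:

```python
from snowflake.ml.registry import Registry

# Open (or create) a registry in a chosen database and schema (placeholder names)
registry = Registry(session=session, database_name="ML_DB", schema_name="REGISTRY")

# Log the model with a version, description, and sample input data (first 100 rows)
model_version = registry.log_model(
    best_model,
    model_name="CHURN_XGB",
    version_name="V1",
    comment="XGBoost churn classifier tuned via grid search",
    sample_input_data=train_df.select(FEATURE_COLS).limit(100),
)

# Attach the model metric; the value here is illustrative
model_version.set_metric("roc_auc", 0.87)
```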

Model deployment

Step 5 — Model deployment

The model can be pushed down for inference as a user-defined function (UDF) in Snowflake through a simple .deploy function. This UDF can then simply be called.

Model deployment is the engineering task of exposing an ML model to the rest of the world.

And finally, the user-defined function containing our optimal model can be used for prediction at any time via the .predict function. This gives fully parallelised, scalable inference that is tracked in the registry and runs on Snowflake compute without data leaving the platform.

Model inference is the process of running live data through a trained model to make a prediction or solve a task.
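The .deploy and .predict calls above reflect the preview-era registry API we used. For reference, with the current snowflake.ml.registry.Registry API the registered model version can be invoked directly; a minimal sketch, reusing the placeholder names from the registration step:

```python
# Retrieve the registered version and run batch inference on Snowflake compute
mv = registry.get_model("CHURN_XGB").version("V1")
scored = mv.run(test_df, function_name="predict")
scored.show()
```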

With this, we have gone through the whole process of modelling on Snowpark ML using the latest libraries available.

To Conclude

Overall, it has been quite straightforward to develop, train, and deploy our machine learning models using Snowpark ML.

Initially, we faced performance issues because we were using a standard Snowflake warehouse and running into out-of-memory errors, which was a blocker for using Snowpark ML for our machine learning workload.

Fortunately, Snowflake provided the necessary support and advised us to utilise a Snowpark-optimised warehouse instead, which fixed our issue. This allowed us to continue confidently with our vision of a unified data analytics solution.

As a next step, and to complete our vision, we will be leveraging the Snowflake containers private preview with the aim of running our dbt feature engineering pipeline.

Finally, it’s important to emphasise that realising this vision would not have been possible without the assistance of many individuals across various levels and teams within Spark NZ. Special acknowledgment goes to the Skinny Mobile marketing team, which let us explore the latest technology provided by Snowflake.
