Snowpark-Native Linear Regression Modeling

Published in Cloud Villains, Jan 26, 2024

In my previous blog post, we built an annual income prediction model mainly with sklearn. We queried the data through Snowpark, but the modeling itself wasn’t Snowpark-native. What a shame!

In honor of Snowpark ML Modeling going GA, let’s build a linear regression model the Snowpark-native way. It should be a pleasant journey. Just follow me!

We are going to use the same dataset and ML algorithm as in my previous post, so don’t be afraid: everything is the same except the code. To make the comparison easier to follow, I provide the code for both Snowpark and sklearn.

This post shows how to use Snowpark for ML modeling. It consists of:

Creating Snowflake Session
Reading a Table
One-Hot Encoding
Fitting a Linear Regression Model
Predicting Annual Incomes

Creating Snowflake Session

We can connect to Snowflake just as in my previous post. Don’t forget to replace xxxxxxxxxxxxxxxx with your own config parameters.

from snowflake.snowpark.session import Session
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.modeling.preprocessing import OneHotEncoder


# Session
connection_parameters = {
    "account": "xxxxxxxxxxxxxxxx",
    "user": "xxxxxxxxxxxxxxxx",
    "password": "xxxxxxxxxxxxxxxx",
    "role": "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "SNOWPARK_ML_MODELING",
    "schema": "PUBLIC"
}

session = Session.builder.configs(connection_parameters).create()
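If you would rather not keep credentials in the source file, one option is to read them from environment variables. The following is only a minimal sketch: the variable names (SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD) and the helper load_connection_parameters are my own assumptions, not part of Snowpark.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
}


def load_connection_parameters(env=os.environ):
    """Build the connection dict from environment variables."""
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    params = {key: env[var] for key, var in REQUIRED_VARS.items()}
    # Same non-secret settings as in the connection_parameters dict above.
    params.update(
        role="ACCOUNTADMIN",
        warehouse="COMPUTE_WH",
        database="SNOWPARK_ML_MODELING",
        schema="PUBLIC",
    )
    return params


# Usage (needs the variables to be set and a reachable Snowflake account):
# session = Session.builder.configs(load_connection_parameters()).create()
```

The resulting dict has the same keys as the hard-coded one, so it can be passed to Session.builder.configs unchanged.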

Reading a Table

Within the context of SNOWPARK_ML_MODELING.PUBLIC, you can read the NYC_ZIP_INCOME table.

df = session.table("NYC_ZIP_INCOME")
df.show()

One-Hot Encoding

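snowflake.ml.modeling mirrors the scikit-learn namespace, so the class names will look familiar. The arguments, however, differ quite a bit, and OneHotEncoder is no exception: the Snowpark version takes input and output column names, while sklearn works on the arrays you pass in. That leads to a different data transformation flow, as shown below.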

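# snowflake.ml.modeling
# One-hot encode ZIP; drop_input_cols=True removes the original ZIP column
ohe = OneHotEncoder(input_cols = ['ZIP'], output_cols = ['ZIP_OHE'], drop_input_cols=True)
transformed_df = ohe.fit(df).transform(df)
# Assumes the label (ANNUAL_INCOME) is the last column of transformed_df
input_columns = transformed_df.columns[:-1]
label_columns = transformed_df.columns[-1]
output_columns = ['PREDICTED_ANNUAL_INCOME']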

############################################################################

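# sklearn
# Assumes df is a pandas DataFrame here (e.g. the Snowpark table converted with to_pandas())
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = df[['ZIP', 'YEAR']]
y = df['ANNUAL_INCOME']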

ohe = OneHotEncoder()
ohe.fit(X[['ZIP']])
ohe_zip = ohe.transform(X[['ZIP']]).toarray()
transformed_df = pd.DataFrame(
    ohe_zip, 
    columns=ohe.get_feature_names_out()
)

X = pd.concat([transformed_df, df['YEAR']], axis=1)
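Both encoders do conceptually the same thing: each distinct ZIP code becomes its own 0/1 column. Here is a library-free sketch of that idea, just for intuition. It is not how either library implements it, the ZIP codes are made up, and the "ZIP_" column prefix is only illustrative.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


# Made-up ZIP codes, for illustration only.
zips = ["10001", "10002", "10001", "10003"]
categories, rows = one_hot_encode(zips)

print(["ZIP_" + c for c in categories])
for zip_code, row in zip(zips, rows):
    print(zip_code, row)
```

Every row contains exactly one 1, which is why the model ends up learning a separate offset per ZIP code.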

Fitting a Linear Regression Model

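Fitting a linear regression model isn’t that different. The Snowpark version takes the input, label, and output column names in the constructor and is fitted on the whole DataFrame, while sklearn takes X and y in fit. It’s easy going.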

# snowflake.ml.modeling
regr = LinearRegression(
    input_cols = input_columns,
    label_cols = label_columns,
    output_cols = output_columns
)
regr.fit(transformed_df)

#############################################################################

# sklearn
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(X, y)
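If you are curious what fitting boils down to, here is a minimal sketch of ordinary least squares for a single numeric feature such as YEAR. It uses the textbook closed form rather than either library's actual solver, the helper fit_line is hypothetical, and the income figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up yearly incomes for a single ZIP code, for illustration only.
years = [2018, 2019, 2020, 2021]
incomes = [50000.0, 52000.0, 54000.0, 56000.0]

slope, intercept = fit_line(years, incomes)
predicted_2025 = slope * 2025 + intercept
print(slope, predicted_2025)
```

The real model has one extra coefficient per one-hot encoded ZIP column, but the principle of minimizing the squared error is the same.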

Predicting Annual Incomes

The inference function is where the two differ most. The biggest difference is that ohe.transform expects a Snowpark DataFrame as input. So, if you are using Snowpark, call session.create_dataframe first.

# snowflake.ml.modeling
def annual_income_predictor(zip_code, year):

    # create a Snowpark DataFrame (named input_df to avoid shadowing the built-in input)
    input_df = session.create_dataframe([(str(zip_code), year)], schema = ['ZIP', 'YEAR'])

    # one-hot encode the input with the fitted encoder
    transformed_input = ohe.transform(input_df)

    # predict
    prediction = regr.predict(transformed_input)

    # show() prints the prediction and returns None
    return prediction.select('PREDICTED_ANNUAL_INCOME').show()

# result = 3045585629.5694733
annual_income_predictor(10001, 2025)

#############################################################################

# sklearn
def annual_income_predictor(zip_code, year):
    # one-hot encode the zip code
    encoded_zip = pd.DataFrame(ohe.transform([[str(zip_code)]]).toarray())
    # concat the one-hot encoded zip code with the year
    features = pd.concat([encoded_zip, pd.DataFrame([year])], axis=1)
    # predict returns a one-element array; take its only element
    return float(regr.predict(features)[0])

# result = 3045585629.5694885
annual_income_predictor(10001, 2025)

Snowpark ML and sklearn produce almost identical results (the two predictions agree to the fourth decimal place), which suggests our Snowpark implementation works as intended.
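To put a number on "almost", we can compare the two predictions with a tolerance instead of eyeballing the digits. This is a small sketch using the two values reported above; the rel_tol threshold is my own assumption, not something either library prescribes.

```python
import math

snowpark_result = 3045585629.5694733  # from the Snowpark ML run above
sklearn_result = 3045585629.5694885   # from the sklearn run above

abs_diff = abs(snowpark_result - sklearn_result)
rel_diff = abs_diff / abs(sklearn_result)

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3e}")

# rel_tol is an assumed threshold, not one taken from either library.
results_match = math.isclose(snowpark_result, sklearn_result, rel_tol=1e-9)
print(results_match)
```

A gap this small is what you would expect from floating-point rounding rather than from a modeling difference.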

Summary

We compared snowflake.ml.modeling with sklearn in two respects: the coding style and its ramifications. The two share the same namespace, but the differing arguments lead to different approaches to data transformation. Even so, Snowpark, as Snowflake’s own framework, has benefits in performance and integration.