Linear Regression with Snowflake and SKLearn

Soonmo Seong
Cloud Villains
Published in
3 min readJan 24, 2024

We are going to develop a linear regression model that predicts total annual income of people living in NYC. You can implement ML model building workflow in Snowflake after reading this step by step blog. In detail, this story consists of

  • Getting Data from Snowflake Marketplace
  • Querying Data in Snowsight
  • Connecting Snowflake through Snowpark
  • Training a Model and Inferencing

Getting Data from Snowflake Marketplace

Let’s take an advantage of free datasets in Snowflake Marketplace, which is a useful place . Cybersyn provides free datasets for annual income by ZIP code.

Go to Marketplace and search Cybersyn. You can see US Housing & Real Estate Essentials. After clicking that, you can get the data we are looking for from the middle right side.

Querying Data in Snowsight

Just one dataset isn’t enough to make a model that we want to develop. So, below SQL code gives us a base dataset for model training. Run this code in Snowsight SQL Worksheet.

use database us_housing__real_estate_essentials;
use schema cybersyn;

-- create new york city incom data
---------------------------------------------------------
create or replace database snowpark_ml_modeling;
create or replace table SNOWPARK_ML_MODELING.PUBLIC.nyc_zip_income
(
zip varchar,
year number,
annual_income number
)
as with nyc_zip as (
select id_zip, city, state, zip
from us_housing__real_estate_essentials.cybersyn.us_addresses
where state = 'NY'
and city = 'New York'
group by id_zip, city, state, zip
)
select zip, year(date), value
from us_housing__real_estate_essentials.cybersyn.irs_individual_income_timeseries ii
join nyc_zip on (nyc_zip.id_zip = ii.geo_id)
where variable = 'Total_income_-_Amount,_AGI_bin:_Total'
;

select * from snowpark_ml_modeling.public.nyc_zip_income
order by zip, year
;

You can preview this dataset as we queried.

Connecting Snowflake through Snowpark

Snowpark is a instrumental tool while building machine learning system because it supports Python. Python Snowpark is mainly used nexst sections.

All the work in Snowpark start from Session. So we create a session with connection_parameters. Please replace XXXXXXXXXXXXX with your own account, user, and password.

from snowflake.snowpark.session import Session
from snowflake.snowpark.types import IntegerType, FloatType
from snowflake.snowpark.functions import avg, sum, col, udf, call_udf, call_builtin, year, month
import streamlit as st
import pandas as pd

# scikit-learn (install: pip install -U scikit-learn)
from sklearn.linear_model import LinearRegression

# Session
connection_parameters = {
"account": "XXXXXXXXXXXXX",
"user": "XXXXXXXXXXXXX",
"password": "XXXXXXXXXXXXX",
"role": "ACCOUNTADMIN",
"warehouse": "COMPUTE_WH",
"database": "SNOWPARK_ML_MODELING",
"schema": "PUBLIC"
}

session = Session.builder.configs(connection_parameters).create()

Let’s query the base dataset that we made above. You will see you are connected to Snowflake sucessfully.

session.table("NYC_ZIP_INCOME").show()

Training a Model

Feature Engineering: One hot encoding

ZIP code looks like numbers but it isn’t. So the notorious feature is one-hot encoded to help ML model understand.

df = session.table("NYC_ZIP_INCOME").to_pandas()

# split X and y for training a regression model
X = df[['ZIP', 'YEAR']]
y = df['ANNUAL_INCOME']

# one hot encoding for zip code because it's not a number
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(X[['ZIP']])
ohe_zip = ohe.transform(X[['ZIP']]).toarray()
transformed_df = pd.DataFrame(
ohe_zip,
columns=ohe.get_feature_names_out()
)

X = pd.concat([transformed_df, df['YEAR']], axis=1)

Model Training

Linear regression model is trained using sklearn .

# fit the regression model
regr = LinearRegression()
regr.fit(X, y)

Inferencing

Inference function is below. We should transform the input data such as zip_code and year as we transformed the training dataset. The 2025 total income of people living in ZIP code 10001 is about USD 3,045,585,620.

# create a prediction function
def annual_income_predictor(zip_code, year):
# one hot encoding for zip code
input = pd.DataFrame(ohe.transform([[str(zip_code)]]).toarray())
# concat one hot encoded zip code with year
input = pd.concat([input, pd.DataFrame([year])], axis=1)
return float(regr.predict(input))
# result = 3045585629.5694885
annual_income_predictor(10001, 2025)

Under Snowflake Data Cloud Platform, we simply pass through how to create a ML model with sklearn and snowpark.

--

--