Feature Engineering in Snowflake

In a previous story, I demonstrated that a true cloud data warehouse is capable of handling common machine learning tasks for structured data, like training tree-based models.

The gist of my argument was that if the architecture is right and the use case is common, then you shouldn’t need to transfer the data out of your database into a general-purpose compute cluster. As a Snowflake user, your analytics workloads can take advantage of its micro-partitioning to prune away a lot of the processing, and the warmed-up, per-second-billed compute clusters are ready to step in for very short but heavy number-crunching tasks.

However, it’s often said that 80% of a data scientist’s time is spent on feature engineering, and it’s almost always a requirement before you can train a model. Can this part be done in-database as well?

What is feature engineering?

Feature engineering is about transforming the input data so that it can be used by a machine learning algorithm. There is a lot of contextual meaning behind data that humans intuitively interpret using our vast experience and ability to generalize our learning. But despite some of the marketing material that’s out there, machines aren’t able to do this on their own just yet.

There are several reasons why some ML algorithms require their inputs to be “engineered”, but to illustrate with an example: let’s say two of your input variables are salary and age, and your target variable is the likelihood to purchase a product. Salary might range between $30,000 and $200,000 in your dataset, and age could be between 10 and 90. You the reader probably understand that an age difference of 20 years is a lot more significant than a salary difference of $20, but to an algorithm they are just numbers it needs to fit a curve through. So to do the best job, the computer needs you to compensate for this, which in this case you could do by scaling (e.g. making salary and age both range between 0.0 and 1.0). You can also use techniques like binning (aka bucketing) to place values into one of a fixed number of value ranges (e.g. salary bands 0 to 6).
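To make that concrete, here’s a rough Python sketch of both techniques. The salary band edges are made up purely for illustration:

```python
def min_max_scale(value, lo, hi):
    """Rescale a value from the range [lo, hi] into [0.0, 1.0]."""
    return (value - lo) / (hi - lo)

def to_band(value, edges):
    """Return the index of the first band whose upper edge covers the value."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)  # anything above the last edge lands in the top band

# Salary and age land on a comparable 0.0-1.0 scale:
salary_scaled = min_max_scale(50_000, 30_000, 200_000)  # ~0.118
age_scaled = min_max_scale(30, 10, 90)                  # 0.25

# Binning: salary bands 0 to 6 using made-up edges
salary_band = to_band(50_000, [40_000, 60_000, 80_000, 100_000, 130_000, 160_000])  # band 1
```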

Implementing in-database

When you look at something like scikit-learn’s preprocessing libraries, the prospect of implementing them from scratch to run in-database may seem quite daunting, conjuring up images of thousands of lines of code dealing with low level numeric operations.

The reason I don’t find it daunting is that most of it is not actually from scratch. Conceptually, it’s more like implementing the top parts of a pyramid where a lot of the base level building blocks are already in place. For example, in my decision tree algorithm I used standard deviation reduction to calculate the tree splits. But I didn’t have to implement standard deviation from scratch; that’s a built-in Snowflake function, I just needed to plug it in at the right place. And it’s not just the aggregate functions that are useful; the window functions help with breaking up datasets, and the table functions are great for slicing and dicing.

The same principle should apply to feature engineering. And if you do find a missing building block or just want to wrap things up all nice and tidy, you can always be a good engineer and solve it in isolation as a Snowflake function that you then call as needed. Read on, and you’ll see I do this several times.

In this story, I’m going to work my way through the full list of scikit-learn preprocessing functions and show an equivalent Snowflake solution, using the sample data available to all Snowflake accounts.

Unless otherwise specified, the database is “SNOWFLAKE_SAMPLE_DATA” and the schema is “TPCDS_SF10TCL”.

If you’re running these queries in Tableau Desktop like I did, make sure to add a sample(500 rows) at the end, as some of these tables have billions of rows that will really crowd a chart.

preprocessing.Binarizer

Binarize data (set feature values to 0 or 1) according to a threshold

A nice easy one as a warm-up exercise: let’s binarize the customer_demographics dependents count. Assume all that matters is whether you have dependents, not how many.

Before
select iff(CD_DEP_COUNT > 0, true, false) as CD_DEPENDANTS
from CUSTOMER_DEMOGRAPHICS
After
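For reference, the same threshold rule outside the database is a one-liner; a minimal Python sketch with the general threshold parameter that scikit-learn’s version exposes:

```python
def binarize(value, threshold=0):
    """Return 1 if the value is above the threshold, else 0."""
    return 1 if value > threshold else 0

dep_counts = [0, 1, 3, 0, 2]
binarized = [binarize(x) for x in dep_counts]  # [0, 1, 1, 0, 1]
```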

preprocessing.FunctionTransformer

Constructs a transformer from an arbitrary callable.

This one just lets you BYO transform function, so it’s not really an algorithm as such. We know we can do the same with a stored proc or function.

preprocessing.KBinsDiscretizer

Bin continuous data into intervals.

There are a couple of choices to make here in the scikit-learn function. The encode parameter can be ‘onehot’, ‘onehot-dense’, or ‘ordinal’. I’ll go the easy route with ordinal, but don’t worry: we have to implement one-hot encoding later as its own function.

Similarly, I’ll take the ‘uniform’ strategy for simplicity, but I plan to tackle k-means in a future post.

Here’s the SS_LIST_PRICE from STORE_SALES:

Before

Placed into 20 equal width bins:

with aggregates as (
select min(SS_LIST_PRICE) as min_price,
max(SS_LIST_PRICE) as max_price
from STORE_SALES)
select width_bucket(SS_LIST_PRICE, min_price, max_price, 20)
from aggregates,STORE_SALES
limit 100
After
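If you’re curious what WIDTH_BUCKET is actually doing, it’s simple arithmetic. A rough Python equivalent of its documented behaviour (values below the range go to bucket 0, values at or above the maximum go to bucket n+1):

```python
import math

def width_bucket(value, min_val, max_val, num_buckets):
    """Equal-width binning, approximating SQL's WIDTH_BUCKET semantics."""
    if value < min_val:
        return 0
    if value >= max_val:
        return num_buckets + 1
    width = (max_val - min_val) / num_buckets
    return int(math.floor((value - min_val) / width)) + 1
```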

And here’s what the bins look like:

preprocessing.KernelCenterer

Center a kernel matrix

I don’t know about you, but I don’t often encounter kernel matrices in the context of relational databases, so it’s difficult to imagine how this could apply. Post a comment if you think I’ve let myself off the hook too easily here.

preprocessing.LabelBinarizer

Binarize labels in a one-vs-all fashion

This is so similar to OneHotEncoder that I won’t repeat myself here.

preprocessing.LabelEncoder

Encode labels with value between 0 and n_classes-1.

Here we make use of some of the cool array functions in Snowflake, pushing all the distinct values into an array with array_agg so that we can use the array_position function to encode the label.

Before
with distinct_values as (
select array_agg(distinct cd_marital_status) as marital_status_array from CUSTOMER_DEMOGRAPHICS)
select array_position(cd_marital_status::variant,marital_status_array)
from distinct_values,CUSTOMER_DEMOGRAPHICS
After
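The same idea in Python, for comparison. I sort the distinct labels here for a stable mapping; the array_agg version’s ordering is whatever Snowflake happens to return:

```python
def label_encode(values):
    """Map each label to an integer between 0 and n_classes - 1."""
    labels = sorted(set(values))  # the SQL version uses array_agg(distinct ...)
    index = {label: i for i, label in enumerate(labels)}
    return [index[v] for v in values]

statuses = ["S", "M", "D", "M", "S"]
encoded = label_encode(statuses)  # D=0, M=1, S=2 -> [2, 1, 0, 1, 2]
```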

preprocessing.MultiLabelBinarizer

Transform between iterable of iterables and a multilabel format.

The only way this would really apply is if we were storing multi-label data using Snowflake arrays. I doubt this would be common, but it’s not a crazy idea either. I’ll create an example to match the second one in the docs.

Sample table creation:

create or replace temporary table films as (
select array_construct('sci-fi','thriller') as genres
union
select array_construct('comedy'));

select * from films

We’ll also need a simple function that, given a complete array of possible values, tells us whether or not each value was present in a given subset.

create or replace function array_element_matches(MASTER_ARRAY array, ELEMENTS array)
returns array
language javascript
as
$$
return MASTER_ARRAY.map(function(master_element){
  for (var i = 0; i < ELEMENTS.length; i++){
    if (ELEMENTS[i] == master_element){
      return true;
    }
  }
  return false;
});
$$
;

The function is much easier to explain by demonstration:

select array_element_matches(
array_construct('a','b','c','d','e'),
array_construct('b','d')) as test

With the function created, we can go ahead and make a multi-label binarizer for our films. First we flatten all the arrays into a single array and get a list of distinct values.

Then we can use the function to do the binarizing.

with distinct_values as (
select array_agg(distinct value::string) as all_genres_array from films
,lateral flatten(input => genres) g
)
select all_genres_array,genres,array_element_matches(all_genres_array,genres) as genres_mlb
from distinct_values,films

Let me just explain the array_agg + lateral flatten part. When you call array_agg on array values, you get arrays containing arrays (e.g. [ [1,2],[3,4] ]). These are hard to work with, so we needed to flatten them all out into one big single array ([1,2,3,4]).
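The whole pipeline (flatten, collect the distinct labels, binarize each row against them) condenses to a few lines of Python:

```python
def multilabel_binarize(rows):
    """Binarize each row's labels against the full set of labels seen anywhere."""
    # Flatten every row into one distinct, sorted master list
    master = sorted({label for row in rows for label in row})
    # One True/False per master label, per row
    return master, [[label in row for label in master] for row in rows]

films = [["sci-fi", "thriller"], ["comedy"]]
genres, binarized = multilabel_binarize(films)
# genres    -> ['comedy', 'sci-fi', 'thriller']
# binarized -> [[False, True, True], [True, False, False]]
```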

preprocessing.MaxAbsScaler

Scale each feature by its maximum absolute value.

Before

This is as simple as dividing each value by the overall maximum absolute value:

with aggregates as (
select max(abs(WS_LIST_PRICE)) as max_abs_dep
from WEB_SALES)
select WS_LIST_PRICE,max_abs_dep,WS_LIST_PRICE / max_abs_dep AS WS_LIST_PRICE_SCALED
from aggregates,WEB_SALES
After
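Or, in Python terms:

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(v) for v in values)
    return [v / max_abs for v in values]

scaled = max_abs_scale([10.0, -25.0, 50.0])  # [0.2, -0.5, 1.0]
```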

preprocessing.MinMaxScaler

Transforms features by scaling each feature to a given range.

Before

This is similar to MaxAbsScaler, except we use the minimum and maximum values instead.

with aggregates as (
select min(CD_DEP_COUNT) as min_dep,max(CD_DEP_COUNT) as max_dep
from CUSTOMER_DEMOGRAPHICS)
select (CD_DEP_COUNT - min_dep)
/ (max_dep - min_dep)
from aggregates,CUSTOMER_DEMOGRAPHICS
After

preprocessing.Normalizer

Normalize samples individually to unit norm.

There’s a divide by zero safeguard we need for this formula, so let’s wrap it in a function. For this example we’ll use the L2 norm.

create or replace function normalize_scale(x float,y float)
returns float
language javascript
as
$$
var magnitude=Math.sqrt(X**2 + Y**2);
if (magnitude==0){
return 0;
}else{
return X/magnitude;
}
$$
;

Here’s CD_DEP_EMPLOYED_COUNT and CD_DEP_COLLEGE_COUNT from CUSTOMER_DEMOGRAPHICS. They are discrete values rather than continuous, but you get the idea.

Before

Then, to normalize CD_DEP_EMPLOYED_COUNT in relation to itself and CD_DEP_COLLEGE_COUNT:

select 
normalize_scale(CD_DEP_EMPLOYED_COUNT,CD_DEP_COLLEGE_COUNT) as dep_employed_count_normalized,
normalize_scale(CD_DEP_COLLEGE_COUNT,CD_DEP_EMPLOYED_COUNT) as dep_college_count_normalized
from CUSTOMER_DEMOGRAPHICS
limit 1000000
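The JavaScript UDF above translates almost word for word to Python, zero guard and all:

```python
import math

def normalize_scale(x, y):
    """Scale x by the L2 norm of the (x, y) pair, guarding against zero vectors."""
    magnitude = math.sqrt(x ** 2 + y ** 2)
    return 0.0 if magnitude == 0 else x / magnitude

# A (3, 4) row becomes the unit-length vector (0.6, 0.8)
pair = (normalize_scale(3, 4), normalize_scale(4, 3))
```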

preprocessing.OneHotEncoder

Encode categorical integer features as a one-hot numeric array.

Whether you encode categories or labels is neither here nor there; it’s just an if statement. And since that would provide one feature per column, you can also just select the particular features you plan to use.

Let’s go back to CD_MARITAL_STATUS from CUSTOMER_DEMOGRAPHICS:

Before
select 
iff(cd_marital_status='S',true,false) as MARITAL_STATUS_S,
iff(cd_marital_status='D',true,false) as MARITAL_STATUS_D,
iff(cd_marital_status='W',true,false) as MARITAL_STATUS_W,
iff(cd_marital_status='U',true,false) as MARITAL_STATUS_U
from CUSTOMER_DEMOGRAPHICS
After

Note that you could easily roll up labels here, e.g. treat divorced and widowed as the same feature.

preprocessing.OrdinalEncoder

Encode categorical features as an integer array.

There’s really no point to this in a relational database; it’s not like you’re paying by the column. If you have a fixed number of features in an array, just break them out into their own columns and encode them separately.

preprocessing.PolynomialFeatures

Generate polynomial and interaction features.

These can be done trivially by adding the calculated columns you’re planning to use (using multiplication, SQUARE, etc).
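For instance, the degree-2 features for a pair of columns are just these products, each of which would be a calculated column in SQL:

```python
def polynomial_features(x, y):
    """Degree-2 polynomial and interaction terms for two columns,
    in scikit-learn's output order: [1, x, y, x^2, x*y, y^2]."""
    return [1.0, x, y, x * x, x * y, y * y]

row = polynomial_features(2.0, 3.0)  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```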

preprocessing.PowerTransformer

Apply a power transform featurewise to make data more Gaussian-like.

‘yeo-johnson’ and ‘box-cox’ are the two functions that can be used to build a normal distribution. While they aren’t too complex, there are some parameters that need to be automatically chosen, and this is an area where some building is required. Scikit-learn contains a lot of parameter optimization functions (in this case, Brent’s method), and this would be a big one to implement. Maybe in a future post?

preprocessing.QuantileTransformer

Transform features using quantiles information.

In a nutshell, the scikit-learn implementation:

  1. calculates a bunch of percentiles (1,000 by default)
  2. uses linear interpolation with these percentiles to map each value to a 0–1 range.

I’ll try to visualise this first. Using the C_BIRTH_YEAR column of the CUSTOMER table, we can imagine the years spread out across the x axis.

Choosing 10 as a simple number of percentiles, we can calculate these and plot them on the y axis. The first dot (10%, or 0.1) shows the birth year you’d have to choose to get 10% of values to its left and 90% to its right. If you were doing them one at a time, you could use one of Snowflake’s PERCENTILE_* functions, but we need to generate them all in a single query. For the query itself, I’ll switch over to SS_LIST_PRICE from STORE_SALES.

-- Make a 0-percentile row containing the minimum value 
-- (i.e. 100% of values fall to its right)
select 0 as quantile,(select min(SS_LIST_PRICE) from STORE_SALES) as quantile_value
union all
-- generate 10 percentile values (10% increments) and for each, determine the maximum value that
-- will divide the dataset row counts by that percentage
select quantile,max(case when rownum/numrows <= quantile then SS_LIST_PRICE end) as quantile_value
from
(
select
row_number() over (partition by null order by null) -1 as seq,
0.1+(seq/10) as quantile
from table(generator(rowcount => 10)) v
order by 1
) quantiles,
(
select SS_LIST_PRICE,
row_number() over (partition by NULL order by SS_LIST_PRICE) as rownum,
count(*) over (partition by NULL) as numrows
from STORE_SALES sample(10000 rows)
where SS_LIST_PRICE is not null
) totals
group by quantile
order by quantile_value

Which gives us the list price percentiles:

Now let’s bring it all together on a 2D chart:

Once we have our percentiles calculated, it’s a matter of going through every SS_LIST_PRICE value across the x axis, determining which two blue dots (percentiles) are the closest on either side of it, and then using the linear interpolation formula to determine its value on the y axis (the red question mark).

First, we need a linear interpolation function (often known as “lerp” by the character-count conscious):

create or replace function linear_interpolation(x1 float,y1 float,x2 float,y2 float, x float)
returns float
as
$$
y1 + ((x-x1)/(x2-x1)) * (y2-y1)
$$
;
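The identical formula in Python, handy for cross-checking the UDF:

```python
def linear_interpolation(x1, y1, x2, y2, x):
    """The y value at x on the straight line through (x1, y1) and (x2, y2)."""
    return y1 + ((x - x1) / (x2 - x1)) * (y2 - y1)

# Halfway along a line rising from 0.0 to 1.0:
midpoint = linear_interpolation(0.0, 0.0, 10.0, 1.0, 5.0)  # 0.5
```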

Then we expand our query further to tie it all together:

with quantile_values as(
-- Make a 0-percentile row containing the minimum value (i.e. 100% of values fall to its right)
select 0 as quantile,(select min(SS_LIST_PRICE) from STORE_SALES) as quantile_value
union all
-- generate 10 percentile values (10% increments) and for each, determine the maximum value that
-- will divide the dataset row counts by that percentage
select quantile,max(case when rownum/numrows <= quantile then SS_LIST_PRICE end) as quantile_value
from
(
select
row_number() over (partition by null order by null) -1 as seq,
0.1+(seq/10) as quantile
from table(generator(rowcount => 10)) v
order by 1
) quantiles,
(
select SS_LIST_PRICE,
row_number() over (partition by NULL order by SS_LIST_PRICE) as rownum,
count(*) over (partition by NULL) as numrows
from STORE_SALES sample(1000 rows)
where SS_LIST_PRICE is not null
) totals
group by quantile
order by quantile_value
)
select SS_LIST_PRICE as x,
(select max(b.quantile) from quantile_values b where b.quantile_value<a.SS_LIST_PRICE) as y1,
(select min(b.quantile) from quantile_values b where b.quantile_value>=a.SS_LIST_PRICE) as y2,
(select max(b.quantile_value) from quantile_values b where b.quantile_value<a.SS_LIST_PRICE) as x1,
(select min(b.quantile_value) from quantile_values b where b.quantile_value>=a.SS_LIST_PRICE) as x2,
coalesce(linear_interpolation(x1,y1,x2,y2,x),0) as y
from STORE_SALES a sample(1000 rows)
where SS_LIST_PRICE is not null
order by SS_LIST_PRICE

Note that the use of min & max in the final query is not ideal from a performance point of view, which is why I had to sample the table this time. We could instead build our own User Defined Table Function that did a one-shot pass of the data, leveraging what we know about the order of its values. But what we have serves to illustrate for now, and we are left with a distribution from 0 to 1:

After
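That one-shot idea is easy to see in Python, where bisect plays the role of those min/max subqueries. This is a sketch under the same 10-percentile setup, not an exact port of scikit-learn’s implementation:

```python
import bisect

def quantile_transform(values, n_quantiles=10):
    """Map each value into [0, 1] by interpolating between percentile knots."""
    ordered = sorted(values)
    n = len(ordered)
    # Knot k: (quantile, the value sitting at that quantile's position)
    knots = [(k / n_quantiles, ordered[round(k / n_quantiles * (n - 1))])
             for k in range(n_quantiles + 1)]
    knot_values = [v for _, v in knots]

    def transform(x):
        i = bisect.bisect_left(knot_values, x)  # first knot value >= x
        if i == 0:
            return knots[0][0]
        if i >= len(knots):
            return knots[-1][0]
        (q1, x1), (q2, x2) = knots[i - 1], knots[i]
        # Guard against duplicate knot values before interpolating
        return q2 if x2 == x1 else q1 + (x - x1) / (x2 - x1) * (q2 - q1)

    return [transform(x) for x in values]
```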

preprocessing.RobustScaler

Scale features using statistics that are robust to outliers.

This method uses the interquartile range, so instead of min and max, we are splitting down the median then splitting again on each side.

In this example, we’ll scale WR_RETURN_AMT in the WEB_RETURNS table.

Before
with med as (select
median(WR_RETURN_AMT) as median_ret_amt
from WEB_RETURNS),
lower_med as (select median(WR_RETURN_AMT) as lower_median
from WEB_RETURNS, med where WR_RETURN_AMT < median_ret_amt),
upper_med as (select median(WR_RETURN_AMT) as upper_median
from WEB_RETURNS, med where WR_RETURN_AMT > median_ret_amt)
select (WR_RETURN_AMT- lower_median) / (upper_median - lower_median) as wr_return_amount_scaled
from lower_med,upper_med,WEB_RETURNS
After
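A Python sketch of the same calculation. (Note that scikit-learn’s RobustScaler centres on the median before dividing by the IQR; the query above shifts by the lower quartile instead.)

```python
def robust_scale(values):
    """Scale by the interquartile range: (x - Q1) / (Q3 - Q1),
    with Q1/Q3 the medians of the lower/upper halves, as in the query above."""
    def median(xs):
        xs = sorted(xs)
        n, mid = len(xs), len(xs) // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2
    med = median(values)
    q1 = median([v for v in values if v < med])
    q3 = median([v for v in values if v > med])
    return [(v - q1) / (q3 - q1) for v in values]

scaled = robust_scale(list(range(1, 10)))  # Q1=2.5, Q3=7.5; 5 maps to 0.5
```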

preprocessing.StandardScaler

Standardize features by removing the mean and scaling to unit variance.

Using the same WEB_RETURNS column as in the previous example.

with aggregates as (
select avg(WR_RETURN_AMT) as avg_dep,
stddev(WR_RETURN_AMT) as stddev_dep
from WEB_RETURNS)
select (WR_RETURN_AMT - avg_dep)
/ (stddev_dep) as wr_return_amount_scaled
from aggregates,WEB_RETURNS
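One subtlety worth a Python sketch: scikit-learn’s StandardScaler uses the population standard deviation, while Snowflake’s STDDEV is the sample version (STDDEV_POP is the exact match):

```python
import math

def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

scaled = standard_scale([2.0, 4.0, 6.0])  # mean is 4, so the middle value maps to 0.0
```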

Conclusion

There are a few things I learned while writing this.

It works!

I’d call this exercise a roaring success. There’s a bit of boilerplate if you’re writing them as queries, but if they were being issued by front-end applications or as push-down statements in code libraries then that wouldn’t be a problem. PowerTransformer is doable, but I’d wait until I needed it.

Most of the functions are not scary

The sliding and squashing mostly involve standard aggregations. It’s easy to forget this when you’ve been calling them by different names for a long time.

However, moving from an imperative language like Python to a more declarative language like SQL does tend to call for a rethink rather than a direct port of the code.

Tableau is a good sidekick

Visualisations are the best reinforcement of what you’re trying to achieve. With Snowflake, you can easily do feature engineering on very large datasets without sampling, just remember it’s not a good idea to ask Tableau to put billions of dots on a chart. But sampling the final result is much better than sampling the initial input!

As always, comments and feedback welcome.
