Forecasting with Snowflake Cortex ML-based Functions

Predicting sales demand based on retail sites search terms

Photo by Andrea Schettino on Pexels

According to the Snowflake documentation, Cortex ML-based functions use machine learning to detect patterns in your data. You don’t have to be a machine learning developer to take advantage of them.

I thought it would be fun to explore the forecasting functions directly in SQL, which can be done quickly and easily from Snowsight worksheets. To get some sample data, I browsed the Snowflake Marketplace where I found the Onsite Search — Product demand analysis on retail sites and marketplaces data set provided by Similarweb Ltd. They offer a free version of their dataset which I got into my Snowflake account, naming the database ONSITE_SEARCH. There is a schema named DATAFEEDS and a table named ON_SITE_SEARCH.

The description of this data set says that it can be used to understand which brands and products consumers are searching for to keep up with market demand. Based on this data, I can use the Cortex ML-based forecasting functions in Snowflake to predict future demand.

Preliminary data analysis

With the sample data ready, I did some preliminary data analysis. Using a few simple SQL queries, I discovered that the free dataset includes data for keyword search on Amazon.com from the US between January 1, 2021 until June 30, 2022.

Here is an overall representation of all keyword searches by day for the entire dataset using the Chart feature in Snowsight worksheets to visualize the results:

-- all keyword searches by day
select date, count(*) as cnt
from ONSITE_SEARCH.DATAFEEDS.ON_SITE_SEARCH
group by all;
All keyword searches by day

It is obvious from the diagram that there is an outlier on June 21, 2021. After some digging I found out that this was Amazon Prime day, an annual shopping holiday sponsored by Amazon. Another outlier can be spotted between November 20 and 25, 2021 which is the time leading up to and including the Black Friday sales.

The keyword search also appears to be larger than usual during the month of December when shoppers usually look for Christmas gifts. There is also a noticeable slight dip in keyword search on December 24, 2021 which is Christmas Eve when the gifts have been bought and consumers probably spend more time with their families and less time on-line shopping. There appears to be a general downwards trend of keyword searches since the beginning of 2021, presumably due to Covid when consumers were shopping on-line more than usual.

It was fun also looking at a few individual keyword searches, for example toys:

-- keyword searches for toys by day
select date, sum(calibrated_visits) as cnt_toys
from ONSITE_SEARCH.DATAFEEDS.ON_SITE_SEARCH
where oss_keyword ilike '%toys%'
group by all;
Keyword searches for toys by day

As we can see from the chart, in addition to the two outliers related to Amazon Prime day and Black Friday sales, there is also a significant peak in keyword search during the month of December, presumably due to Christmas gift shopping when more toys are bought for children as compared to other items.

Some other keyword searches display different trends, for example backpacks are in highest demand during August for back-to-school shopping. Cat litter, on the other hand, doesn’t show peaks in keyword search, because cats require litter regularly, regardless of holiday seasons or special sales events.

But enough exploring, let’s go to forecasting. I will forecast the demand for toys in the future based on historical keyword searches.

Forecasting toy demand

Let’s start with a basic time-series forecasting model. To create a forecast, the model requires historical data based on which it will produce univariate predictions of future data.

The historical data must include:

  • a timestamp column, in our case the date of the search
  • a value column, in our case the number of keyword searches

Before we can start forecasting, let’s do some prep work:

-- create a role that will do the forecasting
use role useradmin;
create role data_scientist;

-- grant the role to myself so I can test it
grant role data_scientist to <my current user>;

-- grant privileges on the historical data database to the new role
use role accountadmin;
grant imported privileges on database onsite_search to role data_scientist;

-- create a warehouse, database, and schema and grant privileges
use role sysadmin;
create database forecast_db;
grant usage on database forecast_db to role data_scientist;
create schema time_series;
grant all on schema time_series to role data_scientist;
create warehouse forecast_wh with warehouse_size = 'xsmall';
grant usage on warehouse forecast_wh to role data_scientist;

-- continue working with the newly created role
use role data_scientist;
use warehouse forecast_wh;
use database forecast_db;
use schema time_series;

We will use the DATA_SCIENTIST role from now on. First, we must prepare the historical data for all keyword searches that match '%toys%'. To do that, we will create a view as follows:

-- create a view that represents historical data for toys keyword search
create or replace view toys_keyword_data as
select
date::TIMESTAMP_NTZ AS search_date,
sum(calibrated_visits) AS search_cnt
from ONSITE_SEARCH.DATAFEEDS.ON_SITE_SEARCH
where oss_keyword ilike '%toys%'
group by all;

Using this historical data, we can now train our model. We will call the SNOWFLAKE.ML.FORECAST function, to which we must provide the input data (in our case the TOYS_KEYWORD_DATA view that we created earlier), and we must specify which column represents the timestamp and which column represents the target value:

-- train the model on historical data
create SNOWFLAKE.ML.FORECAST toys_model(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'toys_keyword_data'),
TIMESTAMP_COLNAME => 'search_date',
TARGET_COLNAME => 'search_cnt'
);

Note that we have provided the view TOYS_KEYWORD_DATA as a reference because by default, the FORECAST function does not have access to data. By passing the table or view as a reference, we pass along the caller’s privileges.

We now have a trained model named TOYS_MODEL stored as an object in the current schema.

To predict future values, we call the FORECAST function of this model and provide a single parameter which is the number of time periods in the future. Since the historical data is in days, the future time period must also be specified in days. To predict future values for the next 6 months, the parameter value is 184 days:

-- call the model to predict future values
call toys_model!FORECAST(FORECASTING_PERIODS => 184);

The output of the FORECAST function contains four columns:

  • TS — the timestamp of the predicted value
  • FORECAST — the predicted value
  • LOWER_BOUND — the prediction lower bound
  • UPPER_BOUND — the prediction upper bound

To save these results to a table named TOYS_MODEL_FORECAST so that we can analyze and plot them later, we can use the RESULT_SCAN() function:

-- save the results of the FORECAST output as a table
create or replace table toys_model_forecast as
select * from table(result_scan(last_query_id()));

We want to plot the historical and predicted values on the same diagram. We can construct the following query to combine these values and display the results as a line chart in Snowsight:

-- combine historical and predicted values in a single query
select search_date, search_cnt, null as forecast
from toys_keyword_data
union all
select ts as search_date, null as search_cnt, forecast
from toys_model_forecast;
Historical and predicted values for toys keyword search

The results appear underwhelming. Although the FORECAST function predicted data for the next 6 months or 184 days, the predicted values are quite stable. We don’t see a peak during Black Friday or December, which we would expect.

The reason that we don’t see the outlying values is that the forecasting algorithm uses rolling averages of historical data to predict trends. It does recognize cyclic calendar features such as day of week and week of year from timestamp data, but it does not predict based on outliers.

To improve the model, we will include exogenous data, also referred to as features, that might have influenced the target value. In our case these features are Amazon Prime day, Black Friday sale, and Christmas shopping. Let’s build an improved model with these additional features.

Forecasting with features

Let’s enhance the historical data with additional features. We will use the query that we used to create the TOYS_KEYWORD_DATA historical data, but we will add three additional features:

  • AMAZON_PRIME_DAY_FLG — a flag to indicate that the Amazon Prime day sale took place on that date
  • BLACK_FRIDAY_FLG — a flag to indicate the days leading up to and including the Black Friday sales
  • CHRISTMAS_SHOPPING_FLG — a flag to indicate the Christmas shopping period, roughly between December 5 and December 23
-- create a view that represents historical data for toys keyword search
-- adding features to indicate Amazon Prime day, Black Friday and Christmas shopping
create or replace view toys_data_with_features as
select
date::TIMESTAMP_NTZ AS search_date,
sum(calibrated_visits) AS search_cnt,
case
when date = '2021-06-21' then 1
else 0
end as amazon_prime_day_flg,
case
when date >= '2021-11-21' and date <= '2021-11-26' then 1
else 0
end as black_friday_flg,
case
when date >= '2021-12-05' and date <= '2021-12-23' then 1
else 0
end as christmas_shopping_flg
from ONSITE_SEARCH.DATAFEEDS.ON_SITE_SEARCH
where oss_keyword ilike '%toys%'
group by search_date, amazon_prime_day_flg, black_friday_flg, christmas_shopping_flg
;

Just like earlier, we can train a model on this historical data. Although we now have additional features, the parameters that we pass to the FORECAST function are the same. The model will treat all columns that are not the timestamp or the target value as additional features.

-- train the model on historical data
create SNOWFLAKE.ML.FORECAST toys_model_with_features(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'toys_data_with_features'),
TIMESTAMP_COLNAME => 'search_date',
TARGET_COLNAME => 'search_cnt'
);

Just like before, the trained model is stored as an object named TOYS_MODEL_WITH_FEATURES in the current schema.

To predict future values, we will call the FORECAST function of this model. Since we have additional features, we can’t call the model like earlier by specifying the number of time periods in the future. Instead, we must prepare data that represents future features during the time period for which we want to predict values.

We still want to predict values for the next 6 months or 184 days like earlier, but we will now construct a view named FUTURE_FEATURES with future features for the desired number of periods.

We can use the GENERATOR function to generate 184 rows, assign a date to each of the rows and add flags to indicate that Amazon Prime day will be between July 11 and 13 in 2022, Black Friday will be between November 20 and 25, and Christmas shopping between December 5 and 23.

-- create a view with future features for the forecasting period
create or replace view future_features as
select
dateadd(
'day',
row_number() over (partition by null order by null),
'2022-06-30'
) as search_date,
case
when search_date >= '2022-07-11' and search_date <= '2022-07-13' then 1
else 0
end as amazon_prime_day_flg,
case
when search_date >= '2022-11-20' and search_date <= '2022-11-25' then 1
else 0
end as black_friday_flg,
case
when search_date >= '2022-12-05' and search_date <= '2022-12-23' then 1
else 0
end as christmas_shopping_flg
from table (generator(rowcount => 184));

To predict future values, we will again call the FORECAST function of the TOYS_MODEL_WITH_FEATURES model, but instead of providing the number of time periods, we will provide the view that contains future features and the name of the timestamp column. The FORECAST function will know for how many periods to forecast values based on the number of periods that are included in the FUTURE_FEATURES view. Any column that is not the timestamp column in this view is considered a feature column.

-- call the model to predict future values
call toys_model_with_features!FORECAST(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'future_features'),
TIMESTAMP_COLNAME =>'search_date'
);

Like previously, we will save the output of the FORECAST function to a table named TOYS_MODEL_FORECAST_WITH_FEATURES so that we can analyze and plot it later:

-- save the results of the FORECAST output as a table
create or replace table toys_model_forecast_with_features as
select * from table(result_scan(last_query_id()));

Then we will construct a query that combines historical and predicted values and display it as a chart in Snowsight:

-- combine historical and predicted values in a single query
select search_date, search_cnt, null as forecast
from toys_data_with_features
union all
select ts as search_date, null as search_cnt, forecast
from toys_model_forecast_with_features;
Historical and predicted values for toys keyword search with features

As we can see, the predicted values are better than in the first model. The predicted values in December are considerably higher because of the Christmas shopping period. The model didn’t do so well in predicting Amazon Prime day and Black Friday sale, because these happen on a limited number of days that don’t affect rolling averages as much as the Christmas shopping period in December.

Another useful feature of the forecasting models is that it can explain the relative importance of all features used in the model. To view the feature importance of the TOYS_MODEL_FORECAST_WITH_FEATURES model, we can call the EXPLAIN_FEATURE_IMPORTANCE function and examine the output:

-- explain feature importance
call toys_model_with_features!EXPLAIN_FEATURE_IMPORTANCE();
Output of the EXPLAIN_FEATURE_IMPORTANCE function

As we can see, the model inferred that the frequency of the historical data repeating time periods is 7 days which makes sense, since shopping does usually have different trends on different days of the week and follows a weekly cadence. It thus makes sense to train the model on variables with 7 days’ lag. This also helps to explain why the December Christmas shopping period predictions were better — because the interval is larger than 7 days, indicating that the trend lasts longer — as compared to Amazon Prime day and Black Friday sales which each take less than a week and therefore have less impact within the 7 day interval.

Final thoughts

As is the case with machine learning, we can do much more to improve the accuracy of the model. We must decide how to treat outliers, how to deal with missing values, how to detect and resolve data anomalies, and so on. We can split the data into training and validation data sets to validate the model against known outcomes so that we can be confident that the model is performing well. We can add additional features to improve the accuracy of the model.

Snowflake is constantly rolling out new features that enable not only data scientists, but a wider audience to work with machine learning. Some of the features that were announced at Snowday 2023, such as the Snowpark Model Registry, the Snowflake Feature Store, and Snowflake Notebooks, to name a few, will make machine learning in Snowflake even easier.

--

--