No Data Science Team, No Problem

Kenny Nagano
10 min read · Sep 13, 2023


Leveraging Machine Learning for Time Series Forecasting, Anomaly Detection, and Contribution Analysis in Snowflake

Have you ever wondered how machine learning can enhance your data analysis without requiring you to become a machine learning expert? Snowflake’s ML-powered functions have the answer. These functions use machine learning to uncover hidden patterns in your data, making advanced insights accessible to anyone without deep knowledge of machine learning algorithms. The best part? All you need is your data.

We’ll dive into Snowflake’s ML-powered functions and explore how they can revolutionize your data analysis. Whether you’re interested in forecasting future metrics, detecting anomalies, or understanding the unexpected contributors to your data, Snowflake’s ML-powered functions have you covered. Say goodbye to complex machine learning development and hello to data-driven insights. These features are currently available to all of our customers in a public preview capacity. Preview features are provided primarily for evaluation and testing purposes. They should not be used in production systems or with production data. Let’s get started!

In this article, we will explore the general principles of machine learning functions within Snowflake and provide a detailed examination of the forecasting function. Part 2, covering other machine learning functions in Snowflake, will be featured in a future article.

Time-Series Functions

One of the key features of Snowflake’s ML-powered functions is their ability to handle time-series data. Time series data involves tracking data points over time, making it essential for applications like sales forecasting, stock price predictions, and demand forecasting. Snowflake’s time-series functions use machine learning models to analyze historical data and provide valuable insights and predictions. Here are some of the functionalities they offer:

1. Forecasting: Forecasting predicts future metric values based on historical data trends. For instance, you can forecast future sales figures based on past sales data, taking into account seasonality and other factors.

2. Anomaly Detection: Anomaly detection is crucial for identifying irregularities in your data. ML-powered functions can flag metric values that deviate significantly from expected patterns, helping you catch issues early.

3. Contribution Explorer: This function helps you unearth dimensions and values that have unexpected impacts on your metrics. It’s a powerful tool for understanding the drivers behind your data.
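Forecasting gets a full walkthrough below; anomaly detection and Contribution Explorer will be covered in Part 2. As a quick preview, a minimal anomaly detection sketch might look like this (the view and column names are placeholders, and an empty LABEL_COLNAME requests unsupervised detection; verify details against the docs):

-- Train an unsupervised anomaly detection model on historical data.
CREATE SNOWFLAKE.ML.ANOMALY_DETECTION my_anomaly_model(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'historical_metrics'),
    TIMESTAMP_COLNAME => 'ts',
    TARGET_COLNAME => 'metric',
    LABEL_COLNAME => ''
);

-- Flag values in new data that deviate from learned patterns.
CALL my_anomaly_model!DETECT_ANOMALIES(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'new_metrics'),
    TIMESTAMP_COLNAME => 'ts',
    TARGET_COLNAME => 'metric'
);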

Understanding the Forecasting Algorithm

In this section, we will explore some technical aspects for readers with a data science background. The best part of Snowflake ML Functions is that you don’t need to understand these details to use the forecasting function effectively, but a high-level sense of how it operates is still useful.

Snowflake’s forecasting algorithm relies on a gradient boosting machine (GBM), a powerful machine learning technique. The algorithm uses historical target data and timestamp data to model non-stationary trends and auto-regressive lags, and it automatically generates cyclic calendar features, such as day of the week and week of the year, from timestamps. You can also incorporate exogenous variables (additional influencing factors) into your model. These are factors that influence the target variable but are not derived from the target series itself; for example, weather data, company holiday data, or economic indicators.

Snowflake’s algorithm does not rely on one-hot encoding for categorical features, which makes it versatile and capable of handling high-cardinality data. Categorical columns with many distinct values can be difficult to model with one-hot encoding, but Snowflake’s algorithm handles them without any problems.

In addition to forecasts, the Snowflake forecasting function also generates prediction intervals. A prediction interval is an estimated range, with an upper and a lower bound, within which a certain percentage of future data points is expected to fall. For example, a value of 0.95 means that 95% of the data is expected to fall within the interval. The default prediction interval is 0.95, but you can specify a different value. The lower and upper bounds are returned as part of the forecast output.

Getting Ready for Forecasting

In preparation for using forecasting with Snowflake, there are several important steps to consider. First, you need to choose a suitable virtual warehouse for training and running your models. Additionally, you must grant the necessary privileges to create forecast objects. Modifying your search path to include SNOWFLAKE.ML is also recommended.
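For example, this session-level statement (following the pattern shown in the Snowflake docs) appends SNOWFLAKE.ML to the search path:

-- Let ML functions be referenced without fully qualified names.
ALTER SESSION SET SEARCH_PATH = '$current, $public, SNOWFLAKE.ML';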

Selecting the right virtual warehouse is crucial for efficient forecasting. A Snowflake virtual warehouse provides the computational resources needed for training and using machine learning models, with training being the resource-intensive step. For single-series data, the primary factor to consider is the number of rows in your training dataset. You can comfortably train a model with up to 150,000 rows on a standard warehouse of any size without encountering memory issues. For datasets with more than 150,000 but fewer than 500,000 rows, a Snowpark-optimized warehouse is suggested. Datasets exceeding 500,000 rows are not currently supported.

It’s worth noting that warehouse size does not significantly affect training time on standard or Snowpark-optimized warehouses. For instance, on an XS warehouse, most training workloads on datasets with 150,000 rows or fewer take only two or three minutes to complete, regardless of the number of exogenous variables, and larger warehouses do not meaningfully reduce that time. On Snowpark-optimized warehouses, training time is roughly proportional to dataset size.

Regardless of the warehouse type, Snowflake recommends creating a dedicated warehouse for training your models to achieve optimal performance.
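For example, a minimal sketch of a dedicated training warehouse (the warehouse name is a placeholder; Snowpark-optimized warehouses start at size MEDIUM):

-- Dedicated warehouse for ML Function training runs; Snowpark-optimized
-- for datasets between 150,000 and 500,000 rows.
CREATE OR REPLACE WAREHOUSE ml_training_wh WITH
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';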

When working with multi-series data, training performance depends on several factors: the number of rows, the number of time series in the dataset, and the size and type of the warehouse. Any time estimates are approximate, may vary depending on your dataset, and encompass both the training and inference steps of the forecasting process. Rough estimates for multi-series workloads are published in the Snowflake documentation.

After sizing the warehouse, you will need to grant the right privileges to use ML Functions. Training a forecasting model produces a schema-level object. Therefore, the role you use to create models must have the CREATE SNOWFLAKE.ML.FORECAST privilege on the schema where the model will be created. This privilege is similar to other schema privileges like CREATE TABLE or CREATE VIEW.

Refer to the docs to make sure your roles have the right privileges for using ML Functions; they include scripts to grant the proper privileges:

https://docs.snowflake.com/en/user-guide/ml-powered-forecasting
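As a sketch (the database, schema, and role names here are placeholders), such a grant looks like this:

-- Allow the analyst role to create forecast model objects in the schema.
GRANT CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA my_db.my_schema TO ROLE analyst;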

Now let’s test out the forecasting function

In preparation for the upcoming examples, we’ve established two essential tables. First, there’s the ‘sales_data’ table, housing all our sales data, complete with store IDs, item identifiers, timestamps, sales amounts, and even exogenous factors like temperature, humidity, and holidays. Then, we have the ‘future_features’ table, which holds future values of these exogenous variables — a crucial component when conducting forecasts that take these factors into account. These tables serve as the foundation for the examples you’ll encounter later in this topic.

CREATE OR REPLACE TABLE sales_data (store_id NUMBER, item VARCHAR, date TIMESTAMP_NTZ,
sales FLOAT, temperature NUMBER, humidity FLOAT, holiday VARCHAR);

INSERT INTO sales_data VALUES
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL);

-- future values for exogenous variables (additional features)
CREATE OR REPLACE TABLE future_features (store_id NUMBER, item VARCHAR,
date TIMESTAMP_NTZ, temperature NUMBER, humidity FLOAT, holiday VARCHAR);

INSERT INTO future_features VALUES
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL);

Training, Using, Viewing, Deleting, and Updating Models

The first step is to use CREATE SNOWFLAKE.ML.FORECAST to create and train a model. The model is trained on the dataset you provide.

CREATE SNOWFLAKE.ML.FORECAST <model_name>(...);

See the FORECAST reference in the Snowflake docs for complete details about the SNOWFLAKE.ML.FORECAST constructor. For examples of creating a model, see the Examples section below.

SNOWFLAKE.ML.FORECAST runs using limited privileges, so by default it does not have access to your data. You must therefore pass tables and views as references, which pass along the caller’s privileges. You can also provide a query reference instead of a reference to a table or a view.

To generate a forecast, call the model’s <model_name>!FORECAST method:

CALL <model_name>!FORECAST(...)
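Because a trained model is a regular schema-level object, you can also view and delete models. A quick sketch based on the docs (the model name is a placeholder; verify the exact syntax there):

-- List the forecast models in the current schema.
SHOW SNOWFLAKE.ML.FORECAST;

-- Drop a model you no longer need.
DROP SNOWFLAKE.ML.FORECAST my_model;

Models are immutable, so to update one, drop it and train a new model on fresh data.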

Producing Forecasts

After creating your model, you can produce a forecast with <model_name>!FORECAST(...). You can create forecasts from multiple datasets using the same model.

The following examples demonstrate how to create models and produce forecasts from them; consult the <model_name>!FORECAST documentation for details on all the parameters.

To use a model with a dataset that has column names different from those in the dataset that was used to train the model, create a view or a query that renames the columns (using AS) to the names the model expects. Use that view or query as the input to the FORECAST method.
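For instance, a hedged sketch of such a renaming view (the source table and column names here are hypothetical):

-- Rename columns to match the names the model was trained on.
CREATE OR REPLACE VIEW renamed_input AS
SELECT sale_date AS date, amount AS sales
FROM other_sales_data;

You would then pass renamed_input as the input reference when calling the FORECAST method.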


Examples

The following examples demonstrate using time series forecasting for various use cases.

Forecasting on a Single Series

This example uses a single time series (that is, all the rows are part of a single series) that has two columns, a timestamp column and a target value column, without additional features. First, prepare the data set from sales_data to train the model. The following code creates a view named v1:

CREATE OR REPLACE VIEW v1 AS SELECT date, sales
FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v1;

The SELECT statement returns 12 rows, beginning with:

+-------------------------+-------+
| DATE                    | SALES |
+-------------------------+-------+
| 2020-01-01 00:00:00.000 |     2 |
| 2020-01-02 00:00:00.000 |     3 |
| 2020-01-03 00:00:00.000 |     4 |
| 2020-01-04 00:00:00.000 |     5 |
| 2020-01-05 00:00:00.000 |     6 |
+-------------------------+-------+

The following statement trains a forecasting model using the above dataset:

CREATE SNOWFLAKE.ML.FORECAST model1(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v1'),
    TIMESTAMP_COLNAME => 'date',
    TARGET_COLNAME => 'sales'
);

The following message appears after the model is trained:

Instance MODEL1 successfully created.

The forecasting model is now available as model1. To forecast the next three timestamps:

CALL model1!FORECAST(FORECASTING_PERIODS => 3);

Output (note that the model has inferred the interval between timestamps from the training data):

+-------------------------+----------+-------------+-------------+
| TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------------+----------+-------------+-------------+
| 2020-01-13 00:00:00.000 |       14 |          14 |          14 |
| 2020-01-14 00:00:00.000 |       15 |          15 |          15 |
| 2020-01-15 00:00:00.000 |       16 |          16 |          16 |
+-------------------------+----------+-------------+-------------+

In this example, because the forecast yields a perfectly linear prediction that has zero errors compared to the actual values, the prediction interval (LOWER_BOUND, UPPER_BOUND) is the same as the FORECAST value.

To customize the size of the prediction interval, pass prediction_interval as part of a configuration object:

CALL model1!FORECAST(FORECASTING_PERIODS => 3, CONFIG_OBJECT => {'prediction_interval': 0.8});

To save your results directly to a table, capture the query ID of the forecast call with SQLID and materialize the output via RESULT_SCAN (here the results land in a table named my_forecasts):

EXECUTE IMMEDIATE $$
BEGIN
    CALL model1!FORECAST(FORECASTING_PERIODS => 3);
    LET x := SQLID;
    CREATE TABLE my_forecasts AS SELECT * FROM TABLE(RESULT_SCAN(:x));
END;
$$;

SELECT * FROM my_forecasts;
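One more note: the future_features table we created earlier is what enables forecasting with exogenous variables. A hedged sketch of that flow (the view names are ours; verify the parameter details in the docs): train on a view that includes the feature columns, then pass future feature values when forecasting.

-- Train on the target plus exogenous columns (temperature, humidity, holiday).
CREATE OR REPLACE VIEW v2 AS SELECT date, sales, temperature, humidity, holiday
FROM sales_data WHERE store_id=1 AND item='jacket';

CREATE SNOWFLAKE.ML.FORECAST model2(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v2'),
    TIMESTAMP_COLNAME => 'date',
    TARGET_COLNAME => 'sales'
);

-- The model needs future values of the same features to forecast.
CREATE OR REPLACE VIEW v2_future AS SELECT date, temperature, humidity, holiday
FROM future_features WHERE store_id=1 AND item='jacket';

CALL model2!FORECAST(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v2_future'),
    TIMESTAMP_COLNAME => 'date'
);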

Just like that, we have trained and used our first Snowflake machine learning function. No data science background needed; just leverage the power of Snowflake to help with your forecasting.

Challenges and Future Enhancements

While Snowflake’s ML-powered functions for time series forecasting are impressive, they do come with some limitations during this preview release:

  • You cannot customize the forecasting algorithm.
  • Models are capped at 500,000 rows of training data.
  • Parameters such as trend, seasonality, and seasonal amplitudes cannot be set manually; they are inferred from the data.
  • The training data must follow a consistent time granularity (the model infers the interval between timestamps).
  • Models are immutable and do not support versioning.

Cost Considerations

While ML-powered functions in Snowflake offer tremendous value, it’s essential to consider the associated costs. Using these functions incurs storage and compute costs that can vary based on factors like the specific function used and the volume of data used for training and predictions. To optimize costs, consider periodically deleting unused or outdated models.

Summary

In conclusion, Snowflake’s ML-powered functions empower you to extract valuable insights and make accurate predictions from your time series data without the need for advanced machine learning expertise. Whether you’re forecasting sales, detecting anomalies, or exploring data contributions, these functions provide a robust toolkit for data-driven decision-making. In my next article, I will go into the anomaly detection and contribution explorer functions. Stay tuned for further refinements and enhancements as Snowflake continues to evolve its machine learning capabilities. Happy Searching!
