Empowering Analysts and Business Users: Bridging the Gap with the Snowflake ML-Powered Anomaly Detection Function


Machine learning (ML) has transformed the way data is analyzed, insights are derived, and informed decisions are made, opening up new possibilities and opportunities in data analysis. In Part 1 of this series on anomaly detection for time series data, we analyzed an unsupervised learning approach for detecting anomalies in a manufacturing setting. With its technical complexity, however, ML can pose a challenge for non-technical individuals such as data analysts and business professionals who have limited ML skills but want to carry out initial exploration and analysis. Fortunately, there is a way to bridge this gap and empower these users to harness the power of ML without extensive technical knowledge.

Snowflake's ML-powered functions are the answer to this challenge. They are designed to be user-friendly and provide a simplified way to apply ML capabilities. The previous approach required deep technical expertise in the ML algorithms best suited to the data, plus an entire data science team to implement them. Now, by leveraging ML-powered functions, data analysts and business professionals can apply handy functions that wrap those ML algorithms and techniques, gaining valuable insights and making informed decisions in under 10 minutes. This helps them efficiently detect anomalies in their data, uncovering hidden patterns that may otherwise have gone unnoticed.

In this blog post, we will delve into the ML-powered function (in Public Preview) specifically developed for anomaly detection. There are other ML-powered functions, such as Forecasting and Budgets, that are not in scope here. We will approach this like a typical machine learning project.

Data Preprocessing and Exploration

We will be using the open-source NASA Bearings sensor multivariate time series dataset that was also employed in Part 1 of this blog. A multivariate time series is a set of time series variables where each variable depends not only on its own past values but also, to some degree, on the other variables; this dependency is utilized to forecast future values. When dealing with multivariate time series data, it is important to understand the interdependencies between the variables. By analyzing how each variable influences the others, we can uncover valuable insights and make more accurate predictions. In a weather dataset, for example, the precipitation percentage, dew point, wind speed, and cloud cover percentage can all have an impact on the temperature; by considering these variables together, we can improve our temperature predictions.

Though multivariate time series is not supported, the good thing about this SQL function is that we can carry out anomaly detection for a single series or for multiple series (multiple columns combined as a VARIANT into a single series column, or simply several univariate analyses), as sketched below.
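
For multiple series, the input is typically reshaped so that a single column identifies each series. The sketch below is illustrative only: SENSOR_2 and the view name are assumptions, not columns from the original dataset.

-- A minimal sketch of the multi-series shape: unpivot sensor columns into one
-- (series, timestamp, value) layout. SENSOR_2 is a hypothetical column name.
create or replace view sensor_vw_multi as
select 'TEMPERATURE' as series, measure_ts, temperature as value
from sensor_prepared
union all
select 'SENSOR_2' as series, measure_ts, sensor_2 as value
from sensor_prepared;

A model trained on such a view would pass SERIES_COLNAME => 'SERIES' in addition to the parameters shown later in this post.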

First, let's take a quick look at the data, which is unlabeled in this case. There are four sensor readings and one timestamp column. The data is straightforward: readings arrive at 10-minute intervals and there are no categorical columns. Since these are simple SQL functions and statements, all the operations are carried out in Snowsight, which makes the whole flow friendly for data analysts and business users alike.
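
For instance, a couple of quick queries in a Snowsight worksheet are enough to confirm the shape and grain of the data (SENSOR_PREPARED is the table used throughout this post):

-- Peek at the raw rows
select * from SENSOR_PREPARED order by MEASURE_TS limit 10;

-- Confirm the time span, row count, and 10-minute grain
select min(MEASURE_TS) as first_ts,
       max(MEASURE_TS) as last_ts,
       count(*) as row_count
from SENSOR_PREPARED;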

Fig 1: Dataset exploration in Snowsight

We are not carrying out any feature engineering for this dataset, though some datasets may require it. The algorithm does not rely on one-hot encoding when training on categorical features, so you can use categorical data with many dimensions (high cardinality).

Create two views containing the train and test data. The sensor data is present in the SENSOR_PREPARED table.

create or replace view sensor_vw_train as
select * from SENSOR_PREPARED
where MEASURE_TS < '2023-02-15 11:00:00.000';

create or replace view sensor_vw_test as
select * from SENSOR_PREPARED
where MEASURE_TS >= '2023-02-15 11:00:00.000';
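
A quick sanity check on the split confirms where the boundary falls; this is just a convenience query against the two views above:

-- Row counts and boundary timestamps on either side of the split
select 'train' as split, count(*) as row_count, max(MEASURE_TS) as last_ts
from sensor_vw_train
union all
select 'test', count(*), max(MEASURE_TS)
from sensor_vw_test;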

Model Training

Snowflake has streamlined the process by selecting the ideal model for you. It’s worth noting that our anomaly detection algorithm leverages a gradient boosting machine (GBM). Similar to an ARIMA model, it employs a differencing transformation to handle data with a non-stationary trend and uses auto-regressive lags of the historical target data as model features.

To create an object that you can use to detect anomalies, we use the SQL function SNOWFLAKE.ML.ANOMALY_DETECTION:

CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION admodel_for_sensordetection(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sensor_vw_train'),
  TIMESTAMP_COLNAME => 'MEASURE_TS',
  TARGET_COLNAME => 'TEMPERATURE',
  LABEL_COLNAME => '');

In this scenario, I would like to identify abnormalities originating from a single sensor, specifically the temperature sensor (a single series). If your training dataset includes hand-labeled anomalies, you can pass the label column, which helps ensure that future anomalies are identified more accurately (see the sketch below). I will use a standard warehouse, as my training job processes a small dataset and there are no exogenous features; for larger training jobs, Snowflake suggests using a Snowpark-optimized warehouse.
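
For reference, the labeled variant only changes the last argument; here LABEL is a hypothetical boolean column in the training view marking known anomalies:

-- Hypothetical: train with hand-labeled anomalies (LABEL is an assumed column)
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION admodel_for_sensordetection(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sensor_vw_train'),
  TIMESTAMP_COLNAME => 'MEASURE_TS',
  TARGET_COLNAME => 'TEMPERATURE',
  LABEL_COLNAME => 'LABEL');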

Detection

Now that the train and test datasets and the model have been created, the last step is calling <model_name>!DETECT_ANOMALIES to detect anomalies. Before that, let's create a table to persist the results so we can analyze them whenever needed.

create or replace table anomaly_detection_results (
ts timestamp_ntz,
y float,
forecast float,
lb float,
ub float,
is_anomaly boolean,
percentile float,
distance float
);

where:

  • ts - the measure timestamp
  • y - the value of the time series
  • forecast - the value predicted by the model for that timestamp
  • lb - lower bound of the prediction interval
  • ub - upper bound of the prediction interval
  • is_anomaly - TRUE if the value is an anomaly; FALSE if not
  • percentile - percentile of the prediction interval for the y value
  • distance - number of standard deviations of the value from the FORECAST column (z-score)

The inference step generally takes around 1 second to process 10 rows of the input dataset, independent of warehouse size, so a larger warehouse won't necessarily yield faster results. By specifying the size of your prediction interval, you define a range between an upper and lower limit within which a given percentage of the data is expected to fall. A narrower prediction interval implies a stricter definition of "normal" and can therefore flag more anomalies.

CALL admodel_for_sensordetection!DETECT_ANOMALIES(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sensor_vw_test'),
  TIMESTAMP_COLNAME => 'MEASURE_TS',
  TARGET_COLNAME => 'TEMPERATURE',
  CONFIG_OBJECT => {'prediction_interval':0.90});

insert into anomaly_detection_results (ts, y, forecast, lb, ub, is_anomaly, percentile, distance)
select * from table(result_scan(last_query_id()));

LAST_QUERY_ID returns the ID of the most recent query in the current session, which in this case is the <model_name>!DETECT_ANOMALIES call, and RESULT_SCAN returns the result set of that previous command.
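
With the results persisted, pulling out just the flagged rows is straightforward; for example:

-- Inspect only the flagged anomalies, strongest deviations first
select ts, y, forecast, lb, ub, distance
from anomaly_detection_results
where is_anomaly
order by abs(distance) desc;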

Last but not least, an automated pipeline can be created for ongoing training on new data and batch anomaly detection. The Snowflake Task below automates the training side of this pipeline; a separate task handles inference as needed (a sketch follows the training task).

CREATE OR REPLACE TASK ad_model_training_task
  WAREHOUSE = ML_WH
  SCHEDULE = '3600 MINUTE'
AS
EXECUTE IMMEDIATE
$$
BEGIN
  CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION admodel_for_sensordetection(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sensor_vw_train'),
    TIMESTAMP_COLNAME => 'MEASURE_TS',
    TARGET_COLNAME => 'TEMPERATURE',
    LABEL_COLNAME => '');
END;
$$;
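
The inference side can be sketched the same way. The task name and schedule below are illustrative, and in practice you would scope INPUT_DATA to only the rows that arrived since the last run (for example via SYSTEM$QUERY_REFERENCE) to avoid re-inserting duplicates:

CREATE OR REPLACE TASK ad_detection_task
  WAREHOUSE = ML_WH
  SCHEDULE = '60 MINUTE'
AS
EXECUTE IMMEDIATE
$$
BEGIN
  -- Score the latest data with the most recently trained model
  CALL admodel_for_sensordetection!DETECT_ANOMALIES(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'sensor_vw_test'),
    TIMESTAMP_COLNAME => 'MEASURE_TS',
    TARGET_COLNAME => 'TEMPERATURE',
    CONFIG_OBJECT => {'prediction_interval':0.90});
  -- Persist the scored rows for downstream analysis
  INSERT INTO anomaly_detection_results (ts, y, forecast, lb, ub, is_anomaly, percentile, distance)
    SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
END;
$$;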

It only takes 10 minutes to set this up, and guess what?

If you have an extra 5 minutes, you can quickly visualize the results in a chart. As you may recall from the previous post, the initial data points were collected from a sensor that was functioning properly before it gradually started to deteriorate. Voila! The ML function accurately detected it, highlighting the anomalies as TRUE (blue line).
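
If you want to reproduce the chart, a query like the following gives Snowsight the columns to plot (actual value, forecast, and the anomaly flag):

-- Feed a Snowsight line chart: actual vs. forecast, with the anomaly flag
select ts, y, forecast, is_anomaly
from anomaly_detection_results
order by ts;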

Fig 2: Visualize results in Snowsight

Limitations (at the time of writing this):

Keeping in mind that the Anomaly Detection SQL function is aimed at quick exploration and initial analysis, it comes with a reasonable set of limitations:

  • The algorithm does not support multivariate anomaly detection.
  • Point anomalies are often not detected.
  • Models cannot be trained on more than 500,000 rows of data.
  • Timestamps in your data must represent fixed time intervals.
  • The minimum acceptable granularity of data is one second. If your data is at sub-second granularity, group it to bring the grain up to one second (see the sketch after this list).
  • The model does not specifically detect change points. This functionality depends on the structure of the training data. If you train the model before the change point occurs, it is likely to identify the change point. However, if the training data includes a change point, the model is likely to interpret it as a “new norm” and will not flag future data with similar values.
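
On the granularity point above, rolling sub-second readings up to a one-second grain is a single aggregation; SENSOR_RAW and its columns here are hypothetical stand-ins for your source table:

-- Hypothetical: aggregate sub-second readings to a one-second grain
create or replace view sensor_1s as
select date_trunc('second', measure_ts) as measure_ts,
       avg(temperature) as temperature
from sensor_raw
group by 1;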

Summary

By utilizing Snowflake's ML-powered functions for anomaly detection, data analysts and business professionals can fully leverage their data and make informed, data-driven decisions. Whether you are new to machine learning or an experienced practitioner, Snowflake offers a comprehensive set of powerful tools to speed up data analysis and decision-making. To complement the overall ML methodology, Snowflake has recently introduced several new machine learning capabilities that further streamline the development cycle for data scientists and machine learning engineers.

ML-powered functions are a Public Preview feature at the time of writing. However, given their user-friendly nature and their wide range of applications for different types of users, I didn't want to wait any longer to write about them.

Stay tuned for Part 3, where we will see how to train an anomaly detection model using one of the Snowflake ML Modeling classes, deploy it to the Model Registry, and build a DAG pipeline.
