Detecting Anomalies with Snowflake Cortex ML-based Functions

Using anomaly detection to identify outliers and alert when time series data deviates from the usual trends

Photo by Kai Pilger on Pexels

In my previous post about Forecasting with Snowflake Cortex ML-based Functions, I explored how to use the forecasting functionality in SQL from Snowsight worksheets. Snowflake Cortex ML-based functions can be used by anyone, not just machine learning experts.

In this post, I want to look at the Anomaly Detection functionality, which is also part of Snowflake Cortex ML-based functions, as described in the Snowflake documentation. This functionality is related to forecasting, but goes further: it compares actual values against the forecast and lets you know when a value is an outlier from the expected value.

I used the New York City yellow taxi trip records from the TLC Trip Record Data as my sample data. These records include columns like pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. I downloaded parquet files for three months, from October 2023 until December 2023.

Preparing the environment and sample data

To prepare an environment for testing the functionality, I created a custom role, a warehouse, a database, a schema, a stage, and granted privileges:

-- create a custom role
use role useradmin;
create role data_scientist;

-- grant the role to myself so I can use it
grant role data_scientist to user <my current user>;

-- create a warehouse, database, schema, and grant privileges
use role sysadmin;
create database ml_anomaly_db;
grant usage on database ml_anomaly_db to role data_scientist;
create schema nyctaxi;
grant all on schema nyctaxi to role data_scientist;
create warehouse data_science_wh with warehouse_size = 'xsmall';
grant usage on warehouse data_science_wh to role data_scientist;

-- grant the create anomaly detection privilege
grant create SNOWFLAKE.ML.ANOMALY_DETECTION
on schema ml_anomaly_db.nyctaxi
to role data_scientist;

-- continue working with the newly created role
use role data_scientist;
use warehouse data_science_wh;
use database ml_anomaly_db;
use schema nyctaxi;

-- create an internal stage
create stage nyctaxi_stg;

I manually uploaded the three yellow taxi trip record files (yellow_tripdata_2023-10.parquet, yellow_tripdata_2023-11.parquet, yellow_tripdata_2023-12.parquet) to the nyctaxi_stg internal stage via the Snowsight user interface.

Then I created a parquet file format, a target table to store the data from the parquet files using the infer_schema function, and copied the data from the files into the table:

-- create a file format
create or replace file format my_parquet_format type = 'parquet';

-- create the target table using the INFER_SCHEMA function
create table yellow_taxi_stg
using template (
select array_agg(object_construct(*))
from table(
infer_schema(
location => '@nyctaxi_stg',
file_format => 'my_parquet_format',
ignore_case => TRUE
)
)
);

-- copy the data from the staged files into the target table
copy into yellow_taxi_stg
from @nyctaxi_stg
match_by_column_name = case_insensitive
file_format = 'my_parquet_format';

After a quick look at the data, I had to do some cleansing before I could use it, like converting the pick-up and drop-off datetime columns to the Snowflake timestamp data type. Although the data I downloaded was for October, November, and December 2023, there were some stray records with dates outside of this time period, so I removed them:

-- clean the data
create or replace table yellow_taxi as
select * exclude (tpep_pickup_datetime, tpep_dropoff_datetime),
to_timestamp(tpep_pickup_datetime/1000000) as pickup_ts,
to_timestamp(tpep_dropoff_datetime/1000000) as dropoff_ts
from yellow_taxi_stg
where to_date(pickup_ts) >= '2023-10-01'
and to_date(dropoff_ts) >= '2023-10-01'
and to_date(pickup_ts) <= '2023-12-31'
and to_date(dropoff_ts) <= '2023-12-31';

Using the clean data, I created a visualization of the fare amount by drop-off date:

-- summarize fare amount by drop-off date
select to_date(dropoff_ts) as dropoff_dt,
sum(fare_amount) as sum_fare_amount
from yellow_taxi
group by dropoff_dt;
Fare amount by drop-off date

From the diagram, it’s easy to spot a weekly seasonality in the data. There is a noticeable decline in the fare amount, indicating fewer taxi trips, on Thanksgiving (November 23) and Christmas (December 25).
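The weekly pattern can also be confirmed numerically. As a quick illustrative check (not part of the original workflow), the daily totals can be averaged by day of week:

```sql
-- average daily fare total by day of week (illustrative check)
with daily as (
select to_date(dropoff_ts) as dropoff_dt,
sum(fare_amount) as sum_fare_amount
from yellow_taxi
group by dropoff_dt
)
select dayname(dropoff_dt) as day_of_week,
avg(sum_fare_amount) as avg_fare_amount
from daily
group by day_of_week
order by min(dayofweek(dropoff_dt));
```

If the weekly seasonality is real, the averages should differ noticeably between weekdays and weekends.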

Unsupervised Anomaly Detection

To demonstrate the anomaly detection functionality, I will start with the unsupervised approach where I don’t know anything about existing anomalies in my data.

I will take the fare amount by drop-off date data from October 2023 and November 2023 as training data. Then I will use the anomaly detection functionality to forecast the fare amount for December 2023, and at the same time compare the forecast values with the actual values and flag any anomalies.

First, I will create two views, called yellow_taxi_historical_data, containing the fare amount summarized by drop-off day from October 2023 and November 2023, and yellow_taxi_new_data, containing the fare amount summarized by drop-off day from December 2023:

-- historical data
create or replace view yellow_taxi_historical_data as
select to_date(dropoff_ts)::timestamp as dropoff_dt,
sum(fare_amount) as sum_fare_amount
from yellow_taxi
where to_date(dropoff_ts) <= '2023-11-30'
group by dropoff_dt;

-- new data
create or replace view yellow_taxi_new_data as
select to_date(dropoff_ts)::timestamp as dropoff_dt,
sum(fare_amount) as sum_fare_amount
from yellow_taxi
where to_date(dropoff_ts) >= '2023-12-01'
group by dropoff_dt;

Next, I will train an anomaly detection model named nyctaxi_model. The model takes the view containing the historical data (October 2023 and November 2023) as the input data. Just like the forecasting model, it requires a timestamp column, in this case the drop-off date, and a column containing the target data to analyze, in this case the fare amount. The label_colname parameter is passed as an empty string because I'm working with an unsupervised model:

-- train model
create or replace SNOWFLAKE.ML.ANOMALY_DETECTION nyctaxi_model(
input_data => SYSTEM$REFERENCE('VIEW', 'yellow_taxi_historical_data'),
timestamp_colname => 'dropoff_dt',
target_colname => 'sum_fare_amount',
label_colname => '');

The trained model is stored as an object in the current schema.

I will now use this model to predict future values and flag any anomalies by calling the DETECT_ANOMALIES function. The parameters to this function are the view containing the new data (December 2023), and again the names of the timestamp and target data columns to analyze. The output of the model is displayed in the worksheet after executing the command. To save the output to a table so I can view it later, I will create a table named nyctaxi_model_anomalies:

-- calculate anomalies on new data
call nyctaxi_model!DETECT_ANOMALIES(
input_data => SYSTEM$REFERENCE('VIEW', 'yellow_taxi_new_data'),
timestamp_colname =>'dropoff_dt',
target_colname => 'sum_fare_amount'
);

-- save the output to a table
create or replace table nyctaxi_model_anomalies as
select * from table(result_scan(last_query_id()));

The output of the DETECT_ANOMALIES function contains columns similar to those returned by the FORECAST function used with the forecasting model: the timestamp of the predicted value, the predicted value, the prediction lower bound, and the prediction upper bound, as well as an additional Boolean column named is_anomaly. This column contains TRUE or FALSE, depending on whether the model considers the data point an anomaly.
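For example, to list only the flagged data points together with their expected range, one can filter on the is_anomaly column of the saved output table:

```sql
-- show only the data points flagged as anomalies
select ts, forecast, lower_bound, upper_bound
from nyctaxi_model_anomalies
where is_anomaly;
```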

To understand the result of the model, I will plot the historical values, the new values, the model predicted values, and the anomaly flag by constructing the following query:

-- combine historical and predicted values in a single query
select
to_date(dropoff_ts) as dropoff_dt,
sum(fare_amount) as sum_fare_amount,
null as forecast,
null as is_anomaly
from yellow_taxi
group by dropoff_dt
union all
select
ts as dropoff_dt,
null as sum_fare_amount,
forecast,
-- using a value high enough that it is visible on the diagram
case when is_anomaly then 1000000 end as is_anomaly
from nyctaxi_model_anomalies;
Fare amount by drop-off date (actual, forecast, and anomaly)

Let’s see what the diagram shows. The blue line represents the actual values of the fare amount by day for the three months. The yellow line represents the forecast value as calculated by the model. The turquoise line indicates the anomalies, where TRUE is replaced by a value high enough so that it is visible on the diagram.

The model flagged only one date, December 25, 2023, as an anomaly. This makes perfect sense. As is visible on the diagram, there was a significant decline in the fare amount on this day. The model didn’t predict this decline because it didn’t know about the Christmas holiday on December 25. But it noticed that the value deviates significantly from the predicted value and marked it as an anomaly.

Supervised Anomaly Detection

Looking at the diagram more closely, there is a decline in the fare amount in the historical data (blue line) on November 23, 2023, which was the Thanksgiving holiday:

Decline in the fare amount on November 23, 2023

To improve the accuracy of the anomaly detection model, I will now label the training data by adding an additional Boolean column that indicates whether the record is a known anomaly, like the Thanksgiving holiday. This will help the anomaly detection model to avoid overfitting to such anomalies.

I will create a new view named yellow_taxi_historical_data_with_labels, just like yellow_taxi_historical_data but with an additional column named label that has a value of TRUE on November 23 and FALSE otherwise:

-- historical data with labels

create or replace view yellow_taxi_historical_data_with_labels as
select to_date(dropoff_ts)::timestamp as dropoff_dt,
sum(fare_amount) as sum_fare_amount,
case when to_date(dropoff_ts) = '2023-11-23'
then TRUE
else FALSE
end as label
from yellow_taxi
where to_date(dropoff_ts) <= '2023-11-30'
group by dropoff_dt, label;

Then I will train a new model called nyctaxi_model_with_labels using the historical table with labels:

-- train model
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION nyctaxi_model_with_labels(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'yellow_taxi_historical_data_with_labels'),
TIMESTAMP_COLNAME => 'dropoff_dt',
TARGET_COLNAME => 'sum_fare_amount',
LABEL_COLNAME => 'label');

Same as previously, I will use this model to predict future values and flag any anomalies by calling the DETECT_ANOMALIES function, using the view containing the new data (December 2023). I will save the output of the model to a table named nyctaxi_model_anomalies_with_labels:

-- calculate anomalies on new data
call nyctaxi_model_with_labels!DETECT_ANOMALIES(
input_data => SYSTEM$REFERENCE('VIEW', 'yellow_taxi_new_data'),
timestamp_colname =>'dropoff_dt',
target_colname => 'sum_fare_amount'
);

-- save the output to a table
create or replace table nyctaxi_model_anomalies_with_labels as
select * from table(result_scan(last_query_id()));

Once again I will plot the historical values, the new values, the model predicted values, and the anomaly flag of both the historical data and the predicted data by constructing the following query:

-- combine historical and predicted values in a single query
select to_date(dropoff_ts) as dropoff_dt,
sum(fare_amount) as sum_fare_amount,
null as forecast,
case when to_date(dropoff_ts) = '2023-11-23' then 1000000 end as is_anomaly
from yellow_taxi group by dropoff_dt
union all
select ts as dropoff_dt,
null as sum_fare_amount,
forecast, case when is_anomaly then 1000000 end as is_anomaly
from nyctaxi_model_anomalies_with_labels;
Fare amount by drop-off date (actual, forecast, and anomaly)

The new model still flags December 25, 2023 as an anomaly, which is correct. But by providing labeled historical data where I indicated that November 23, 2023 is an anomaly, the model predicted somewhat different values. To compare the details, I constructed a query that combines the forecast values from both models, the unsupervised one and the supervised one trained on labeled data, and displayed the values on a diagram:

select ts as dropoff_dt, 
forecast as sum_fare_amount_unsupervised,
null as sum_fare_amount_with_labels
from nyctaxi_model_anomalies
union all
select ts as dropoff_dt,
null as sum_fare_amount_unsupervised,
forecast as sum_fare_amount_with_labels
from nyctaxi_model_anomalies_with_labels;
Comparison of forecast values between an unsupervised and a labeled model

The diagram shows that the forecast from the unsupervised model (blue line) dips much lower on December 22, because of the low value on November 23. The forecast from the model trained on labeled data (yellow line) is higher, because labeling November 23 as an anomaly reduced that value's influence on the forecast.

Practical use

Now that I have demonstrated the anomaly detection functionality, let's discuss how it differs from the forecasting functionality. In forecasting, we use historical time series data to predict future values. In anomaly detection, we also use historical time series data to predict future values, but the actual values for the forecast period are already known, so the model can compare the predicted and the actual values and flag any outliers or inconsistencies, referred to as anomalies.

What is the practical use of anomaly detection? In data warehousing, we often ingest data that can be expressed as time series, for example:

  • Banking: number of transactions by various categories, transaction amounts, etc.
  • Telecommunications: number of phone calls, duration of phone calls, amount of data transfer, etc.
  • Retail: number of sales, amount of sales, stock levels, etc.
  • Manufacturing: number of produced parts, number of rejects, etc.

When ingesting large amounts of data frequently, we want to be confident that all data has been ingested and that we haven't missed or duplicated any. The traditional approach to automated testing usually involves summarizing the values by a chosen time period and comparing them with the preceding period. Assuming that the data distribution is continuous, we can alert when the difference between two consecutive periods exceeds a predefined percentage. This works well when the data has a smooth, continuous distribution.
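A sketch of this traditional approach, using the taxi data from this post for illustration (the 30% threshold is an arbitrary value chosen for the example):

```sql
-- traditional check: compare each day's total with the previous day
-- and alert when the change exceeds a fixed threshold (30% here,
-- an arbitrary illustrative value)
with daily as (
select to_date(dropoff_ts) as dropoff_dt,
sum(fare_amount) as sum_fare_amount
from yellow_taxi
group by dropoff_dt
)
select dropoff_dt,
sum_fare_amount,
lag(sum_fare_amount) over (order by dropoff_dt) as prev_amount,
abs(sum_fare_amount - lag(sum_fare_amount) over (order by dropoff_dt))
/ lag(sum_fare_amount) over (order by dropoff_dt) > 0.3 as alert
from daily;
```

On data with a weekly pattern, this check would raise an alert on every Sunday-to-Monday transition, illustrating the false-positive problem described next.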

When the data has more seasonality and fluctuating trends, this approach doesn't always work and may produce too many false positives, alerting even when the data is fine. By using the anomaly detection algorithm, we can verify that the data ingested in the most recent period matches the values expected by the model based on history, giving more accurate results: fewer false positives and more true positives when alerting on genuine anomalies.

Another valuable application of the anomaly detection algorithm is to monitor the Snowflake usage in your account and alert when the usage deviates from the predicted values based on past trends.
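As a minimal sketch of this idea, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share (the view and model names here are hypothetical, not from the original workflow), one could train a model on daily credit consumption:

```sql
-- hypothetical sketch: model daily credit consumption
-- (requires privileges on the SNOWFLAKE.ACCOUNT_USAGE share)
create or replace view daily_credits as
select date_trunc('day', start_time)::timestamp as usage_dt,
sum(credits_used) as sum_credits
from snowflake.account_usage.warehouse_metering_history
where start_time >= dateadd('day', -90, current_date)
group by usage_dt;

create or replace SNOWFLAKE.ML.ANOMALY_DETECTION usage_model(
input_data => SYSTEM$REFERENCE('VIEW', 'daily_credits'),
timestamp_colname => 'usage_dt',
target_colname => 'sum_credits',
label_colname => '');
```

Recent usage could then be checked with usage_model!DETECT_ANOMALIES, exactly as in the taxi example, and an alert raised whenever is_anomaly is TRUE.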

I’m Maja Ferle, Snowflake Data Superhero and a senior consultant at In516ht. You can get in touch with me on LinkedIn.
