Supervised Anomaly Detection: A Better Way to Model Anomalies

Justin Swansburg
9 min read · Jun 5, 2023


Standard anomaly detection models are hard to evaluate and often fail to reliably catch anomalies. Try this new supervised approach that overcomes both of these issues.


Overview

Anomaly detection modeling is a subset of unsupervised machine learning. It’s unsupervised since there’s no predetermined target or “ground truth” that we can train our model to predict.

In other words, there are no historical answers that we can learn from like we ordinarily would with supervised machine learning. Instead, our goal is to identify potentially anomalous datapoints without any real guidance or pre-existing criteria that explicitly define anomalies.

There are various anomaly detection models that apply a wide range of methodologies to spot outliers, each with its own method for identifying anomalies. Some of the most common approaches are isolation forests, Mahalanobis distance, double median absolute deviation (double MAD), local outlier factor (LOF), and autoencoders.

How They Work

The majority of these approaches work in one of two ways: they either flag univariate outliers (individual values that are sufficiently rare or far from their distribution’s center) or multivariate outliers (combinations of values that are rarely encountered in the dataset).
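To make the distinction concrete, here’s a minimal sketch (not from the original post) contrasting a univariate z-score check with a multivariate isolation forest on synthetic data:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Univariate outliers: individual values far from their column's center
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
univariate_flags = (z_scores > 3).any(axis=1)

# Multivariate outliers: uncommon *combinations* of values
iso = IsolationForest(random_state=0).fit(X)
multivariate_flags = iso.predict(X) == -1  # -1 marks outliers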

While these approaches can work well in certain instances, they can also miss a number of cases that fall outside of these scenarios. Unfortunately, in the past whenever I’ve needed to identify anomalies in practice, neither of these approaches has worked particularly well.

Challenges

So why do standard anomaly detection models tend to fall short? In my opinion, there are two primary reasons:

  • The definition of what constitutes an anomaly for the use case at hand rarely aligns with the types of anomalies these models were designed to catch
  • Without ground truth, evaluating the effectiveness of these approaches is unavoidably subjective and difficult to quantify

Let me spell out each of these challenges one by one.

What Does it Mean for Something to be Anomalous?

The answer to this question depends on the use case. Let’s imagine I’m trying to flag anomalous transactions as part of a know-your-customer (KYC) risk scoring process for a financial institution. In this scenario, I’m probably looking to identify transactions that may be fraudulent. Since fraudsters tend to change their techniques over time, I likely want to look for uncommon or unusual transactions that help me better identify fraud.

In this example, I’m not strictly looking for uncommon or unusual patterns. I wouldn’t just want to flag clients with abnormally high transaction amounts. Some people earn a lot of money! It may be rare, but it’s not necessarily fraudulent.

Similarly, I also wouldn’t just want to flag examples of high activity around Christmas as anomalous simply because this type of activity only happens once per year. People buy gifts, especially around the holidays! If anything, I’d probably be better off flagging individuals that don’t spend more money around the holidays.

As a final example of some of the issues I’ve encountered with standard anomaly detection techniques, let’s assume we work for a large telco and want to detect anomalous viewership data. For this problem, I’m interested in catching issues where our data feeds break and pass too few or too many impressions.

Standard anomaly detection models will tend to flag events like the Super Bowl or the Grammys as highly anomalous since millions of viewers tune in. This isn’t wrong, but it’s not the type of anomaly we’re looking to flag. Quite the opposite, actually: we want a model that would predict an anomaly if only a few people were to tune in to these highly popular events!

How Can We Measure Performance Without Actuals?

As mentioned previously, evaluating the performance of unsupervised anomaly detection models can be a challenge due to the absence of labeled ground truth. Still, there are a few techniques you can use to get a rough understanding of how well your model is performing.

Here are a few of them:

  1. Visual Inspection: Although this is not a quantitative measure, visually inspecting all of the data points with the highest and lowest anomaly scores can help give you a good sense of whether the model is suitable.
  2. Synthetic AUC: This metric generates two synthetic datasets from the validation sample: one made more normal and one made more anomalous. Each sample is labeled accordingly, the model calculates anomaly scores for both, and the usual ROC AUC is computed using the artificial labels as the ground truth. A rough sketch of the idea follows below.
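The exact construction varies by implementation; here’s a rough, numeric-only sketch of the idea, where pulling rows toward the column medians makes them “more normal” and pushing them away makes them “more anomalous” (the shift factor and the scoring model are my assumptions):

import numpy as np
from sklearn.metrics import roc_auc_score

def synthetic_auc(model, X_val, shift=0.5):
    """Rough synthetic AUC sketch for a numeric validation DataFrame."""
    medians = X_val.median()

    # Pull one copy toward the column medians ("more normal") and
    # push another copy away from them ("more anomalous")
    normal = X_val + shift * (medians - X_val)
    anomalous = X_val - shift * (medians - X_val)

    X_synth = np.vstack([normal, anomalous])
    y_synth = np.concatenate([np.zeros(len(normal)), np.ones(len(anomalous))])

    # e.g. sklearn's IsolationForest: score_samples is higher for normal
    # points, so flip the sign to get an anomaly score
    scores = -model.score_samples(X_synth)
    return roc_auc_score(y_synth, scores)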

Unfortunately, both of these approaches have their drawbacks. In the next section, we’ll explore how to overcome many of them by introducing a target variable. This way we can employ our tried-and-true supervised modeling techniques.

Introducing Supervised Anomaly Detection

The only thing that’s missing from our supervised equation is a target. Once we create one, we can simply throw any old supervised modeling approach at our problem and let it work its magic.

Our question then is: “How do we create a target that captures anomalies?” I’m glad someone asked. Here’s how it works:

  1. Start with your original unlabeled dataset
  2. Add a new target column called “Shuffled” that is full of zeros
  3. Make a copy of the original dataset (without the target column)
  4. Loop through each column of the copy and randomly shuffle its values, shuffling each column independently
  5. Add a “Shuffled” target column to the copy that is full of ones
  6. Stack the two datasets on top of one another to create your final labeled dataset

It’s as straightforward as that. Six simple steps to get a modeling-ready dataset you can train any number of binary classification models against!

An Example: Credit Card Application

Now it’s time to try this out with a real-world dataset. We’re going to use a credit card approvals dataset available on Kaggle.

First, we run the following code to build a new dataframe in which each column has been shuffled independently:

import pandas as pd

def shuffle_dataframe(df, random_state=1):
    """
    Shuffle each column of a pandas DataFrame independently.

    Shuffling every column with a different permutation breaks the
    relationships between columns while preserving each column's
    marginal distribution.

    Parameters:
    df (DataFrame): The input pandas DataFrame.
    random_state (int): Base seed for reproducibility.

    Returns:
    DataFrame: The shuffled DataFrame with a 'Shuffled' column of ones.
    """

    # Initialize an empty DataFrame to store shuffled data
    shuffled_df = pd.DataFrame()

    # Shuffle each column with a *different* seed. Reusing the same seed
    # for every column would apply one permutation to whole rows, leaving
    # the cross-column relationships intact.
    for i, column in enumerate(df.columns):
        shuffled_df[column] = (
            df[column]
            .sample(frac=1, random_state=random_state + i)
            .reset_index(drop=True)
        )

    # Add our target column and fill it with ones
    shuffled_df['Shuffled'] = 1

    return shuffled_df

Next, we concatenate this new dataframe with our original dataframe (labeled with zeros) to get our modeling-ready dataset with the newly created target variable.
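Putting steps 1 through 6 together might look like this (assuming df holds the raw Kaggle data):

import pandas as pd

# Label the original rows with zeros
original_df = df.copy()
original_df['Shuffled'] = 0

# The shuffled copy comes back labeled with ones from shuffle_dataframe
modeling_df = pd.concat([original_df, shuffle_dataframe(df)], ignore_index=True)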

Finally, we build a modeling pipeline to predict whether each row was shuffled. For this dataset, we’re going to run a gradient boosted model in Python with some minimal pre-processing:
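The original pipeline appears as an image; here’s a minimal scikit-learn sketch that should be roughly equivalent (the encoder choice and hyperparameters are my assumptions, not the author’s exact setup):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

X = modeling_df.drop(columns='Shuffled')
y = modeling_df['Shuffled']

# Minimal pre-processing: ordinally encode the categorical columns and
# pass the numeric columns straight through to the booster
categorical_cols = X.select_dtypes(include='object').columns.tolist()
preprocess = ColumnTransformer(
    [('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
      categorical_cols)],
    remainder='passthrough',
)

pipeline = Pipeline([
    ('prep', preprocess),
    ('gbm', HistGradientBoostingClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
pipeline.fit(X_train, y_train)
print('Holdout AUC:', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))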

Once the model is trained, we can take a look at feature impact. Interestingly, the count of family members, count of children, and number of days employed are the three most important features.
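The feature impact chart itself is an image in the original post; permutation importance is one way to produce a similar ranking (a sketch, reusing the pipeline and holdout split from above):

from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt

# How much does shuffling each feature hurt holdout AUC?
result = permutation_importance(
    pipeline, X_test, y_test, scoring='roc_auc', n_repeats=5, random_state=42
)

importances = pd.Series(result.importances_mean, index=X_test.columns).sort_values()
importances.plot.barh(title='Permutation feature importance')
plt.tight_layout()
plt.show()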

While this chart is informative, it doesn’t tell us how these features matter, only that they do.

To get a better sense of how these features relate to potential anomalies, we can dig a layer deeper and look at Shapley values. Shapley values are row-level prediction explanations that detail how each feature contributes to a row’s overall anomaly score. We can overlay these explanations on top of our dataframe:
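The colored dataframe shown in the original post is an image; a rough approximation with the shap package might look like this (assuming the pipeline above and a shap version that supports scikit-learn’s histogram-based boosting):

import shap

# Explain the booster on the pre-processed features
gbm = pipeline.named_steps['gbm']
prep = pipeline.named_steps['prep']
X_trans = pd.DataFrame(prep.transform(X_test), columns=prep.get_feature_names_out())

explainer = shap.TreeExplainer(gbm)
shap_values = explainer.shap_values(X_trans)

# Color each cell of the (encoded) dataframe by its SHAP contribution:
# warm cells push the row's anomaly score up, cool cells push it down
shap_df = pd.DataFrame(shap_values, columns=X_trans.columns)
styled = X_trans.head(20).style.background_gradient(
    axis=None, gmap=shap_df.head(20), cmap='coolwarm'
)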

My favorite part about these Shapley-colored dataframes is how easy they make it to visualize interactions. Take a look at the NAME_INCOME_TYPE column. The fact that “Commercial associate” is sometimes red (third row) and sometimes blue (third-to-last row) means that this feature must be interacting with other features.

In this case, it makes perfect sense. The commercial associate applicant making $135,000/year is less likely to be an anomaly, whereas the commercial associate making $279,000/year is more likely to be anomalous.

Two-way Partial Dependence Plots

We can explore the interactions our model learned further by plotting two-way partial dependence plots.

Before we even dive into two-way plots, let’s start with a single partial dependence curve for age.
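Here’s a sketch of how to draw such a curve with scikit-learn; in the Kaggle data, age is stored as DAYS_BIRTH (negative days), so the feature name here is an assumption about that dataset:

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# kind='both' overlays the average partial dependence curve on the
# individual conditional expectation (ICE) curves
PartialDependenceDisplay.from_estimator(
    pipeline, X_test, features=['DAYS_BIRTH'], kind='both'
)
plt.show()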

The solid black line is our partial dependence curve. The lighter grey lines are the individual conditional expectation (ICE) curves. We can see that the likelihood a credit card application is anomalous decreases as applicants get older, until they hit their 40s, after which the likelihood starts to increase.

Interestingly, even though there is a clear U-shaped trend, there’s lots of variance. In technical speak, the curves are jumping up and down like House of Pain.

Strong interactions could definitely be causing all this commotion. Let’s see if splitting a partial dependence curve for income by occupation type can help explain some of this variance:
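scikit-learn doesn’t split partial dependence by a categorical out of the box; one sketch is to compute a separate curve per occupation subset (column names assumed from the Kaggle data, and the grid_values key assumes a recent scikit-learn):

from sklearn.inspection import partial_dependence
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for occupation in X_test['OCCUPATION_TYPE'].dropna().unique():
    subset = X_test[X_test['OCCUPATION_TYPE'] == occupation]
    if len(subset) < 50:
        continue  # skip tiny groups where the grid is unstable
    pd_result = partial_dependence(pipeline, subset, features=['AMT_INCOME_TOTAL'])
    ax.plot(pd_result['grid_values'][0], pd_result['average'][0], label=occupation)

ax.set_xlabel('AMT_INCOME_TOTAL')
ax.set_ylabel('Partial dependence (anomaly probability)')
ax.legend(fontsize='small')
plt.show()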

Turns out it absolutely can. As applicants earn higher incomes, the probability we flag their applications as anomalous increases more for laborers than for any other occupation.

In layman’s terms, working as a laborer isn’t necessarily anomalous. Working as a laborer making nearly half a million dollars a year however…

Let’s try looking at other combinations of features. How about age and housing type:
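A sketch of the two-way plot, assuming a recent scikit-learn (1.2+) that supports categorical features in partial dependence:

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Two-way partial dependence: age crossed with housing type
PartialDependenceDisplay.from_estimator(
    pipeline,
    X_test,
    features=[('DAYS_BIRTH', 'NAME_HOUSING_TYPE')],
    categorical_features=['NAME_HOUSING_TYPE'],
)
plt.show()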

This is a fun one as well. Starting around age 35, as applicants get older our model predicts their applications are more likely to be anomalous if they still live with their parents. Unless you’re Will Ferrell in Wedding Crashers, this is probably true.

Alright, let’s look at one final example with occupation and gender:

A female applicant listing driver as her occupation is more anomalous than a male driver. I’m no labor expert, but I could believe that women are less likely to work as drivers.

Extra Credit

Anomaly detection modeling is tricky. Even with the supervised approach described above, you can still miss anomalies.

Typically, with these types of problems, reducing false negatives tends to carry far more weight than reducing false positives. In other words, it’s more important to correctly identify as many anomalies as possible, even if that means we incorrectly flag more non-anomalies as anomalous.

In my experience, the best way to achieve this goal is to combine multiple models into an ensemble. As mentioned previously, all of these anomaly detection models have their own ways of flagging anomalous data. Ex ante, it’s difficult to know which of these methods best aligns with our problem’s definition of an anomaly. In fact, it may even be the case that none of the models perfectly flags anomalies the way we want.

Since we want to be conservative, our ensemble can just take the max score across our underlying models. This way we capture every anomaly our models flag, rather than letting some slip through the cracks.
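A minimal sketch of that max-score ensemble, assuming each model’s scores are first min-max scaled so they’re comparable:

import numpy as np

def max_ensemble(score_arrays):
    """Take the max anomaly score across models after min-max scaling."""
    scaled = []
    for scores in score_arrays:
        scores = np.asarray(scores, dtype=float)
        scaled.append((scores - scores.min()) / (scores.max() - scores.min() + 1e-12))
    # A row is as anomalous as the most alarmed model says it is
    return np.max(np.stack(scaled), axis=0)

# e.g. combine the supervised model with an isolation forest:
# ensemble_scores = max_ensemble([
#     pipeline.predict_proba(X)[:, 1],   # supervised "shuffled" probability
#     -iso_forest.score_samples(X),      # isolation forest, flipped so higher = more anomalous
# ])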

That’s a wrap for this post. Give this technique a shot and let me know how it goes. Follow me on Medium and LinkedIn for more helpful data science tips and tricks. Thanks for reading!


Justin Swansburg

Data scientist and AI/ML leader | VP, Applied AI @ DataRobot.