Fighting Fraud with Anomaly Detection

Mars Xiang
Published in The Startup
5 min read · May 23, 2020
Source: trajaner on Pixabay

What could you do with 800 million dollars? What could the government do with 800 million dollars? 800 million dollars is the financial loss caused by credit card fraud in Canada every year.

When a fraud occurs, information about the victim can be used against them. Passwords, personal identification numbers, and sometimes even the physical credit card can be stolen. However, the one thing that cannot be stolen is behavior.

A thief will often use the stolen credit card for their own purposes, and their purchasing patterns will differ from the owner's. To demonstrate this, imagine that we have collected data on a credit card for the past few years. Suddenly, a new example, denoted with a red circle, appears.

We can tell that the new example is probably fraudulent, based on the difference in behavior.

Gaussian Anomaly Detection

How do computers detect when a data point is different from the rest? The answer is anomaly detection. Falling in the category of semi-supervised learning, Gaussian anomaly detection fits a Gaussian distribution to a previous data set of non-fraudulent (non-anomalous) data, then computes the probability of a new, possibly fraudulent (anomalous), data point under that distribution.

The probability density of a variable that follows a Gaussian distribution with mean μ and standard deviation 1. Source: Wikipedia

To perform Gaussian anomaly detection with a single variable:

  • Find the mean of all the data points. This is denoted with the Greek letter μ.
  • Calculate the variance of the data set, denoted with σ²:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu\right)^2$$
Where m is the number of data points, and x_i is the ith value in the data set for a single variable x. The standard deviation, denoted with σ, is the square root of the variance.
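  • To calculate the probability of a new data point x under a Gaussian distribution with the parameters μ and σ²:
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Where the probability (strictly, the probability density) of the new data point is denoted P(x)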
  • If P is less than some constant ε that we choose, we classify it as an anomaly.
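The single-variable steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production fraud detector: the purchase amounts, the threshold `epsilon`, and the function names are all hypothetical, and the variance divides by m as in the steps above.

```python
import math
import statistics


def fit_gaussian(values):
    """Return (mu, sigma2), where sigma2 is the population variance (divides by m)."""
    mu = statistics.fmean(values)
    sigma2 = statistics.pvariance(values, mu)
    return mu, sigma2


def gaussian_density(x, mu, sigma2):
    """Evaluate the Gaussian density with mean mu and variance sigma2 at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the chosen threshold."""
    return gaussian_density(x, mu, sigma2) < epsilon


# Hypothetical history of purchase amounts for one card (assumed non-fraudulent).
amounts = [20.0, 22.0, 19.0, 21.0, 18.0, 20.0]
mu, sigma2 = fit_gaussian(amounts)

epsilon = 1e-3  # hypothetical threshold; in practice it has to be tuned

print(is_anomaly(21.0, mu, sigma2, epsilon))  # False: close to the usual amounts
print(is_anomaly(95.0, mu, sigma2, epsilon))  # True: far out in the tail
```

The choice of ε controls the trade-off: a larger value flags more points, including more legitimate ones.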

This gives a Gaussian curve centered at μ and stretched by σ. We do not want to be limited to only one variable, so there are a few ways to get around this.

If we assume the variables are independent of each other, we can extend P(x) to more variables by fitting μ and σ² for every variable and taking the product of the probabilities of each variable.

  • Compute μ and σ² with the formulas above for each individual variable.
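  • To calculate the probability of a new data point x, given a vector μ and a vector σ²:
$$P(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
Where μj and σ²j represent the mean and variance of the jth variable, xj is the jth component of the new data point, and n is the dimension of a data point.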
  • If P is less than some constant ε that we choose, we classify it as an anomaly.
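As a rough sketch of the independent-variable approach, the per-variable fit and the product of probabilities might look like this with NumPy. The data matrix, the threshold `epsilon`, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_independent_gaussians(X):
    """Per-variable mean and variance for an (m, n) data matrix."""
    X = np.asarray(X, dtype=float)
    # np.var divides by m by default, matching the variance formula used here.
    return X.mean(axis=0), X.var(axis=0)


def independent_density(x, mu, sigma2):
    """Product over the n variables of the univariate Gaussian densities."""
    x = np.asarray(x, dtype=float)
    per_variable = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return float(np.prod(per_variable))


# Hypothetical data: column 0 is purchase amount, column 1 is purchases per day.
X = np.array([
    [20.0, 1.0],
    [22.0, 2.0],
    [19.0, 1.0],
    [21.0, 3.0],
    [18.0, 2.0],
    [20.0, 3.0],
])
mu, sigma2 = fit_independent_gaussians(X)

epsilon = 1e-4  # hypothetical threshold

print(independent_density([20.5, 2.0], mu, sigma2) < epsilon)  # False: typical
print(independent_density([95.0, 9.0], mu, sigma2) < epsilon)  # True: anomalous
```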

When we fit μ and σ² to our previous example, we get a contour graph that looks like this:

When we fit a Gaussian distribution model to our data set based on μ and σ², most of the probability is heavily concentrated in an ellipse around our data set. The warmer colors in this example represent a higher probability. The new example has a probability very close to zero, so we would classify it as an anomaly.

Multivariate Anomaly Detection

The previous approach, where we multiplied the probabilities of each variable, will not do very well if the variables are correlated, since it can only stretch the probabilities in a horizontal or vertical direction, but not a diagonal one:

Although the data set seems to be going in a straight line, the ellipse cannot stretch in a diagonal direction with the previous method. The anomalous example is classified as non-anomalous.

Instead, we need to involve the covariance matrix, which, as the name suggests, describes the covariance between variables, a measure of how they vary together. To perform multivariate Gaussian anomaly detection:

  • Compute μ for each variable.
  • Compute the covariance matrix, denoted by the Greek letter Σ:
$$\Sigma = \frac{1}{m} (x - \mu)^{T} (x - \mu)$$
Where x is an m*n matrix of our whole data set (with μ subtracted from every row), m is the number of data points, n is the number of dimensions for each data point, and μ is an n-dimensional vector.
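  • To calculate the probability of a new data point x, given a vector μ and a matrix Σ:
$$P(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)$$
Where x is now a single n-dimensional data point, |Σ| is the determinant of Σ, and Σ^-1 is the inverse of Σ. Note that the expression inside the exponent of e is a series of matrix multiplications of 1*n, n*n, and n*1 matrices, which ultimately end up as a single real number.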
  • If P is less than some constant ε that we choose, we classify it as an anomaly.
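The multivariate steps can be sketched with NumPy as follows. The correlated data set, the test points, the threshold `epsilon`, and the function names are hypothetical. For comparison, the sketch also evaluates the per-variable product from the earlier approach on the same off-diagonal point, to illustrate why correlated data calls for the covariance matrix.

```python
import numpy as np


def fit_multivariate_gaussian(X):
    """Return the mean vector and covariance matrix (dividing by m) of an (m, n) array."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    centered = X - mu  # subtract mu from every row
    Sigma = centered.T @ centered / X.shape[0]
    return mu, Sigma


def multivariate_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at a single n-dimensional point x."""
    diff = np.asarray(x, dtype=float) - mu
    n = mu.shape[0]
    # Solve Sigma z = diff instead of forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm)


def independent_density(x, mu, sigma2):
    """Per-variable product, which ignores correlations between variables."""
    x = np.asarray(x, dtype=float)
    per_variable = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return float(np.prod(per_variable))


# Hypothetical, strongly correlated data lying close to a diagonal line.
X = np.array([
    [1.0, 1.1], [2.0, 1.9], [3.0, 3.2],
    [4.0, 3.8], [5.0, 5.1], [6.0, 5.9],
])
mu, Sigma = fit_multivariate_gaussian(X)

epsilon = 1e-3            # hypothetical threshold
on_trend = [3.5, 3.5]     # follows the diagonal pattern
off_trend = [2.0, 5.0]    # each coordinate is in range, but the pair breaks the pattern

p_on = multivariate_density(on_trend, mu, Sigma)
p_off = multivariate_density(off_trend, mu, Sigma)
p_off_independent = independent_density(off_trend, mu, np.diag(Sigma))

print(p_on < epsilon)               # False: normal
print(p_off < epsilon)              # True: flagged by the multivariate model
print(p_off_independent < epsilon)  # False: missed by the per-variable product
```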

It turns out that if we ran this exact setup on a data set with no correlations between variables, which makes Σ a diagonal matrix, we would get the same result as taking the product of the probabilities of the individual variables. When we fit μ and Σ to our data set, we get this graph, where the new case is correctly given a very low probability under the fitted Gaussian distribution, and is classified accordingly:

“Anomalous!”
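The remark that a diagonal Σ reproduces the product of the per-variable probabilities can be checked numerically. The sketch below uses made-up values for μ, σ², and the test point; the function names are hypothetical.

```python
import numpy as np


def independent_density(x, mu, sigma2):
    """Product over variables of univariate Gaussian densities."""
    per_variable = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return float(np.prod(per_variable))


def multivariate_density(x, mu, Sigma):
    """Multivariate Gaussian density at a single point x."""
    diff = x - mu
    n = mu.shape[0]
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm)


# Made-up parameters: with no correlation, Sigma holds the variances on its diagonal.
mu = np.array([20.0, 2.0])
sigma2 = np.array([1.5, 0.5])
Sigma = np.diag(sigma2)
x = np.array([21.0, 2.5])

p_product = independent_density(x, mu, sigma2)
p_multi = multivariate_density(x, mu, Sigma)

print(np.isclose(p_product, p_multi))  # True: the two formulas agree
```

The agreement follows because the determinant of a diagonal Σ is the product of the variances, and the exponent splits into one term per variable.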

Anomaly Detection Applications and Usage

Anomaly detection can be applied in a variety of different situations, such as monitoring the health of a computer system, fault detection, and detecting ecosystem disturbances.

However, the question of when to use anomaly detection or logistic regression as a classifier still exists.

  • Anomaly detection, whether Gaussian or another type, is mainly used when the data set contains a small number of anomalies, or sometimes none at all, and a large number of normal examples.
  • Other models, like logistic regression or neural networks, should be used when there are approximately as many anomalies as non-anomalies, or at least when the two counts are of the same order of magnitude.

Therefore, what makes anomaly detection such a valuable tool is that the user does not need a lot of data about anomalous examples. In many of the applications, there are types of anomalies that we have never seen, and have no way of detecting with other learning models.

Summary

  • Anomaly detection finds peculiar data points that do not match the rest.
  • Univariate Gaussian anomaly detection uses the mean and variance of a data set to calculate the probability of a new data point under a normal distribution.
  • Multivariate Gaussian anomaly detection uses the covariance matrix to calculate the probability of a new data point, considering correlations between variables.
  • Anomaly detection is well suited to cases where anomalies are rare or of previously unseen types.
