Machine Learning for Anomaly Detection: The Mathematics Behind It!

indira · Published in SRM MIC · 6 min read · Sep 25, 2020

Machine Learning has applications spread across a wide variety of domains. Today I’ll write about one such real-life application of Machine Learning, which is extensively used to detect defective items in a mixture of both defective and non-defective items.

Before we jump into the algorithm, let’s brush up on some basic statistics.

Gaussian Distribution

A Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. It is a bell-shaped curve that describes the probability distribution of a variable, say X, and is parameterized by the distribution’s mean and variance.

The probability density of the distribution is given by the following formula:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

The value of f(x) is found by substituting the mean, the variance, and a particular value of x into the above equation. After plotting x on the x-axis and f(x) on the y-axis, the graph looks something like the following image.

Example Graph

The x value corresponding to the highest point on the graph is the mean of the distribution, and the greater the value of σ (the standard deviation, i.e. the square root of the variance), the more spread out the curve is. The case shown here is the standard normal distribution, with mean = 0 and standard deviation = 1.
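As a quick illustration (this snippet is my own sketch, not part of the original article), here is how you could evaluate this density in Python with NumPy:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2, evaluated at x."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Standard normal distribution: mean = 0, variance = 1
x = np.linspace(-4, 4, 9)
print(gaussian_pdf(x, mu=0.0, sigma2=1.0))  # peaks at x = 0 with value ~0.399
```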

But why do we need it?

We’ll come to the answer shortly, but before that let’s look at how our training set with m examples and n features is represented:

Training set = {x^(1), x^(2), …, x^(m)}

where each x^(i) is an n-dimensional vector of feature values.
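In code, such a training set is simply an m × n matrix; the numbers below are hypothetical and only the shape matters:

```python
import numpy as np

# Hypothetical training set: m = 5 examples, n = 2 features x(1) and x(2)
X = np.array([
    [14.1, 2.3],
    [13.8, 2.1],
    [14.4, 2.4],
    [13.9, 2.2],
    [14.0, 2.3],
])
m, n = X.shape  # m = 5, n = 2; each row is one n-dimensional example x
```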

Now let’s plot the examples against two of the features, say x(1) and x(2), where most of the points correspond to non-anomalous examples (suppose the graph looks something like this):

Image Courtesy: Coursera (Machine Learning by Andrew Ng, Lecture 15)

Here we see that most of the points are concentrated around the center, i.e. the density is highest around the center and decreases as we move away. From this observation we can infer that the probability of a point being a non-anomalous example increases as we approach the center; in other words, it increases as the density of the other non-anomalous examples under features x(1) and x(2) increases. So when we are given two more points that lie somewhere like this in the following graph:

Image Courtesy: Coursera (Machine Learning by Andrew Ng, Lecture 15)

We can guess that one of them lies in the higher-density region and thus indicates a non-anomalous example, whereas the other one, which is placed far away, indicates an anomalous one.

Now let’s find the answer to our question above. The Gaussian distribution is a fundamental part of statistics and is used in a great many mathematical problems; anomaly detection is one of them. We assume that every feature follows a Gaussian distribution (in the ideal case). So we plot the probability distribution, i.e. the density estimate, of a particular feature over its m examples, using their mean and variance. Points that fall in the low-density tails of this curve are then treated as the anomalous cases.

The red cross marks are our anomalous cases, and the curve is the Gaussian distribution of a particular feature over its m examples. As mentioned above, we calculate the value of f(x) for a particular value of x from the density formula given earlier,

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

where μ and σ² are the mean and variance respectively.

Choosing which features to use

As previously discussed, we assume that every feature (over its examples) follows a Gaussian distribution. But this holds only in the ideal case. In real-world datasets there are a large number of features to choose from, and not all of them follow this distribution. So it is very important to choose only those features that do follow a Gaussian distribution. We can either select them directly after plotting their graphs, or we can transform a non-Gaussian feature and its examples in such a way that the transformed feature follows our distribution.

For example, if a particular feature x(j) does not give us a Gaussian distribution, we can transform it, for instance by squaring it (x(j)²) or by taking its logarithm (log(x(j))), to get a graph that looks Gaussian.
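Here is a small sketch of that idea using a synthetic, right-skewed feature (the data and the choice of transforms below are illustrative assumptions on my part):

```python
import numpy as np

# Hypothetical right-skewed feature x_j that does not look Gaussian
rng = np.random.default_rng(0)
x_j = rng.exponential(scale=2.0, size=1000)

# Candidate transforms mentioned above; inspect histograms (e.g. with
# matplotlib) and keep whichever looks closest to a bell curve.
x_sq = x_j ** 2                  # squaring the feature
x_log = np.log(x_j + 1e-6)       # log transform (small offset guards against log(0))
```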

Now let’s put everything together and come up with the algorithm!

We have our training set with n features, containing mostly non-anomalous examples. We want to find the probability distribution of each feature and combine them to get a single function that best fits our training set.

To do this, we first find the parameters μ and σ² for each and every feature in our training set. We calculate the mean (μ) and variance (σ²) from the following formulas and fit these parameters to our model:

μ_j = (1/m) Σ_{i=1..m} x_j^(i)

σ_j² = (1/m) Σ_{i=1..m} (x_j^(i) − μ_j)²

Here i indexes the examples (i = 1, …, m) and j indexes the features (j = 1, …, n). We then calculate f(x), the probability density, for every feature. Since the features are assumed to be unrelated to each other, they are treated as independent observations. Hence, we multiply their densities together to get the value of p(x), also called the likelihood, which will in turn be used to determine our result. That is:

p(x) = Π_{j=1..n} p(x_j; μ_j, σ_j²)
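A minimal NumPy sketch of this fitting step might look like the following (the function name is my own, not from the original article):

```python
import numpy as np

def fit_gaussian_params(X):
    """Fit a per-feature Gaussian: mu_j and sigma_j^2 for each of the n features.

    X is the (m, n) training matrix of mostly non-anomalous examples.
    """
    mu = X.mean(axis=0)                      # mu_j = (1/m) * sum_i x_j^(i)
    sigma2 = ((X - mu) ** 2).mean(axis=0)    # sigma_j^2 = (1/m) * sum_i (x_j^(i) - mu_j)^2
    return mu, sigma2
```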

Now, given a new example x, we calculate p(x) for this new example using the same formula with the fitted parameters:

p(x) = Π_{j=1..n} (1 / (σ_j√(2π))) · exp(−(x_j − μ_j)² / (2σ_j²))

From this formula, the density of the new example is calculated with respect to each feature taken independently. These per-feature densities are then multiplied together to obtain the likelihood of that example, which in turn tells us how likely it is to be an anomaly.
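Continuing the sketch, the per-feature densities of a new example can be multiplied together like this (again, the helper name is hypothetical):

```python
import numpy as np

def p_x(x_new, mu, sigma2):
    """p(x): product of the per-feature Gaussian densities for one example."""
    densities = np.exp(-((x_new - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return float(np.prod(densities))

# Example usage with parameters fitted on the training matrix X:
# mu, sigma2 = fit_gaussian_params(X)
# score = p_x(np.array([13.9, 2.2]), mu, sigma2)
```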

Now that we’ve found p(x) for this example, we check whether it is smaller than the value of another parameter, ε. The value of ε is a very small number determined from the p(x) values of our training set, and it acts as the threshold between anomalous and non-anomalous cases. We flag the example as an anomaly if p(x) is less than epsilon. That is:

p(x) < ε

From this condition we determine our target value y, where:

y = 1 (anomaly) if p(x) < ε
y = 0 (normal) if p(x) ≥ ε
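And the thresholding step as a tiny, self-contained sketch (the numbers below are made up purely for illustration; in practice ε is often tuned on a small labelled validation set):

```python
def is_anomaly(p_of_x, epsilon):
    """Return y = 1 if p(x) < epsilon (flag as anomaly), else y = 0."""
    return 1 if p_of_x < epsilon else 0

# Hypothetical example: with epsilon = 0.02, an example scoring p(x) = 0.0057
# falls below the threshold and is flagged as an anomaly.
print(is_anomaly(0.0057, epsilon=0.02))  # -> 1
```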

Conclusion

Anomaly detectors are used worldwide in various industries for a multitude of purposes, and they are a key part of building robust distributed software. I hope this article gives you a little insight into how they really work!
