Anomaly Detection Algorithm

Given this training set, what we would like to do is carry out density estimation, and all that means is that we will build a model that estimates the probability p(x).

Akshita Guru
Operations Research Bit
7 min read · May 24, 2024


Welcome back! Now that you’ve seen how the Gaussian or the normal distribution works for a single number, we’re ready to build our anomaly detection algorithm. Let’s dive in.

What’s the probability of any given feature vector? Our model for p(x) is going to be as follows: x is a feature vector with values x1, x2, and so on down to xn, and I’m going to model p(x) as the probability of x1, times the probability of x2, times the probability of x3, and so on up through the probability of xn, for the n features in the feature vector.

Now, to fill in this equation a little bit more, we are saying that the probability of the whole feature vector x is the product of p(x1) times p(x2) and so on up through p(xn). In order to model the probability of x1, say the heat feature in this example, we’re going to have two parameters, mu 1 and sigma 1 squared.

What that means is we’re going to estimate the mean of the feature x1 and also the variance of the feature x1, and those will be mu 1 and sigma 1 squared. To model p(x2), note that x2 is a totally different feature, measuring the vibrations of the airplane engine.

We’re going to have two different parameters, which I’m going to write as mu 2 and sigma 2 squared, and it turns out these will correspond to the mean, or average, of the vibration feature and the variance of the vibration feature, and so on. If you have additional features, you’d have mu 3 and sigma 3 squared, up through mu n and sigma n squared.

  • Suppose for an aircraft engine there’s a 1/10 chance that it is unusually hot, and maybe there is a 1 in 20 chance that it vibrates really hard. Then what is the chance that it both runs really hot and vibrates really hard? We’re saying that the chance of that is 1/10 times 1/20, which is 1/200.
  • So it’s really unlikely to get an engine that both runs really hot and vibrates really hard; the chance is the product of these two probabilities. A somewhat more compact way to write this equation is to say that p(x) is equal to the product from j = 1 through n of p(xj).
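The hot-and-vibrating engine example can be checked in a couple of lines (the variable names here are just for illustration):

```python
# Probability that an engine both runs unusually hot AND vibrates really hard,
# assuming the two events are independent, so their probabilities multiply.
p_hot = 1 / 10       # chance the engine is unusually hot
p_vibrates = 1 / 20  # chance it vibrates really hard

p_both = p_hot * p_vibrates
print(p_both)  # 1/200 = 0.005
```

This is exactly the product rule that the compact formula p(x) = product over j of p(xj) generalizes to n features.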

So let’s put it all together and see how you can build an anomaly detection system.

  • The first step is to choose features xi that you think might be indicative of anomalous examples. Having come up with the features you want to use, you would then fit the parameters mu 1 through mu n and sigma squared 1 through sigma squared n, for the n features in your data set. As you might guess, the parameter mu j will be just the average of xj, the feature j, over all the examples in your training set, and sigma squared j will be the average of the squared difference between that feature and the value mu j you just computed. By the way, if you have a vectorized implementation, you can also compute mu as the average of the training examples directly; here x and mu are both vectors, so this is the vectorized way of computing mu 1 through mu n all at the same time. By estimating these parameters on your unlabeled training set, you’ve now computed all the parameters of your model.
  • Finally, when you are given a new example, x test (I’m just going to write the new example as x here), what you would do is compute p(x) and see if it’s large or small. As you saw above, p(x) is the product from j = 1 through n of the probabilities of the individual features, p(xj), with parameters mu j and sigma squared j. If you substitute in the formula for this probability, you end up with the expression 1 over the square root of 2 pi times sigma j, times e raised to the exponent shown, where xj is the j-th feature of your new example and mu j and sigma j are the parameters you computed in the previous step. If you compute out this formula, you get some number for p(x).
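The two steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the function names and the synthetic engine data (heat, vibration) are made up for the example:

```python
import numpy as np

def estimate_gaussian(X):
    """Fit per-feature mean and variance on an unlabeled training set.
    X has shape (m, n): m examples, n features."""
    mu = X.mean(axis=0)                 # mu_j = average of feature j
    var = ((X - mu) ** 2).mean(axis=0)  # sigma_j^2 = average squared deviation
    return mu, var

def p(x, mu, var):
    """Density p(x) = product over j of the univariate Gaussian p(x_j)."""
    coeff = 1.0 / np.sqrt(2 * np.pi * var)
    exponent = -((x - mu) ** 2) / (2 * var)
    return np.prod(coeff * np.exp(exponent))

# Synthetic training set: 500 normal engines with heat ~ N(5, 2^2), vibration ~ N(3, 1^2).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 3.0], scale=[2.0, 1.0], size=(500, 2))

mu, var = estimate_gaussian(X_train)

# Score a new example: flag it as anomalous if p(x) falls below a threshold epsilon.
epsilon = 0.02
x_new = np.array([5.0, 3.0])  # near the center of the training data
is_anomaly = p(x_new, mu, var) < epsilon
```

Note that `estimate_gaussian` uses only the unlabeled training examples, matching the step in the text where the parameters are fit without any anomaly labels.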

One intuition behind what this algorithm is doing is that it will tend to flag an example as anomalous if one or more of the features are either very large or very small relative to what it has seen in the training set. For each of the features xj, you’re fitting a Gaussian distribution. So if even one of the features of the new example is way out in the tail, then p(xj) will be very small, and if just one of the terms in this product is very small, then the overall product will tend to be very small, and thus p(x) will be small. What anomaly detection is doing with this algorithm is providing a systematic way of quantifying whether or not the new example x has any features that are unusually large or unusually small.
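To make this intuition concrete, here is a small sketch (the feature values and parameters are made up for illustration) showing how a single far-out feature shrinks the whole product by orders of magnitude:

```python
import math

def gaussian(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Four features, all modeled with the same illustrative parameters mu = 5, sigma = 2.
typical = gaussian(5.2, 5.0, 2.0)   # a value close to the mean: moderate density
outlier = gaussian(13.0, 5.0, 2.0)  # four standard deviations out: tiny density

p_all_typical = typical ** 4            # every feature looks normal
p_one_outlier = typical ** 3 * outlier  # one feature is way out in the tail
```

Even though three of the four factors are unchanged, the single outlying feature drags `p_one_outlier` far below `p_all_typical`, which is exactly why the product form flags such examples.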

  • Now, let’s take a look at what all this actually means on one example. Here’s a data set with features x1 and x2, and you notice that feature x1 takes on a much larger range of values than feature x2. If you were to compute the mean of feature x1, you’d end up with 5, which is why mu 1 is equal to 5, and it turns out that for this data set, if you compute sigma 1, it will be equal to about 2. If you were to compute mu 2, the average of feature x2, the average is 3, and similarly its variance, or standard deviation, is much smaller, which is why sigma 2 is equal to 1.
  • So that corresponds to one Gaussian distribution for x1 and another Gaussian distribution for x2. If you were to actually multiply p(x1) and p(x2), you’d end up with a 3D surface plot for p(x), where at any point the height of the surface is the product of p(x1) times p(x2) for the corresponding values of x1 and x2. This height signifies which values of x are more likely: values near the middle are more likely, whereas values far out have a much lower chance.
  • Now, let me pick two test examples: the first one I’ll write as x test 1 and the second as x test 2, and let’s see which of these two examples the algorithm will flag as anomalous. I’m going to pick the parameter epsilon to be equal to 0.02. If you were to compute p(x test 1), it turns out to be about 0.0426, which is much bigger than epsilon, so the algorithm will say this looks okay; it doesn’t look like an anomaly. In contrast, if you were to compute p(x) for the point corresponding to x1 equal to about 8 and x2 equal to about 0.5, then p(x test 2) is 0.0021, which is much smaller than epsilon, so the algorithm will flag this as a likely anomaly. Pretty much as you might hope, it decides that x test 1 looks pretty normal, whereas x test 2, which is much further away than anything seen in the training set, looks like it could be an anomaly.
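The worked example above can be reproduced with the stated parameters (mu 1 = 5, sigma 1 = 2, mu 2 = 3, sigma 2 = 1, epsilon = 0.02). The coordinates of x test 1 are not given in the text, so the point used here is a hypothetical one near the bulk of the data:

```python
import math

def gaussian(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def p(x, params):
    """p(x) as the product of per-feature Gaussian densities."""
    result = 1.0
    for xj, (mu, sigma) in zip(x, params):
        result *= gaussian(xj, mu, sigma)
    return result

# Parameters from the example: mu_1 = 5, sigma_1 = 2; mu_2 = 3, sigma_2 = 1.
params = [(5.0, 2.0), (3.0, 1.0)]
epsilon = 0.02

x_test_1 = (5.5, 3.2)  # hypothetical coordinates near the center of the data
x_test_2 = (8.0, 0.5)  # the far-out point described in the text

flag_1 = p(x_test_1, params) < epsilon  # False: looks normal
flag_2 = p(x_test_2, params) < epsilon  # True: flagged as a likely anomaly
```

Running this confirms the qualitative result in the text: the point near the center sits well above the threshold, while the far-out point falls below it and is flagged.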

So you’ve seen the process of how to build an anomaly detection system.


