Finding Unusual Events

Anomaly detection algorithms look at an unlabeled dataset of mostly normal events and learn to detect, or raise a red flag for, any unusual or anomalous event.

Akshita Guru
Operations Research Bit
8 min read · May 2, 2024


Welcome back! We will now talk about an additional algorithm: anomaly detection.

The most common way to carry out anomaly detection is through a technique called density estimation.

What that means is: when you’re given your training set of m examples, the first thing you do is build a model for the probability of x. In other words, the learning algorithm tries to figure out which values of the features x1 and x2 have high probability and which values are less likely, or have a lower probability, of being seen in the dataset.

For example, if you plotted the training examples, it would be quite likely to see examples in the dense region in the middle, so that region would have high probability. Points in an ellipse surrounding it would have somewhat lower probability, points in a still larger ellipse even lower probability, and points outside that lower probability still.

What you then do is compute the probability of x_test. If it is small, or more precisely, if it is less than some small number epsilon (a Greek letter, which you should think of as a small threshold), that means p of x_test is very small; in other words, the specific value of x that you saw for a certain user was very unlikely relative to other usage you have seen. So if p of x_test is less than epsilon, we raise a flag to say that this could be an anomaly. In contrast, if p of x_test is greater than or equal to epsilon, we say it looks okay; this doesn’t look like an anomaly.
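The threshold rule above can be sketched in a few lines of Python. Here the density model p(x) is a single Gaussian with made-up parameters mu and sigma, standing in for whatever model you actually fit, and the epsilon value is chosen purely for illustration:

```python
import math

# A hypothetical density model: a single Gaussian with assumed
# parameters mu and sigma, standing in for whatever p(x) you fit.
def p(x, mu=5.0, sigma=2.0):
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

EPSILON = 0.02  # small threshold, chosen here just for illustration

def is_anomaly(x_test):
    # Flag the example when its estimated probability falls below epsilon.
    return p(x_test) < EPSILON

print(is_anomaly(5.0))   # near the mean: high probability, not flagged
print(is_anomaly(15.0))  # far from the mean: very low probability, flagged
```

The same two-step pattern (score with p, compare against epsilon) carries over unchanged when p is a product of Gaussians over many features.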

Anomaly detection is used today in many applications.

It is frequently used in fraud detection. For example, if you are running a website with many different users, you can compute x^(i) to be the features of user i’s activities. Examples of features might include: how often does this user log in, how many web pages do they visit, how many transactions are they making, how many posts on the discussion forum are they making, and what is their typing speed, that is, how many characters per second do they seem able to type? With data like this, you can then model p of x to capture the typical behavior of a given user.

In the common workflow of fraud detection, you wouldn’t automatically turn off an account just because it seemed anomalous. Instead, you might ask the security team to take a closer look, or put in some additional security checks, such as asking the user to verify their identity with a cell phone number, or asking them to pass a CAPTCHA to prove that they’re human, and so on.

But algorithms like this are routinely used today to find unusual or slightly suspicious activity, so you can more carefully screen those accounts and make sure there isn’t something fraudulent going on.

This type of algorithm is used both to find fake accounts and to identify financial fraud, such as a very unusual pattern of purchases, which may be well worth a security team taking a more careful look at.

Anomaly detection is also frequently used in manufacturing, for anything from an airplane engine to a printed circuit board to a smartphone to a motor, to see if a unit you’ve just manufactured somehow behaves strangely. That may indicate there’s something wrong with your airplane engine or printed circuit board, which might cause you to take a more careful look before you ship that unit to the customer.

It is also used to monitor computers in clusters and data centers. Here x^(i) would be the features of a certain machine i, such as its memory usage or the number of disk accesses per second; CPU load can be a feature too, and features can also be ratios, such as the ratio of CPU load to network traffic. If a specific computer ever behaves very differently from the other computers, it might be worth taking a look at it to see if something is wrong, such as a hard disk failure, a network card failure, or perhaps that it has been hacked into.

Anomaly detection is one of those algorithms that is very widely used even though you don’t seem to hear people talk about it that much.

In order to apply anomaly detection, we’re going to need to use the Gaussian distribution, which is also called the normal distribution.

But let’s take a look at what is the Gaussian or the normal distribution.

Say x is a number, and x is a random number, sometimes called a random variable, so x can take on random values. Suppose the probability of x is given by a Gaussian or normal distribution with mean parameter Mu and variance Sigma squared. What that means is that the probability of x follows a bell-shaped curve. The center or middle of the curve is given by the mean Mu, and the width of the curve is given by the parameter Sigma. Technically, Sigma is called the standard deviation, and the square of Sigma, Sigma squared, is called the variance of the distribution. This curve is p of x, the probability of x.

If you’re wondering what p of x really means, here’s one way to interpret it. If you were to draw, say, 100 numbers from this probability distribution and plot a histogram of them, you would get a histogram that looks vaguely bell-shaped. The smooth curve is what that histogram would converge to if you had not just 100 examples, or 1,000, or a million, or a billion, but a practically infinite number of examples, plotted with very fine histogram bins.

The formula for p of x is: p(x) = 1 / (sqrt(2 * Pi) * Sigma) * e^(-(x - Mu)^2 / (2 * Sigma^2)). Pi here is about 3.14159, the ratio of a circle’s circumference to its diameter. For any given value of Mu and Sigma, if you plot this function as a function of x, you get a bell-shaped curve that is centered at Mu, with the width of the curve determined by the parameter Sigma.
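The formula translates directly into code. This is a minimal sketch using only the standard library; the function name gaussian_pdf is my own choice, not something from the article:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """p(x) for a Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The curve peaks at x = mu, where the exponent term is e^0 = 1,
# so the peak value is just the coefficient 1 / (sqrt(2*pi) * sigma).
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
```

Libraries such as SciPy provide the same density as `scipy.stats.norm.pdf`, but writing it out once makes the role of each parameter explicit.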

Now let’s look at a few examples of how changing Mu and Sigma affects the Gaussian distribution.

First, let me set Mu equal to 0 and Sigma equal to 1. If you plot the Gaussian distribution with mean Mu = 0 and standard deviation Sigma = 1, you notice that the distribution is centered at zero and that the standard deviation is 1.

Now, let’s reduce the standard deviation Sigma to 0.5. The distribution is still centered at zero, because Mu is zero, but it becomes a much thinner curve, because Sigma is now 0.5. Recall that Sigma, the standard deviation, is 0.5, so Sigma squared, the variance, is 0.5 squared, or 0.25. You may have heard that probabilities always have to sum to one; that’s why the area under the curve is always equal to one, and why when the Gaussian distribution becomes skinnier, it has to become taller as well.

Next, I’m going to increase Sigma to 2, so the standard deviation is 2 and the variance is 4. This creates a much wider distribution, and because it’s now wider, it becomes shorter as well, because the area under the curve still equals 1.

Finally, let’s try changing the mean parameter Mu while leaving Sigma at 0.5. In this case, the center of the distribution moves over to the right, but the width of the distribution stays the same, because the standard deviation is 0.5 in both cases. This is how different choices of Mu and Sigma affect the Gaussian distribution.
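The area-under-the-curve argument can be checked numerically: the Gaussian’s peak value, attained at x = Mu, is 1 / (sqrt(2 * Pi) * Sigma), so halving Sigma exactly doubles the height while widening Sigma lowers it. A small sketch:

```python
import math

def peak_height(sigma):
    # The Gaussian's maximum, at x = mu, is 1 / (sqrt(2*pi) * sigma).
    # The area under the curve is always 1, so a smaller sigma
    # (a skinnier curve) forces a taller peak, and vice versa.
    return 1.0 / (math.sqrt(2 * math.pi) * sigma)

for sigma in (1.0, 0.5, 2.0):
    print(sigma, round(peak_height(sigma), 4))
```

Running this prints heights of about 0.3989, 0.7979, and 0.1995 for Sigma = 1, 0.5, and 2 respectively, matching the taller-and-skinnier versus shorter-and-wider curves described above.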

When you’re applying this to anomaly detection, here’s what you have to do.

You are given a dataset of m examples, where each x is just a number; imagine a training set with, say, 11 examples plotted on a number line. What you have to do is estimate a good choice for the mean parameter Mu, as well as for the variance parameter Sigma squared. Given a dataset like this, a Gaussian distribution centered on the data, with a standard deviation matching its spread, would be a pretty good fit. The way you compute Mu and Sigma squared mathematically is: the estimate for Mu is just the average of all the training examples, that is, 1 over m times the sum from i equals 1 through m of x^(i). The estimate for Sigma squared is the average of the squared difference between each training example and the Mu you just estimated, that is, 1 over m times the sum from i equals 1 through m of (x^(i) minus Mu) squared. It turns out that if you implement these two formulas in code, you pretty much get a Gaussian distribution fit to your data.
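Putting the pieces together, here is a minimal end-to-end sketch: fit Mu and Sigma squared from a training set, then score a new point. The 11-example dataset below is made up for illustration, as is the epsilon-style threshold of 1e-3:

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood estimates of mu and sigma^2 from the training set."""
    m = len(xs)
    mu = sum(xs) / m                          # average of the examples
    var = sum((x - mu) ** 2 for x in xs) / m  # average squared deviation from mu
    return mu, var

def p(x, mu, var):
    # Gaussian density written in terms of the variance sigma^2.
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative training set of m = 11 one-dimensional examples (made up).
data = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.1, 4.9, 5.0, 5.0]
mu, var = fit_gaussian(data)
print(round(mu, 2), round(var, 3))  # mean near 5.0, small variance
print(p(9.0, mu, var) < 1e-3)       # a far-off point gets very low probability
```

With multiple features, the usual extension is to fit one (Mu, Sigma squared) pair per feature and take p(x) as the product of the per-feature densities.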

References

[1] Clustering is one of the unsupervised learning techniques that we have already covered; you can read about it here.

[2] I hope you found this summary of anomaly detection interesting. You can connect with me on the following: Linkedin | GitHub | Medium | email : akshitaguru16@gmail.com
