Anomaly Detection for Beginners
Anomalies and outliers and how to find them.
When I started to write this article, the first thing I did was look for formal definitions of anomaly and outlier. It turns out there is no consensus on the matter. Every field has (annoyingly) a slightly different opinion:
- Statisticians, for the most part, will use the two terms interchangeably. [ref]
- Climatologists would say that a (temperature) anomaly is the difference between a value and the mean [ref]
- In manufacturing, anomalies are defects (I have not seen anyone use the term outlier)
- In banking and insurance anomalies are synonymous with fraud (but in some papers, they would use the term outlier interchangeably)
- And don’t ask physicists for their opinion, unless you understand quantum field theory …
At the risk of annoying some people, I’d like to draw a distinct line between anomaly and outlier in the context of Machine Learning. These are not original definitions, merely an attempt to translate the opinions of much more intelligent and experienced people into simpler terms.
Outlier
An outlier is an improbable data point
Imagine that you have a shoe store and would like to know how many shoes of each size you need to order. You call your friendly neighborhood data scientist. He pulls some data and sends you the following:
That looks exactly like you expected — a nice normal distribution; sizes 40 and 41 being the most common. But wait, you look at the table to the right and see something really odd — size 50. You have been selling shoes for years and have never heard of such a monstrosity. You call back your data scientist and ask for your money back, because the data is incorrect. The friendly data scientist then explains to you that:
“1 in 100,000 people has size 50 shoes. That is very rare, but still possible.
That is an outlier!”
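The data scientist’s reasoning can be sketched in a few lines. The mean and standard deviation below are made-up stand-ins (the article’s data is synthetic anyway), so the exact probability will not match the “1 in 100,000” quote:

```python
from statistics import NormalDist

# Hypothetical shoe-size distribution: mean and standard deviation
# are illustrative guesses, not values from the article's data
shoe_sizes = NormalDist(mu=40.5, sigma=2.0)

# Probability of seeing a size 50 or larger under this model
p_size_50 = 1 - shoe_sizes.cdf(50)
print(f"P(size >= 50) = {p_size_50:.2e}")  # tiny, but not zero: an outlier
```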
Anomaly
An anomaly is something created by a different process entirely. There is a different underlying function or a distribution.
Back in the shoe store, we now want to run a campaign where we give you a free pair of jeans with every shoe purchase. We want to make sure that we order the correct jean sizes and tie those to shoe sizes (please just roll with it …).
So who you gonna call? That’s right, back we go to that overworked and slightly underpaid data scientist. He now sends us something a bit more complicated:
This is a scatter plot that shows how shoe size and height are correlated. The line in the middle is a linear model that shows a very high correlation between the two variables.
However, your keen eye has spotted something wrong with this picture. Let me use the power of colour and show you what I am talking about:
There are two orange dots on the chart that seem a bit out of place. I have pasted them below for closer investigation.
- Both shoe sizes 35 and 47 are within our expected range.
- Heights of 158 and 183 are also quite normal.

It is almost impossible for a human to be 183 cm tall and have size 35 shoes. Or be 158 cm with size 47 shoes (unless he is called Sideshow Bob). That is an anomaly.
Let’s look a bit closer at the people that wear size 35:
As you can see, a height of 183 cm is more than 10 cm greater than the largest value in this group (shoe size 35). That data point is not realistically part of the population.
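One way to automate what our eyes just did is to fit the same kind of linear model and flag points with extreme residuals. Everything below (the simulated data, the two injected points, the 3-sigma threshold) is illustrative, not the article’s actual notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the article's data: height roughly linear in shoe size
shoe = rng.normal(40.5, 2.0, 500)
height = 100 + 1.8 * shoe + rng.normal(0, 3, 500)

# Inject the two "Sideshow Bob" points: valid sizes, implausible combinations
shoe = np.append(shoe, [35, 47])
height = np.append(height, [183, 158])

# Fit a line height ~ shoe and flag points whose residual is extreme
slope, intercept = np.polyfit(shoe, height, 1)
residuals = height - (slope * shoe + intercept)
anomalies = np.abs(residuals) > 3 * residuals.std()
print(np.where(anomalies)[0])  # the two injected points should show up here
```

Note that neither injected point is extreme on either axis alone; only the residual from the joint trend gives them away.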
Recap
When looking at shoe size or height individually (univariate), we only see outliers — values that are rare but still probable. Analyzing the dataset as a whole (multivariate), we can see combinations of values that are anomalies.
Anomaly detection
Now that we (hopefully) understand how this works, let’s take it up a notch and go looking for anomalies in a dataset with 3 variables.
I will kick in an extra variable called `weight` (to our previous dataset) and boot out the shoe store allegory.
These kinds of plots are very useful to get a first glimpse into how the different variables interact with each other:
- All three variables are normally distributed
- `shoe_size` and `height` are positively correlated
- `weight` and `height` are also positively correlated
- `shoe_size` and `weight` are not correlated
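If you want to play along at home, a dataset with this correlation structure can be simulated directly. The means, scales, and correlation values below are my own guesses, chosen only to match the pairplot description, not the article’s actual numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical means/scales and a correlation matrix matching the description:
# shoe~height and weight~height positive, shoe~weight roughly uncorrelated
means = [41, 170, 70]                # shoe_size, height (cm), weight (kg)
stds = np.array([2.0, 8.0, 9.0])
corr = np.array([[1.0, 0.7, 0.0],
                 [0.7, 1.0, 0.5],
                 [0.0, 0.5, 1.0]])
cov = corr * np.outer(stds, stds)    # scale correlations into a covariance

data = rng.multivariate_normal(means, cov, size=1000)
print(np.corrcoef(data.T).round(2))  # sample correlations, close to `corr`
```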
We can try and find the anomalies like we did in the previous section, but that will take a while. There we had one scatter plot that we could visually inspect, now we have three. A more systematic approach is required if you don’t want to spend the rest of the day staring at dots and histograms.
PCA
Enter Principal Component Analysis. I am not going to go into details about it, but you can follow the link if you want to know more or need a refresher.
PCA uses fancy algebra to take any number of variables (dimensions) and reduce them to a more manageable number without losing much of the information stored in the data.
In our example we take the 3 variables and reduce them to 2, and we get to keep 99% of the explanatory power. Needless to say, this is much more impressive (and useful) when you have even more variables, but let’s keep it simple.
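A minimal PCA sketch using scikit-learn, on a stand-in dataset with a similar correlated structure. The numbers are invented, so the exact variance ratio will differ from the article’s 99%, but strong correlation is what makes two components enough:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the 3-column dataset; all three columns share a common
# "height" factor, which is why 2 components capture most of the variance
height = rng.normal(170, 8, 1000)
X = np.column_stack([
    0.25 * height + rng.normal(0, 1.0, 1000),   # shoe_size
    height,
    0.7 * height + rng.normal(0, 5.0, 1000),    # weight
])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                     # 3 columns down to 2
print(pca.explained_variance_ratio_.sum())      # close to 1 for correlated data
```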
One of the upsides of PCA is that you can more easily visualize your data. Now we can much more easily see those outliers/anomalies:
But before you get your crayons out and start colouring in those dots, let me show you another trick from the data science hat. We are really deep into the rabbit hole now, please stay with me, we are almost in Wonderland.
DBSCAN
DBSCAN is an unsupervised clustering algorithm (you can get a nice intro here). We can feed it the PCA data and some basic rules and have it figure out which is the main cluster. It does that by measuring how close each point is to its neighbors. If it’s surrounded by lots of them — it’s part of a cluster. If a data point is too far away — it’s an anomaly.
`DBSCAN().labels_` returns an array where each cluster is given a number, starting from 0, and anything that doesn’t fit into a cluster is -1. Since our dataset is very uniform, we only have one cluster.
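A small sketch of that workflow, with a toy 2-D blob standing in for the PCA output. The `eps` and `min_samples` values are illustrative guesses, not tuned settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# One dense 2-D blob standing in for the PCA-projected data,
# plus a few far-away points that should come back labelled -1
cluster = rng.normal(0, 1, (300, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([cluster, outliers])

# eps = neighborhood radius, min_samples = neighbors needed to be "dense"
labels = DBSCAN(eps=0.7, min_samples=5).fit(X).labels_
print(sorted(set(labels)))   # e.g. [-1, 0]: one main cluster plus noise
```

Tuning `eps` is exactly the boundary-finding art discussed below: too small and everything is noise, too large and the anomalies get absorbed into the cluster.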
In the image below you can see the PCA plot from before, with the anomalies coloured in salmon.
The hardest thing in this kind of analysis is always finding a good boundary. And that is more of an art than a science despite the plethora of statistical and ML methods swimming out there.
For the sake of argument let’s agree that our salmon dots are quite the catch and move to the next step.
PC1 and PC2 are not very useful numbers for us mortals, so we need to go back to the original data. Below we add the anomalies (DBSCAN cluster) to the pairplot from before:
With this visualization it is much easier to explain why certain points are anomalies.
What is even more interesting is that we have found anomalies that we would have missed if we had used the method from the first part of the article. Look at the anomaly data points in the bottom left scatter plot (`weight~height`). We could have probably found the ones that are visually outside the main “cluster” (distribution), but we would have missed the ones that are “inside” (with a higher frequency/probability).
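One way to quantify “inside both marginals but off the joint trend” is the Mahalanobis distance — a technique the article doesn’t use, shown here only to make the point concrete. All numbers below are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two strongly correlated variables; the suspect point sits inside both
# marginal ranges but well off the joint trend
x = rng.normal(0, 1, 1000)
y = 0.9 * x + rng.normal(0, 0.3, 1000)
data = np.column_stack([x, y])
suspect = np.array([1.5, -1.5])   # plausible x, plausible y, implausible pair

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data.T))

def mahalanobis(p):
    # distance from the mean, scaled by the joint covariance
    d = p - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Both univariate z-scores are under 2, yet the joint distance is extreme
print(mahalanobis(suspect))
```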
And for the viewing pleasure of the spreadsheet nerds, you can find a random sample of the anomalies and their values below:
Let’s analyze some of them:
- 76283 (2nd row) — height and shoe size seem acceptable, but the weight is too low
- 100024 (last row) — height and weight are fine, but the shoe size seems preposterous
- 100001 (somewhere in the middle) — that observation is the definition of an anomaly. 184 cm tall at 64 kg with size 35 feet? Sounds like Slenderman to me.
Final words
- The dataset I used was hand crafted by me for the purposes of this article, so don’t go around using those graphs to draw conclusions about the real world.
- Real datasets are much more complicated and messy. Having a good grasp of the process or business logic behind them will go a long way in the quest of anomaly detection.
- The notebook(s) for this article can be found here
Hope you enjoyed reading this as much as I enjoyed writing it. Questions and comments are, as always, highly encouraged.