Anomaly Detection for Beginners

Ivan N.
Datasparq Technology
Apr 14, 2021

Anomalies and outliers and how to find them.

When I started writing this article, the first thing I did was look for formal definitions of anomaly and outlier. It turns out there isn’t a consensus on the matter. Every field has (annoyingly) a slightly different opinion:

  • Statisticians, for the most part, will use the two terms interchangeably. [ref]
  • Climatologists would say that a (temperature) anomaly is the difference between a value and the mean [ref]
  • In manufacturing, anomalies are defects (I have not seen anyone use the term outlier)
  • In banking and insurance anomalies are synonymous with fraud (but in some papers, they would use the term outlier interchangeably)
  • And don’t ask physicists for their opinion, unless you understand quantum field theory
Manufacturing defects (anomalies). Source: Kamoona et al.

At the risk of annoying some people, I’d like to draw a distinct line between anomaly and outlier in the context of Machine Learning. These are not original definitions, merely an attempt to translate the opinions of much more intelligent and experienced people into simpler terms.

Outlier

An outlier is an improbable data point

Imagine that you have a shoe store and would like to know how many shoes of each size you need to order. You call your friendly neighborhood data scientist. He pulls some data and sends you the following:

Synthetic data (does not represent reality)

That looks exactly like you expected: a nice normal distribution, with sizes 40 and 41 being the most common. But wait, you look at the table to the right and see something really odd: size 50. You have been selling shoes for years and have never heard of such a monstrosity. You call back your data scientist and ask for your money back, because the data is incorrect. The friendly data scientist then explains to you that:

“1 in 100,000 people has size 50 shoes. That is very rare, but still possible.
That is an outlier!”
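That “1 in 100,000” figure is easy to sanity-check under a normality assumption. Here is a minimal sketch; the mean and standard deviation below are made-up values for illustration, not fitted to real data:

```python
import math

def normal_tail(x, mu, sigma):
    """P(X >= x) for X ~ Normal(mu, sigma), via the complementary error function."""
    z = (x - mu) / (sigma * math.sqrt(2))
    return 0.5 * math.erfc(z)

# Hypothetical parameters for the shoe-size distribution (not real data)
mu, sigma = 40.5, 2.25
p = normal_tail(50, mu, sigma)
print(f"P(size >= 50) is roughly {p:.1e}  (about 1 in {round(1 / p):,})")
```

Improbable, but with a big enough population you will still see a few of them. That is exactly what makes it an outlier rather than a data error.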

A 6ft 5in bride-to-be with size 15 feet is unable to find wedding shoes big enough. Source: The Sun

Anomaly

An anomaly is something created by a different process entirely. There is a different underlying function or a distribution.

Back in the shoe store, we now want to run a campaign where we give you a free pair of jeans with every shoe purchase. We want to make sure that we order the correct jean sizes and tie those to shoe sizes (please just roll with it …).

Source

So who you gonna call? That’s right, back we go to that overworked and slightly underpaid data scientist. He now sends us something a bit more complicated:

Linear regression of shoe size and height

This is a scatter plot showing how shoe size and height are correlated. The line in the middle is a linear model that shows a very high correlation between the two variables.

However, your keen eye has spotted something wrong with this picture. Let me use the power of colour and show you what I am talking about:

There are two orange dots on the chart that seem a bit out of place. I have pasted them below for closer investigation.

  • Both shoe sizes 35 and 47 are within our expected range.
  • Heights of 158 and 183 are also quite normal

It is almost impossible for a human to be 183 cm tall and wear size 35 shoes, or to be 158 cm tall with size 47 shoes (unless he is called Sideshow Bob). That is an anomaly.

Everyone knows that Sideshow Bob has massive feet. Source

Let’s look a bit closer at the people that wear size 35:

As you can see, a height of 183 cm is more than 10 cm greater than the largest value in this group (shoe size 35). That data point is not realistically part of the population.
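That gap-to-the-rest-of-the-group check can be sketched like this; the heights below are made up, with one deliberately suspect value in the size-35 group:

```python
from collections import defaultdict

# Hypothetical (shoe_size, height) pairs; 183 cm at size 35 is the suspect row
rows = [(35, 150), (35, 152), (35, 155), (35, 158), (35, 183),
        (36, 156), (36, 160)]

by_size = defaultdict(list)
for size, height in rows:
    by_size[size].append(height)

heights_35 = sorted(by_size[35])
bulk_max = heights_35[-2]  # largest height excluding the single suspect point
gap = heights_35[-1] - bulk_max
print(f"size 35: bulk max {bulk_max} cm, suspect {heights_35[-1]} cm, gap {gap} cm")
```

The comparison only works because we conditioned on shoe size first; looked at on its own, 183 cm is a perfectly ordinary height.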

Recap

When looking at shoe size or height individually (univariate), we only see outliers: values that are rare but still probable. Analyzing the dataset as a whole (multivariate), we can see combinations of values that are anomalies.

Anomaly detection

No, not that kind of anomaly. Source

Now that we (hopefully) understand how this works, let’s take it up a notch and go looking for anomalies in a dataset with 3 variables.
I will kick in an extra variable called weight (to our previous dataset) and boot out the shoe store allegory.

Pairplot that shows relationship between all the variables

These kinds of plots are very useful for getting a first glimpse into how the different variables interact with each other:

  • All three variables are normally distributed
  • shoe_size and height are positively correlated
  • weight and height are also positively correlated
  • shoe_size and weight are not correlated

We can try to find the anomalies like we did in the previous section, but that will take a while. There we had one scatter plot to visually inspect; now we have three. A more systematic approach is required if you don’t want to spend the rest of the day staring at dots and histograms.

PCA

Enter Principal Component Analysis. I am not going to go into details about it, but you can follow the link if you want to know more or need a refresher.

PCA uses fancy algebra to take any number of variables (dimensions) and reduce them to a more manageable number without losing much of the information stored in the data.

In our example we take the 3 variables and reduce them to 2, and we get to keep 99% of the explanatory power. Needless to say, this is much more impressive (and useful) when you have even more variables, but let’s keep it simple.
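The reduction is just centering followed by an SVD. Here is a sketch on synthetic 3-column data (the numbers are made up; only the structure, one latent factor driving all three columns, matters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for shoe_size, height, weight: one latent factor
# plus a little independent noise per column
latent = rng.normal(size=(500, 1))
X = np.hstack([2.0 * latent, 5.0 * latent, 1.5 * latent])
X += rng.normal(scale=0.3, size=(500, 3))

# PCA by hand: center the columns, then SVD; keep the first two components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
X2 = Xc @ Vt[:2].T  # data projected onto 2 principal components

print("explained variance ratios:", np.round(explained, 3))
print("variance kept by 2 components:", round(explained[:2].sum(), 3))
```

In practice you would use `sklearn.decomposition.PCA(n_components=2)`, which wraps the same center-then-SVD recipe and exposes the kept variance as `explained_variance_ratio_`.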

One of the upsides of PCA is that you can more easily visualize your data. Now we can see those outliers/anomalies much more easily:

But before you get your crayons out and start colouring in those dots, let me show you another trick from the data science hat. We are really deep into the rabbit hole now, please stay with me, we are almost in Wonderland.

Source

DBSCAN

No, not that kind of scan. Source

DBSCAN is an unsupervised clustering algorithm (you can get a nice intro here). We can feed it the PCA data and some basic rules and have it figure out which is the main cluster. It does that by figuring out how close each point is to its neighbors. If it’s surrounded by lots of them — it’s part of a cluster. If a data point is too far — it’s an anomaly.

DBSCAN().labels_ returns an array where each cluster is given a number, starting from 0, and anything that doesn’t fit into a cluster is -1. Since our dataset is very uniform, we only have one cluster.
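For intuition, here is a toy version of the density rule DBSCAN is built on. It only flags low-density points and skips the cluster-expansion step of the real algorithm; the 2-D "PCA" coordinates and parameters are made up:

```python
import math

def flag_anomalies(points, eps=1.0, min_samples=4):
    """Toy density rule: a point with fewer than min_samples neighbours
    within radius eps (itself included) is labelled -1 (anomaly),
    everything else 0. Real DBSCAN also grows clusters from core points."""
    labels = []
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        labels.append(0 if neighbours >= min_samples else -1)
    return labels

# A dense blob plus two far-away stray points
cluster = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
strays = [(5.0, 5.0), (-4.0, 6.0)]
labels = flag_anomalies(cluster + strays)
print(labels.count(-1), "anomalies out of", len(labels))
```

In practice you would call `sklearn.cluster.DBSCAN(eps=..., min_samples=...).fit(X)` on the PCA output and read `labels_` off the fitted object; tuning `eps` is exactly the boundary-finding art mentioned below.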

In the image below you can see the PCA plot from before, with the anomalies coloured in salmon.

The hardest thing in this kind of analysis is always finding a good boundary. And that is more of an art than a science despite the plethora of statistical and ML methods swimming out there.

For the sake of argument let’s agree that our salmon dots are quite the catch and move to the next step.

Photo by NOAA on Unsplash

PC1 and PC2 are not very useful numbers for us mortals, so we need to go back to the original data. Below we add the anomalies (the DBSCAN cluster labels) to the pairplot from before:

With this visualization it is much easier to explain why certain points are anomalies.
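Getting to that visualization is just a matter of joining the labels from PCA space back onto the original rows. A sketch, with made-up rows, column names, and labels:

```python
# One cluster label per row (e.g. from DBSCAN on the PCA coordinates);
# -1 means anomaly. All values here are invented for illustration.
rows = [
    {"id": 1, "shoe_size": 41, "height": 172, "weight": 70},
    {"id": 2, "shoe_size": 35, "height": 184, "weight": 64},
    {"id": 3, "shoe_size": 43, "height": 180, "weight": 82},
]
labels = [0, -1, 0]

for row, label in zip(rows, labels):
    row["anomaly"] = label == -1

anomalies = [r for r in rows if r["anomaly"]]
print(anomalies)
```

With the data in a DataFrame, that boolean column is what you hand to seaborn as the colouring variable, e.g. `seaborn.pairplot(df, hue="anomaly")`.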

What is even more interesting is that we have found anomalies that we would have missed if we used the method from the first part of the article. Look at the anomaly data points in the bottom left scatter plot (weight~height). We could have probably found the ones that are visually outside the main “cluster” (distribution), but we would have missed the ones that are “inside” (with a higher frequency/probability).

Source

And for the viewing pleasure of the spreadsheet nerds, you can find a random sample of the anomalies and their values below:

sample of 20 anomalies (out of 39)

Let’s analyze some of them:

  • 76283 (2nd row) — height and shoe size seem acceptable, but the weight is too low
  • 100024 (last row) — height and weight are fine, but the shoe size seems preposterous
  • 100001 (somewhere in the middle) — that observation is the definition of anomaly. 184 cm tall at 64 kg and size 35 feet? Sounds like Slenderman to me.
Slenderman. Source

Final words

  • The dataset I used was hand crafted by me for the purposes of this article, so don’t go around using those graphs to draw conclusions about the real world.
  • Real datasets are much more complicated and messy. Having a good grasp of the process or business logic behind them will go a long way in the quest of anomaly detection.
  • The notebook(s) for this article can be found here

Hope you enjoyed reading this as much as I enjoyed writing it. Questions and comments are, as always, highly encouraged.
