Anomaly Detection — Part 1

Raviteja Arava
Published in Analytics Vidhya
5 min read · Jun 26, 2020

Can Mean or Median be an Anomaly?

An anomaly is a data point that differs significantly from other observations. However, it need not be an extreme value. For example, take the one-dimensional data set {25, 8, 9, 18, 11, 12, 50, 80, 83, 85, 83, 84, 102}. There are thirteen numbers in the data set. The mean of this data set is 50, and its sample standard deviation is 36.76729.

If we sort the given data set, we get {8, 9, 11, 12, 18, 25, 50, 80, 83, 83, 84, 85, 102}. As the data set has thirteen numbers, the seventh number is the median. So, the median is 50. On further computation, we find that the 10th number is Q3 = 83 and the 4th number is Q1 = 12. Interquartile Range = Q3 - Q1 = 83 - 12 = 71.
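
The summary statistics above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library (Python 3.8 or later is assumed). The `method="inclusive"` option is my choice, made because it picks the 4th and 10th sorted values as Q1 and Q3, matching the figures in this article; other quartile conventions give slightly different numbers.

```python
import statistics

data = [25, 8, 9, 18, 11, 12, 50, 80, 83, 85, 83, 84, 102]

mean = statistics.mean(data)      # arithmetic mean
std = statistics.stdev(data)      # sample standard deviation (divides by n - 1)
median = statistics.median(data)  # middle value of the sorted data

# "inclusive" quartiles land exactly on the 4th and 10th sorted values here
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("mean:", mean)
print("sample std:", round(std, 5))
print("median:", median)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```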

Anomaly Based on Mean and Standard Deviation:

We can identify outliers using the mean and standard deviation. Below, we define three levels of anomaly based on them:

(1) Weak Anomaly: Any value outside the range (Mean - Standard Deviation) to (Mean + Standard Deviation).

Lower Bound = 50 - 36.76729 = 13.23271

Upper Bound = 50 + 36.76729 = 86.76729

From this, we find that the values 8, 9, 11 and 12 are smaller than the lower bound and 102 is bigger than the upper bound. So, the values 8, 9, 11, 12 and 102 are weak anomalies based on mean and standard deviation.

(2) Medium Anomaly: Any value outside the range (Mean - 2*Standard Deviation) to (Mean + 2*Standard Deviation).

Let us compute the lower and upper bounds as per this method:

Lower Bound = 50 - 2*36.76729 = -23.53458

Upper Bound = 50 + 2*36.76729 = 123.53458

As no value is below the lower bound and no value is above the upper bound, there is no medium anomaly.

(3) Strong Anomaly: Any value outside the range (Mean - 3*Standard Deviation) to (Mean + 3*Standard Deviation).

As there is no medium anomaly, there cannot be any strong anomaly, because the strong bounds are wider than the medium bounds. However, let us still compute the lower and upper bounds.

Lower Bound = 50 - 3*36.76729 = -60.30187

Upper Bound = 50 + 3*36.76729 = 160.30187

As there is no value outside the lower and upper bounds, there is no strong anomaly based on this method.
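
The three levels above differ only in the multiplier applied to the standard deviation, so they can be checked with one small function. The sketch below is illustrative only. The function name `sigma_anomalies` and the parameter `k` are my own, and the sample standard deviation is assumed because it matches the value 36.76729 used in this article.

```python
import statistics

data = [25, 8, 9, 18, 11, 12, 50, 80, 83, 85, 83, 84, 102]

def sigma_anomalies(values, k):
    """Return (lower, upper, flagged) for the band mean +/- k * sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    lower, upper = mean - k * std, mean + k * std
    # A value is flagged when it falls strictly outside the band
    flagged = sorted(v for v in values if v < lower or v > upper)
    return lower, upper, flagged

# k = 1, 2, 3 correspond to the weak, medium and strong levels
for label, k in [("weak", 1), ("medium", 2), ("strong", 3)]:
    lower, upper, flagged = sigma_anomalies(data, k)
    print(f"{label:>6}: [{lower:.5f}, {upper:.5f}] -> {flagged}")
```

Because each band contains the previous one, any value flagged at a higher level is also flagged at every lower level.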

Anomaly Based on Interquartile Range:

Below, we define anomalies based on the interquartile range:

(1) Weak Anomaly: Any value outside the range (Q1 - 0.75*Interquartile Range) to (Q3 + 0.75*Interquartile Range).

Lower Bound = Q1 - 0.75*Interquartile Range = 12 - 0.75*71 = -41.25

Upper Bound = Q3 + 0.75*Interquartile Range = 83 + 0.75*71 = 136.25

As per this method, there is no value outside the lower and upper bounds, so there is no weak anomaly.

(2) Medium Anomaly: Any value outside the range (Q1 - Interquartile Range) to (Q3 + Interquartile Range).

Lower Bound = Q1 - Interquartile Range = 12 - 71 = -59

Upper Bound = Q3 + Interquartile Range = 83 + 71 = 154

As there is no value outside the lower and upper bounds, there is no medium anomaly.

(3) Strong Anomaly: Any value outside the range (Q1 - 1.5*Interquartile Range) to (Q3 + 1.5*Interquartile Range).

Lower Bound = Q1 - 1.5*Interquartile Range = 12 - 1.5*71 = -94.5

Upper Bound = Q3 + 1.5*Interquartile Range = 83 + 1.5*71 = 189.5

As there is no value outside the lower and upper bounds, there is no strong anomaly as per the interquartile range.
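
The interquartile-range bounds follow the same pattern, with the multipliers 0.75, 1 and 1.5. The sketch below is a hedged illustration. `iqr_anomalies` and `k` are names I have introduced, and the `"inclusive"` quartile method is assumed because it gives Q1 = 12 and Q3 = 83 as in this article.

```python
import statistics

data = [25, 8, 9, 18, 11, 12, 50, 80, 83, 85, 83, 84, 102]

def iqr_anomalies(values, k):
    """Return (lower, upper, flagged) for the range Q1 - k*IQR to Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A value is flagged when it falls strictly outside the range
    flagged = sorted(v for v in values if v < lower or v > upper)
    return lower, upper, flagged

# Multipliers for the weak, medium and strong levels defined above
for label, k in [("weak", 0.75), ("medium", 1.0), ("strong", 1.5)]:
    lower, upper, flagged = iqr_anomalies(data, k)
    print(f"{label:>6}: [{lower}, {upper}] -> {flagged}")
```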

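If we carefully study the above methods, we find that a value must be far from the mean or median to be an anomaly. Thus, these methods will not flag a value equal to the mean or median: the mean always lies within the standard-deviation bounds, and the median always lies between Q1 and Q3.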

I used a software tool, “Discover”, to compute anomalies for the given data. Please note that “Discover” can handle millions of rows and hundreds of columns, and can compute anomalies on data of that size in a few minutes. So, using it for 13 rows and one column was overkill. However, I wanted to find the outliers of this data set, so I used it out of curiosity. Please note that I ran “Discover” on the unsorted data. Below is a screenshot of the result:

Discover’s result for Anomaly detection of the data set mentioned above

From the above, we find that the value 50 is an anomaly with a score of 72.42% and 102 is an anomaly with a score of 50.98%. So, if there is any anomaly it must be 50, though it is likely to be a weak anomaly as the score is only about 72%. Please note that 50 is both the mean and the median of the given data. Statisticians will probably not agree with this result, but let us analyse it further.

Let us create the graph of sorted values:

Data points of the mentioned data set plotted as graph

If we look at the above graph, the values fall into four clusters. The first cluster contains the values 8, 9, 11, 12, 18 and 25, which are close to each other. The value 50 (the mean and median) forms the second cluster. It is isolated and far from any other cluster. So, if there is any anomaly, it must be this one. The values 80, 83, 83, 84 and 85 are close to each other and form the third cluster. The last value, 102, forms the fourth cluster. It is also isolated, but it is not as far from 85. Thus, the results given by “Discover” agree with what the graph shows, and the score denotes the outlier score in percent.
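
The four clusters can also be recovered without a plot by splitting the sorted values wherever the gap between neighbours is large. The sketch below is my own illustration and is not how “Discover” works internally. The function `split_by_gaps` and the threshold of 15 are assumptions, chosen because the three largest gaps in this data (17, 25 and 30) all exceed it while the rest are 7 or less.

```python
data = [25, 8, 9, 18, 11, 12, 50, 80, 83, 85, 83, 84, 102]

def split_by_gaps(values, max_gap):
    """Group sorted values; start a new cluster when a gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            clusters.append([cur])    # gap too large: open a new cluster
        else:
            clusters[-1].append(cur)  # close enough: extend the current cluster
    return clusters

# max_gap=15 is a hypothetical threshold picked for this data set only
for number, cluster in enumerate(split_by_gaps(data, max_gap=15), start=1):
    print(f"cluster {number}: {cluster}")
```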

The nearest value to 50 is 25, a difference of 25. The next nearest value is 80, a difference of 30. So, both of its nearest neighbours are far from 50. No other value is this far from its nearest neighbours. For example, the nearest value to 102 is 85, a difference of 17, and the next nearest is 84, a difference of 18.
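
The same comparison can be made for every value in the data set. The sketch below is only an illustration of the nearest-neighbour reasoning in this article, not the scoring used by “Discover”. The helper `nearest_two_gaps` is a name I have introduced.

```python
data = [25, 8, 9, 18, 11, 12, 50, 80, 83, 85, 83, 84, 102]

def nearest_two_gaps(values, target):
    """Distances from one occurrence of target to its two closest other points."""
    others = list(values)
    others.remove(target)  # drop a single occurrence so duplicates still count
    distances = sorted(abs(v - target) for v in others)
    return distances[0], distances[1]

# List every distinct value with the distances to its two nearest neighbours
for value in sorted(set(data)):
    print(value, nearest_two_gaps(data, value))

# The value whose nearest neighbour is farthest away is the most isolated one
most_isolated = max(set(data), key=lambda v: nearest_two_gaps(data, v)[0])
print("most isolated:", most_isolated)
```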

Thus, the mean and median can be outliers, though they are likely to be weak anomalies, and only in exceptional circumstances.

Managing Inventory Efficiently using Anomaly:

Can we use anomalies to detect excess inventory and short inventory (items to be ordered urgently to avoid a stock-out)? I received sales and net inventory data, with lead times, for around 100,000 items from a company. The company was facing the challenge of maintaining optimized inventory. Some items were stocked at higher levels than required, while others did not have sufficient stock, resulting in the loss of both revenue and customers. As I cannot share the real data, I have changed the Inventory IDs and taken a random 10% of the items.

Further details about this will be provided in the next part.

Raviteja Arava
Analytics Vidhya

Machine Learning & Strategy Consultant | Indian Institute of Technology Madras | Indian Institute of Management Bangalore