Handbook of Anomaly Detection: with Python Outlier Detection — (2) HBOS

Chris Kuo/Dr. Dataman
Published in Dataman in AI
11 min read · Oct 20, 2021


Consider multi-dimensional data, like a data frame or a table in an Excel spreadsheet. The columns are the dimensions, also called variables, and the rows are the observations, so each observation has one value per variable. Counting how often the values of a variable fall into each range gives the count statistic known as a histogram. If there are N variables, there will be N histograms. If a value of an observation falls in the tail of its variable's histogram, that value is an outlier.

It is often the case that some values of an observation are outliers with respect to their variables while the other values are normal. If many values of an observation are outliers, the observation itself is very likely to be an outlier.
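To make the "tail of a histogram" idea concrete, here is a minimal sketch for a single variable. It assumes equal-width bins built with NumPy; the function name `bin_counts_for_values` and the choice of 10 bins are illustrative and do not come from the article.

```python
import numpy as np

def bin_counts_for_values(column, n_bins=10):
    """For each value in a 1-D array, return the count of the histogram bin it falls in."""
    column = np.asarray(column, dtype=float)
    counts, edges = np.histogram(column, bins=n_bins)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1,
    # with the maximum value assigned to the last bin, as np.histogram does.
    idx = np.digitize(column, edges[1:-1])
    return counts[idx]

# Hypothetical data: 500 draws from a standard normal plus one extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(0, 1, 500), 8.0)
per_value_count = bin_counts_for_values(values)

# Values sitting in sparsely populated bins are the candidates for "tail" values.
print(values[per_value_count <= 2])
```

A value that lands in a bin with very few neighbours sits in the tail of the histogram. How small a count qualifies as a tail is a judgment call, and the threshold of 2 above is only for illustration.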

With this intuition, the histogram of a variable can be used to define a univariate outlier score for that variable, so each observation has N univariate outlier scores, one per variable. The technique assumes the variables are independent, which is what allows the histograms and the univariate scores to be derived one variable at a time. The N univariate outlier scores of an observation can be summed up to become the Histogram-based Outlier…
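The summation described above can be sketched in a few lines of NumPy. The text shown here does not give the exact univariate score, so this sketch assumes one common convention: the logarithm of the inverse of the normalized bin height, so that rarer bins score higher. The function name `hbos_scores`, the equal-width bins, and the default of 10 bins are likewise assumptions for illustration.

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Sum per-variable histogram scores for each row of a 2-D array (a sketch)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        # One histogram per variable, built independently of the other variables.
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        # Normalize so the most populated bin has height 1.
        heights = counts / counts.max()
        # Bin index of every value; interior edges give indices 0..n_bins-1.
        idx = np.digitize(X[:, j], edges[1:-1])
        # Assumed univariate score: log(1 / height). Every observed value lies
        # in a bin with count >= 1, so the height is never zero here.
        scores += np.log(1.0 / heights[idx])
    return scores

# Hypothetical data: 200 ordinary rows and one row that is extreme in every variable.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 3)), [6.0, 6.0, 6.0]])
scores = hbos_scores(X)
print(scores[-1], np.median(scores))
```

Because each variable contributes its own term, a row that is unusual in several variables accumulates a larger total than a row that is unusual in only one, which matches the intuition in the paragraph above.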
