On detecting outliers with Alibi Detect

Sabrine Hamroun, PhD
fifty-five | Data Science
9 min read · Sep 10, 2021

Outliers are data points that are abnormally distant from the rest of the observations in a dataset. They are mainly due to data errors (measurement or experimental errors, data collection or processing errors, etc.) or to naturally singular behavior that differs from the norm (e.g. the very few people aged over 100 years). Keeping them in the dataset can significantly distort your statistical analysis and modeling conclusions: they can shift the mean and inflate the standard deviation, to name just two effects.

Thus, it is very important to detect them accurately and handle them, either by removing the outlying observations or by capping them at a predefined value.
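For illustration, here is a minimal pandas sketch (not from the original article; the column name and the bounds are made up) showing both handling strategies on a toy column:

import pandas as pd

# Hypothetical example: a numeric column containing one extreme value
df = pd.DataFrame({"value": [1.2, 0.8, 1.1, 0.9, 42.0]})

# Strategy 1: drop the observations outside a chosen range
lower, upper = df["value"].quantile([0.05, 0.95])
df_dropped = df[df["value"].between(lower, upper)]

# Strategy 2: cap (clip) the values at the same bounds instead of dropping rows
df_capped = df.assign(value=df["value"].clip(lower, upper))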

Alibi Detect is an open-source Python library for outlier, adversarial and drift detection. In this article, we focus on the outlier detection capabilities of the library. Alibi Detect covers several outlier detection methods, such as Mahalanobis distance, Isolation Forest and Seq2Seq, and handles several types of input data, such as tabular, image, text or time series. Depending on the type of data, different algorithms are required; the library's documentation summarizes the available algorithms and their corresponding data types.

In this article, we focus on two different algorithms for tabular data: Isolation Forest and Mahalanobis distance. If you want to explore further algorithms and examples, I recommend checking the library’s documentation and this article about image outlier detection with an autoencoder in Alibi Detect.

Isolation Forest

Isolation Forest is an unsupervised outlier detection technique based on decision trees that works on the principle of detecting and isolating the outliers instead of the “normal” observations. It is based on the fact that outliers are distinct from normal observations and thus are easily isolated.

More specifically:

1- It randomly selects one attribute (feature)

2- It splits the data according to a random value between the minimum and maximum values of this attribute

3- Each observation ends up with an attribute value that is either higher or lower than the random split value, so the data is partitioned in two

4- These splitting iterations are repeated until the observation is isolated from the rest of the data points. If the observation is an outlier, its attribute values are far from the rest of the dataset, so only a few splits are enough to fully isolate it; an inlier, on the other hand, has values close to the bulk of the distribution, so after only a few splits several data points still fall on the same side of the random split value.

As you can see in the 2-D example above (a dataset with two features), the outlier is easily detectable: it only requires two random splits to be isolated, whereas the normal data points require many more splits.
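To make the splitting procedure more concrete, here is a minimal sketch (an illustration only, not the library's implementation) that isolates a single observation with random axis-aligned splits and counts how many splits were needed; all names and values are made up:

import numpy as np

def isolation_path_length(X, x, rng, max_depth=50):
    """Number of random splits needed to isolate observation x from X."""
    depth = 0
    current = X
    while len(current) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])                   # pick a random attribute
        split = rng.uniform(current[:, feature].min(),
                            current[:, feature].max())       # random split value
        # keep only the points that fall on the same side of the split as x
        same_side = (current[:, feature] <= split) == (x[feature] <= split)
        current = current[same_side]
        depth += 1
    return depth

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(1000, 2))   # "normal" observations
outlier = np.array([8.0, 8.0])         # a clearly abnormal point
inlier = X[0]
# On average, the outlier is isolated in far fewer splits than the inlier
print(np.mean([isolation_path_length(np.vstack([X, outlier]), outlier, rng) for _ in range(20)]))
print(np.mean([isolation_path_length(X, inlier, rng) for _ in range(20)]))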

We define:

  • An observation x from a dataset of m observations
  • h(x), the path length to reach observation x from the root of a tree: it is the number of splits performed until x is fully isolated
  • E(h(x)), the average of h(x) over all the trees of the Isolation Forest: the average, over the trees of the forest, of the number of splits needed to isolate the observation x
  • c(m), which estimates the average depth of the trees. As described in Wikipedia, “The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST. As a consequence, the estimation of average h(x) for external node terminations is the same as that of the unsuccessful searches in BST”, i.e.:

c(m) = 2H(m − 1) − 2(m − 1)/m

where H(i) is the i-th harmonic number, which can be estimated by ln(i) + γ, γ being the Euler-Mascheroni constant.

Therefore, mathematically, Isolation Forest assigns each observation an anomaly score defined as:

s(x, m) = 2^(−E(h(x)) / c(m))

In other terms, the anomaly score compares the expected number of splits needed to isolate an observation x with the average number of splits needed for a dataset of size m. We distinguish different cases:

  • If the observation x is an outlier, it needs fewer iterations to be isolated than normal observations; consequently, E(h(x)) is small compared to c(m) and s(x, m) is close to 1
  • If the observation is an inlier, E(h(x)) is close to c(m), so s(x, m) is close to 2⁻¹ = 0.5

The Alibi Detect Isolation Forest detector modifies the anomaly score to make it easier to interpret: scores close to 0 correspond to inliers and scores close to 0.5 to outliers.
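As a quick numerical sanity check (a sketch based on the formulas above, not code taken from the library), we can evaluate the score in both cases; the path-length values below are made up for illustration:

import numpy as np

EULER_GAMMA = 0.5772156649

def c(m):
    """Average path length of an unsuccessful BST search over m points."""
    if m <= 1:
        return 0.0
    harmonic = np.log(m - 1) + EULER_GAMMA   # H(m-1) ≈ ln(m-1) + γ
    return 2 * harmonic - 2 * (m - 1) / m

def anomaly_score(avg_path_length, m):
    """s(x, m) = 2 ** (-E(h(x)) / c(m))."""
    return 2 ** (-avg_path_length / c(m))

m = 10_000
print(round(anomaly_score(c(m), m), 3))   # E(h(x)) ≈ c(m): inlier, score = 0.5
print(round(anomaly_score(3.0, m), 3))    # isolated after only ~3 splits: score well above 0.5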

Mahalanobis Distance

Mahalanobis distance is a measure of the distance between an observation and a distribution. More specifically, it measures how many standard deviations away an observation is from the center of the distribution.

Consider a set of observations with N variables (N columns), a mean vector

μ = (μ1, μ2, …, μN)ᵀ

and a covariance matrix C. The Mahalanobis distance of an observation

x = (x1, x2, …, xN)ᵀ

to the distribution is:

D_M(x) = √((x − μ)ᵀ C⁻¹ (x − μ))

The advantage of this general formulation over the Euclidean distance (for which C is the identity matrix) is that it takes the distribution of the data into account through the covariance matrix, as shown in the formula above. The covariance matrix describes how the variables evolve together, so using it in the distance gives a more accurate detection of outliers. The figure below gives a concrete example with two-dimensional data (two features). From the Euclidean point of view, observation A is closer to the center of the data than observation B, which is relatively far from it; using this method, we would therefore classify A as an inlier and B as an outlier. However, the data follows a distribution with a correlation between the two features x and y. B is generated from the same distribution as the data and should be considered an inlier, while A does not belong to this distribution and should be considered an outlier.
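A small numerical sketch (with made-up correlated data, not taken from the article) shows how the two distances can disagree:

import numpy as np

rng = np.random.default_rng(0)
# Strongly correlated 2-D Gaussian data
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

mu = X.mean(axis=0)
C_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ C_inv @ d)

A = np.array([1.5, -1.5])   # close to the center, but against the correlation
B = np.array([2.5, 2.5])    # farther away, but along the correlation

for name, p in [("A", A), ("B", B)]:
    print(name, "euclidean:", round(np.linalg.norm(p - mu), 2),
          "mahalanobis:", round(mahalanobis(p), 2))
# A has the smaller Euclidean distance but the larger Mahalanobis distance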

In the Alibi Detect library, the algorithm computes an outlier score, which is the distance from the center of the distribution. The user then defines an outlier threshold: if the score is higher than the threshold, the observation is flagged as an outlier. Note that the method is suited to low- to medium-dimensional tabular data.

Practice makes perfect: tabular data outlier detection

In this section, we’re going to use both algorithms described above, as implemented in Alibi Detect, to detect outlier data points in a tabular dataset.

For this, we create a dataset of 10,000 observations with 2 features, “x” and “y”, in which 2% of the points are outliers created by adding random noise to observations drawn from the original dataset.
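The data-generation code is not shown in the article; a possible sketch producing a dataset of the same shape could look like the following (the column names match the code below, but the exact noise mechanism and distribution parameters are assumptions):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_total, outlier_share = 10_000, 0.02
n_outliers = int(n_total * outlier_share)
n_inliers = n_total - n_outliers

# Inliers: correlated 2-D Gaussian data
inliers = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n_inliers)

# Outliers: inlier points perturbed with large random noise
outliers = (inliers[rng.choice(n_inliers, n_outliers, replace=False)]
            + rng.normal(0, 5, size=(n_outliers, 2)))

data = pd.DataFrame(np.vstack([inliers, outliers]), columns=["x", "y"])
data["is_outlier"] = [0] * n_inliers + [1] * n_outliers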

1- Isolation Forest outlier detection:

We split the dataset into a train set and a test set, and train the model on the normal data only, as the purpose is to learn the normal behavior and distinguish it from outlier behavior.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# We split the data into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    data[["x", "y"]],
    pd.DataFrame(data["is_outlier"]),
    test_size=0.2,
    random_state=42)

# We keep only the NORMAL points for training: the outlier detection
# algorithm needs to learn the normal distribution, and based on it
# we detect the abnormal (outlier) values
X_train_normal = X_train.loc[y_train[y_train.is_outlier == 0].index]

In practice, there are two possibilities to define an observation as an outlier: (1) either we fix a threshold score above which an observation is considered an outlier, or (2) the top epsilon % of observations with the highest scores are considered outliers. In both cases, the threshold or the percentage has to be chosen by cross-validation. It can also be useful to perform some exploratory analysis to get a better idea of the values to test.

In this test, we knew the share of outliers in our dataset (2%), so we chose the second option and tested different percentages: the model flags the epsilon % of the data with the highest outlier scores as outliers.

from alibi_detect.od import IForest
from sklearn.metrics import f1_score, precision_score, recall_score

results = []
for epsilon in [20., 5., 2., 1.]:
    # Create the Isolation Forest model
    od = IForest(
        max_samples="auto"  # default subsample size per tree
    )
    # Fit the model on the normal data only
    od.fit(
        X_train_normal
    )
    # Infer the score threshold so that epsilon % of the data is flagged
    od.infer_threshold(
        X_train,
        threshold_perc=100 - epsilon  # percentage of normal data
    )
    # Detect outliers
    preds = od.predict(
        X_test,
        return_instance_score=True
    )
    # Check performance
    y_pred = preds["data"]["is_outlier"]
    f1 = f1_score(y_test.is_outlier, y_pred)
    precision = precision_score(y_test.is_outlier, y_pred)
    recall = recall_score(y_test.is_outlier, y_pred)
    results.append([100 - epsilon, f1, precision, recall])

results = pd.DataFrame(results,
                       columns=["percentage_inliers",
                                "f1_score", "precision", "recall"])

2- Mahalanobis distance outlier detection:

For this part, we evaluate outlier detection on the same data used for the Isolation Forest. Unlike the Isolation Forest, according to the documentation we don’t need to fit the model to the data unless it contains categorical features, which is not the case here. We therefore predict the outliers in our dataset directly using the distance. We mainly need to specify a threshold distance from the center of the distribution above which a data point is flagged as an outlier. For this, we try different threshold values:

from alibi_detect.od import Mahalanobis

results = []
for th in [1, 5, 10., 15., 20., 30., 35.]:
    # Create the Mahalanobis detector with a given distance threshold
    od = Mahalanobis(
        threshold=th
    )
    # Predict outliers directly (no fit needed for purely numerical data)
    preds = od.predict(
        np.array(X_test),
        return_instance_score=True
    )
    # Check performance
    y_pred = preds["data"]["is_outlier"]
    f1 = f1_score(y_test.is_outlier, y_pred)
    precision = precision_score(y_test.is_outlier, y_pred)
    recall = recall_score(y_test.is_outlier, y_pred)
    results.append([th, f1, precision, recall])

results = pd.DataFrame(results,
                       columns=["threshold", "f1_score",
                                "precision", "recall"])

Results analysis:

The modeling results in the graph below show that:

  • For the Isolation Forest model, choosing a percentage of inliers close to the real one considerably improves the ability to detect the outliers (the best performance is obtained for the real share of outliers in the data)
  • For the Mahalanobis distance method, if the threshold is very low, most of the normal data points are flagged as outliers along with the real ones, leading to poor performance (visible in the graph as low precision and high recall). On the other hand, if the threshold is very high, only the most extreme outliers are detected correctly while closer outliers are classified as normal data, which also leads to poor performance (high precision but very low recall in our evaluation)

Conclusion

In this article, we presented a practical Python library for anomaly detection, Alibi Detect, and took a deep dive into two of its methods: Mahalanobis distance and Isolation Forest. Although the different algorithms are quite straightforward and easy to manipulate, the library adds hyperparameters to the models (share of outliers, outlier score threshold). It is therefore important to explore the dataset (for example by plotting the data and using statistical analysis) in order to get an estimation of the inlier and outlier distributions, especially since in real-world data the points are not labeled as outliers or inliers, as they are in our test dataset.

References:

1- Liu, Ting & Zhou, “Isolation Forest” (ICDM 2008): https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

2- Wikipedia, “Mahalanobis distance”: https://en.wikipedia.org/wiki/Mahalanobis_distance

3- Wikipedia, “Outlier”: https://en.wikipedia.org/wiki/Outlier
