Multivariate joint probabilistic outlier detection through Mahalanobis distance

Rishi Chandra
Epsilon Engineering Blog
7 min read · Mar 12, 2024


Figure 1: Contours of bi-variate Gaussian distributions representing Pr(x1, x2) with +ve, 0, and -ve correlation between x1 and x2. A contour is a curve along which the probability is the same. Note that when x1 and x2 are independent the contour is circular; otherwise it is elliptical along a diagonal, reflecting the correlation. (image by author)

Often, when we have multivariate data, univariate outlier-removal techniques such as the z-score are not enough, because they focus on the variation of only one variate and ignore its interactions with the other covariates present.

Individual values of the variates X1 and X2 are not outliers; however, their joint co-occurrence is. Problem definition: we need to identify such outliers, which arise from the rarity of the joint co-occurrence of the variates.

For example, let x1 and x2 represent the weather conditions in Bangalore for the first and second half of the day, respectively, as shown above. x1='very_hot' alone is not an outlier because it is common in summer. Similarly, x2='very_cold' alone is not an outlier because it is common in winter. However, x1='very_hot' and x2='very_cold' on the same day is not common and is clearly an outlier. Hence, we need a model that understands the interactions between the variables, identifies which kinds of interactions are uncommon, and labels them as outliers.

The joint probability over variates (x1, x2, ..., xn) represents Pr(x1 and x2 and ... and xn). Joint probability models can identify such uncommon interactions (like x1='very_hot' and x2='very_cold' on the same day) owing to their extremely low probability of occurrence, and Mahalanobis distance is one such model that can identify these outliers in a multivariate world!
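To make this concrete, here is a minimal sketch (with hypothetical day counts, not real Bangalore weather data) that estimates the joint probability Pr(x1, x2) from co-occurrence counts and shows why the rare pair stands out even though each value is individually common:

import pandas as pd

# hypothetical observations of (first-half, second-half) weather over 1000 days
days = pd.DataFrame({
    "x1": ["very_hot"] * 300 + ["cold"] * 300 + ["mild"] * 399 + ["very_hot"],
    "x2": ["hot"] * 300 + ["very_cold"] * 300 + ["mild"] * 399 + ["very_cold"],
})

# marginal probabilities: both values are individually common (~0.3 each)
print(days["x1"].value_counts(normalize=True)["very_hot"])
print(days["x2"].value_counts(normalize=True)["very_cold"])

# joint probability table Pr(x1, x2): the pair co-occurs only once (~0.001),
# so a joint model flags it as an outlier while univariate checks do not
joint = pd.crosstab(days["x1"], days["x2"], normalize=True)
print(joint.loc["very_hot", "very_cold"])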

Bi-variate Gaussian distribution visualization

Let's visualize a 2-dimensional joint Gaussian probability distribution, shown in Figure 2. It is bell shaped. Imagine placing the bell over a 2-dimensional grid of points (x1, x2); the height of the bell at each point represents the probability of that point, Pr(x1, x2). This is what we call a joint probability distribution over x1 and x2.

Figure 2, bi-variate joint probability Gaussian distributions representing Pr(x1, x2) with +ve, 0, and -ve correlation between x1 and x2. Note that the probability density function is bell shaped. Its base is circular if x1 and x2 are independent; otherwise the base is elliptical along a diagonal. (image by author)

Let's start with an understanding of the bi-variate Gaussian distribution model. Figure 2 plots a bi-variate Gaussian distribution over the bi-variates (x1, x2), and Figure 1 displays the corresponding contours. A contour is a curve along which the probability is the same. (A small plotting sketch after the list below shows one way to reproduce these contours.)

i. Comprehend it like this: when x1 and x2 are not correlated, their scatter is similar along both the x1 and x2 axes, since both have a standard deviation of 1. This is portrayed by the circular contours in the middle figures above.

ii. However, if x1 and x2 are positively correlated, x1 increases when x2 increases, and vice versa. The figures on the left above portray this positive-correlation behavior through their elliptical contours along the major diagonal.

iii. Similarly, the figures on the right above show the case where x1 and x2 are negatively correlated, with elliptical contours along the minor diagonal.
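The contour plots above can be reproduced with a few lines of code. The sketch below is a minimal illustration; the correlation values of ±0.7 are my own illustrative choices, not necessarily the exact settings used for the figures:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# evaluate each bi-variate Gaussian on a grid of (x1, x2) points
x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack((x1, x2))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, rho in zip(axes, [0.7, 0.0, -0.7]):
    cov = [[1.0, rho], [rho, 1.0]]   # unit variances, correlation rho
    pdf = multivariate_normal(mean=[0, 0], cov=cov).pdf(grid)
    ax.contour(x1, x2, pdf)          # circular when rho = 0, elliptical otherwise
    ax.set_title(f"correlation = {rho}")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_aspect("equal")
plt.show()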

Joint probability model for outlier removal

Figure 3, Pr(red point) is almost 0 as it lies far from the mean. This point is more likely to be an outlier than another point closer to the mean. (image by author)

In Figure 3 above, a plot of the joint Gaussian distribution, we see that its shape is like a bell, which flattens out as we move away from the center, i.e. the mean. Hence, if we have the Gaussian distribution for the observed data points, then points far away from the mean will have a very low probability and are more likely to be outliers. This is the principal idea behind the Mahalanobis distance. But when we already have the Euclidean distance, why do we need the Mahalanobis distance?

Why not Euclidean and why Mahalanobis distance?

If we have uncorrelated bi-variates, the contours are circular, as shown in the middle of Figure 1 above. In such a case the Euclidean distance works well: any point at a radial distance greater than, say, 2 from the mean (0, 0) can be identified as an outlier. However, the same cannot be done for the other cases, where the contours are not circular, as explained in Figure 4 below:

Figure 4, the image on the left is the joint probability density function, and some points sampled from it are shown on the right. We draw a contour in green on the right at a standard deviation of, say, 2. The green point is within 2 standard deviations and thus is not an outlier. Though the red point is at the same Euclidean distance r from the origin as the green point, the red one is an outlier since it is more than 2 standard deviations away from the origin (mean). This green contour, though varying in Euclidean distance from the origin, has the same Mahalanobis distance from the origin! (image by author)

In Figure 4, the left image is a bi-variate joint Gaussian distribution from which some points are sampled, as shown on the right. There are two points, one red and one green, both at an equal Euclidean distance r from the center (mean). A contour is drawn in green at a distance of, say, 2 standard deviations from the center. We notice that the green point is inside the contour and the red one is outside it. This means the red point is more than 2 standard deviations away and hence can be an outlier, whereas the green one is not. To express this distance accurately in terms of standard deviations, we use the Mahalanobis distance D, defined as:

D(x) = √( (x − 𝜇)ᵀ Σ⁻¹ (x − 𝜇) )

where:

x: the n-dimensional data point for which we want to calculate the distance.

𝜇: the mean vector of the data.

Σ: the covariance matrix of the data.

These contours, though they vary in Euclidean distance from the mean, have the same Mahalanobis distance from the mean! For details on the Mahalanobis distance, see:

https://www.sciencedirect.com/topics/mathematics/mahalanobis-distance
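As a quick numeric illustration (the points and covariance below are made-up values, not taken from the figures), two points at the same Euclidean distance from the mean can have very different Mahalanobis distances when the features are correlated:

import numpy as np
from scipy.spatial.distance import mahalanobis

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.7],
                [0.7, 1.0]])      # positively correlated features
VI = np.linalg.inv(cov)           # inverse covariance matrix, Sigma^-1

green = np.array([2.0, 2.0])      # lies along the major diagonal (with the correlation)
red = np.array([2.0, -2.0])       # same Euclidean distance, but against the correlation

for name, p in [("green", green), ("red", red)]:
    d_euclid = np.linalg.norm(p - mean)
    d_mahal = mahalanobis(p, mean, VI)
    print(name, round(d_euclid, 2), round(d_mahal, 2))

# green: Euclidean 2.83, Mahalanobis ~2.17
# red:   Euclidean 2.83, Mahalanobis ~5.16 -> far more likely an outlier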

Now let’s see some hands-on coding for this.

Multivariate outlier detection with Mahalanobis distance

i. We will synthetically generate 2-dimensional points representing rows containing two features from a bi-variate Gaussian distribution.

ii. We will add some outliers to the set.

iii. We will do outlier elimination using uni-variate techniques such as z-score. Note that this is not able to remove all the outliers.

iv. Finally, we will use the Mahalanobis distance to eliminate the remaining outliers. But the distance alone is not enough, because we cannot interpret a raw distance as such. Hence, we derive the probability of the point from its distance, also called the "p-value" (see the short sketch right after this list).
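Before the code, a quick note on step iv (the constants below are illustrative): for d-dimensional Gaussian data, the squared Mahalanobis distance follows a chi-square distribution with d degrees of freedom, which is what lets us convert a distance into a p-value.

from scipy.stats import chi2

d = 2                      # number of features (x1, x2)
squared_distance = 9.0     # e.g. a point about 3 "standard deviations" from the mean

# probability of seeing a point at least this far out under the fitted Gaussian
p_value = 1 - chi2.cdf(squared_distance, df=d)
print(p_value)             # ~0.011, a rare point and hence a likely outlier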

Generate rows containing 2 features from a bi-variate Gaussian distribution.

import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

covar = np.array([
    [1.0, 0.7],
    [0.7, 1.0]
])

# get a 2-d Gaussian distribution function
distr = multivariate_normal(cov=covar, mean=[0,0], seed=47)

# draw 2000 samples from the distribution
data = distr.rvs(size = 2000)
print("data.shape =", data.shape)

# Plotting the generated samples
plt.plot(data[:,0], data[:,1], '.', c='yellow', markeredgewidth=0.5, markeredgecolor='black')
plt.title('covariance(x1,x2) = 0.7')
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('equal')
plt.show()

# Output: data.shape = (2000, 2)

This generates 2000 rows containing 2 features. Their plot is as shown in Figure 5 below.

Figure 5, a plot of 2000 data samples containing 2 features generated by the code snippet above. Horizontal axis is x1, vertical axis is x2. (image by author)

Add outliers to the data set.

outliers = np.array([
    [-1.9, 1.9],
    [-2.5, 1.9],
    [-1.9, 1.7],
    [-1.7, 2],
    [-1.5, 1.8],
    [-3, 3],
    [-2.5, 3.5],
    [-3.5, 3],
    [3, -3],
    [2.5, -3.5],
    [3.5, -3],
    [1.9, -1.9],
    [2.5, -2],
    [1.9, -1.7],
    [1.7, -1.9],
    [1.5, -1.8],
])

# Plotting the samples and outliers
plt.plot(data[:,0], data[:,1], '.', c='yellow', markeredgewidth=0.5, markeredgecolor='black')
plt.plot(outliers[:,0], outliers[:,1], 'o', c='red', markeredgewidth=0.5, markeredgecolor='red')
plt.title('covariance(x1,x2) = 0.7 and outliers in red!')
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('equal')
plt.show()

# total data points = 2000+16 = 2016
Figure 6, red dots represent outliers as added by the above piece of code. Horizontal axis is x1, vertical axis is x2.(image by author)

Remove outliers: univariate z-score outlier removal.

Remove points where any of the features has a z-score > 2.

# all cols are already scaled to have stdev=1
# hence, a z-score of 2 for each col is at value 2 only
# remove points where any feature has a z-score > 2
indexes = (np.abs(data) <= 2).all(axis=1)
data = data[indexes]

indexes = (np.abs(outliers) <= 2).all(axis=1)
outliers = outliers[indexes]

# Plotting the samples and outliers
plt.plot(data[:,0], data[:,1], '.', c='yellow', markeredgewidth=0.5, markeredgecolor='black')
plt.plot(outliers[:,0], outliers[:,1], 'o', c='red', markeredgewidth=0.5, markeredgecolor='red')
plt.title('covariance(x1,x2) = 0.7 and outliers in red!')
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('equal')
plt.show()

# data points now = 1851 + 8 = 1859 (the z-score filter removed 157 of the 2016 points)
Figure 7, some outliers are removed and some remain after the above code. Horizontal axis is x1, vertical axis is x2. (image by author)

Remove all the outliers using Mahalanobis distance.

# add both outliers and data in the same array
data = np.vstack([data, outliers])
print("data.shape =", data.shape)

# Output: data.shape = (1859, 2)

# Mahalanobis outlier treatment!
def calculateMahalanobis(data, cov=None):
    """Calculate the squared Mahalanobis distance of each row of `data`."""
    data_mu = data - np.mean(data, axis=0)
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    # row-wise (x - mu)^T Sigma^-1 (x - mu), i.e. the squared distance
    mahalanobis = np.einsum('ij,jk,ki->i', data_mu, inv_covmat, data_mu.T)
    return mahalanobis

X = pd.DataFrame(data, columns=["x1", "x2"])

# column 'mahalanobis' now contains the distance metric.
X['mahalanobis'] = calculateMahalanobis(data=X)

# calculate the p-value corresponding to each (squared) Mahalanobis distance;
# it follows a chi-square distribution with df = number of features
# (X.shape[1]-1 == 2 here, since X also carries the 'mahalanobis' column)
X['p'] = 1 - scipy.stats.chi2.cdf(X['mahalanobis'], X.shape[1]-1)

# drop rows where p-value <= 0.1
X = X[X.p > 0.1]

# drop the cols which are not required now
X = X.drop(columns=["p", "mahalanobis"])
print("X.shape=", X.shape)

# Output: X.shape= (1707, 2)
# Note from a total of 2016 points we dropped
# down to 1707 data points after outlier removal !

We find that all the outliers are removed by the above code snippet as we plot the remaining data points:

# Plotting the remaining samples after Mahalanobis-based outlier removal
data = X.values
plt.plot(data[:,0], data[:,1], '.', c='yellow', markeredgewidth=0.5, markeredgecolor='black')
plt.title('covariance(x1,x2) = 0.7 after outlier removal')
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis('equal')
plt.show()
Figure 8, all outliers are removed by the above code snippet! Horizontal axis is x1, vertical axis is x2. (image by author)

Note that from a total of 2016 points, we dropped down to 1707 data points after outlier removal!

A Note

This approach assumes that the underlying distribution of the observed data is Gaussian. However, it also works reasonably well when that is not entirely true. We can also try to make the data better resemble a Gaussian distribution by applying standardization (center the data at the origin and divide by the standard deviation) and Box-Cox transformations.
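For example, a skewed positive-valued feature could be nudged towards a Gaussian shape before applying the method. The sketch below is a minimal illustration; the lognormal data is synthetic, chosen only to show the transformations:

import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(47)
skewed = rng.lognormal(mean=0.0, sigma=0.75, size=2000)  # positive, right-skewed feature

transformed, lam = boxcox(skewed)   # Box-Cox requires strictly positive values
standardized = (transformed - transformed.mean()) / transformed.std()

print("fitted Box-Cox lambda =", round(lam, 3))
print("mean ~", round(standardized.mean(), 3), "std ~", round(standardized.std(), 3))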

For the full code, please see the GitHub page here.


Data Science Manager, Epsilon. Worked in Oracle, Hitachi R&D etc. Master in CSE, IIT-Bombay. Has patents and publications relating to Statistics, ML.