Effective Outlier Detection Techniques in Machine Learning

Mehul Ved
5 min read · Apr 6, 2018

From a Machine Learning perspective, tools for Outlier Detection and Outlier Treatment hold great significance, because outliers can have a strong influence on the predictive model. In this blog, we address a few techniques in Outlier Detection.

Read the blogpost (link mentioned below), to understand more about Outlier and Anomaly Detection.

Before we dive deep into the subject, let us briefly understand what Outliers are and why Outlier Detection matters.

Outlier Detection Techniques

What are Outliers?

In Data Science, an Outlier is an observation point that is distant from other observations. An Outlier may be due to variability in the measurement or it may indicate experimental error.

Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers because they may not be unusually far from other observations.

Figure 1 below provides a visual understanding of Outliers.

Figure 1: A Visual Understanding of Outliers

Outliers exist due to one of the four following reasons:

· Incorrect data entry can cause data to contain extreme cases.

· A second reason for outliers can be failure to indicate codes for missing values in a dataset.

· Another possibility is that the case did not come from the intended sample.

· And finally, the distribution of the sample for specific variables may have a more extreme distribution than normal.

The Importance of Outlier Detection

While Outliers are often attributed to rare chance and may not necessarily be fully explainable, they can distort predictions and reduce accuracy if you don’t detect and handle them.

The contentious decision to consider or discard an outlier needs to be taken at the time of building the model. Outliers can drastically bias/change the fit estimates and predictions. It is left to the best judgement of the analyst to decide whether treating outliers is necessary and how to go about it.

Treating or altering the outlier/extreme values in genuine observations is not a standard operating procedure. If a data point (or points) is excluded from the data analysis, this should be clearly stated in any subsequent report.

Figure 2 below illustrates how drastically the line of fit changes between the model fitted with the Outliers and the model fitted after discarding them.

Figure 2: A Simple Case of Change in Line of Fit with and without Outliers
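
The effect on a fitted line can be reproduced with a small sketch. The data here is made up for illustration (a perfect line y = 2x + 1 with one corrupted value), and NumPy's `polyfit` stands in for whatever regression routine you actually use.

```python
import numpy as np

# Hypothetical data: ten points on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Corrupt a single observation: the last value should be 19.
y_with_outlier = y.copy()
y_with_outlier[-1] = 60.0

# Fit a straight line (degree-1 polynomial) to both versions.
slope_clean, intercept_clean = np.polyfit(x, y, 1)
slope_out, intercept_out = np.polyfit(x, y_with_outlier, 1)

print(f"without outlier: slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier:    slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

One bad value out of ten is enough to roughly double the estimated slope in this toy case, which is why the decision to keep or discard such a point deserves attention.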

The Various Approaches to Outlier Detection

Univariate Approach:

A univariate outlier is a data point that consists of an extreme value on one variable.

The Box Plot Rule

For a given continuous variable, outliers are those observations that lie below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the 25th and 75th percentiles (the first and third quartiles) and IQR, the ‘Inter Quartile Range’, is the difference Q3 - Q1. This is also known as “The Box Plot Rule”.

The box plot rule is one of the simplest statistical techniques applied to detect univariate outliers. Typically, in the Univariate Outlier Detection Approach, you look at the points outside the whiskers in a box plot.

Figure 3: The Box Plot Rule for Univariate Outlier Detection
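
A minimal sketch of the box plot rule, assuming NumPy and a made-up sample. The function name `iqr_outliers` and the parameter `k` are illustrative choices, not part of any library.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking points outside the box-plot fences."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
    iqr = q3 - q1                          # inter quartile range
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

# Hypothetical sample with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 45]
mask = iqr_outliers(data)
outliers = [v for v, flagged in zip(data, mask) if flagged]
print(outliers)  # [45]
```

The multiplier 1.5 is the conventional whisker length; a larger value such as 3 is sometimes used to flag only the most extreme points.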

Grubbs’ Test for Univariate Analysis:

Grubbs’ test (also known as the maximum normed residual test) is widely used to detect anomalies in a univariate data set, under the assumption that the data is generated by a Gaussian distribution. It tests one suspected outlier at a time: the observation farthest from the sample mean.
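
As a rough sketch, the two-sided Grubbs’ test can be written with NumPy and SciPy as shown here. The statistic is G = max|x_i - mean| / s, compared against a critical value derived from the t distribution. The function name `grubbs_test`, the significance level and the sample data are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Returns (index of the most extreme point, G statistic,
    critical value, whether G exceeds the critical value).
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)        # sample standard deviation
    idx = int(np.argmax(np.abs(x - mean)))    # most extreme observation
    g = abs(x[idx] - mean) / sd
    # Critical value based on the t distribution with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return idx, float(g), float(g_crit), bool(g > g_crit)

# Hypothetical sample with one suspicious value at the end.
data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 45]
idx, g, g_crit, is_outlier = grubbs_test(data)
print(idx, round(g, 3), round(g_crit, 3), is_outlier)
```

Because the test assumes normality and targets a single point, it is usually worth checking the distribution first and re-running the test after removing a confirmed outlier.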

Multivariate Approach:

Declaring an observation as an outlier based on just one (rather unimportant) feature could lead to unrealistic inferences. When you have to decide whether an individual entity (represented by a row or observation) is an extreme value or not, it is better to collectively consider the features (X’s) that matter.

A multivariate outlier is a combination of unusual scores on at least two variables.

Several methods are used to identify outliers in multivariate datasets. Two of the widely used methods are:

· Mahalanobis Distance

· Cook’s Distance

Mahalanobis Distance:

Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation. Mahalanobis distance is also used to determine multivariate outliers.

In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal.

Mahalanobis Distance
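
For a point x, sample mean μ and covariance matrix Σ, the Mahalanobis distance is sqrt((x - μ)ᵀ Σ⁻¹ (x - μ)). The following sketch, assuming NumPy and a made-up two-variable dataset, computes it for every row. The function name `mahalanobis_distances` is illustrative, and the pseudo-inverse is used here only as a precaution against a singular covariance matrix.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)          # columns are variables
    cov_inv = np.linalg.pinv(cov)          # pseudo-inverse for robustness
    diff = X - mu
    # Quadratic form diff_i^T * cov_inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)

# Hypothetical data: two strongly correlated variables. The last row (2, 8)
# is within the range of each variable alone but breaks the joint pattern.
X = np.array([
    [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8], [5, 5.1],
    [6, 5.9], [7, 7.2], [8, 7.8], [9, 9.1], [2, 8.0],
])
distances = mahalanobis_distances(X)
most_unusual = int(np.argmax(distances))
print(most_unusual, np.round(distances, 2))
```

The last row would pass a univariate check on either column, which is the kind of case the multivariate approach is meant to catch.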

Cook’s Distance:

Cook’s distance is a measure computed with respect to a given regression model and is therefore impacted only by the X variables included in the model. But what does Cook’s distance mean? It computes the influence exerted by each data point (row) on the predicted outcome.

The Cook’s distance for each observation i measures the change in Y-hat (fitted Y) for all observations with and without the presence of observation i, so we know how much the observation i impacted the fitted values.

Cook’s Distance

In general use, those observations that have a Cook’s distance greater than 4 times the mean may be classified as influential. This is not a hard boundary.
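
A minimal sketch of this rule for an ordinary least squares fit, assuming NumPy. It uses the closed form D_i = (e_i² / (p · s²)) · h_ii / (1 - h_ii)², where e_i is the residual, h_ii the leverage, p the number of fitted coefficients and s² the residual variance. The function name `cooks_distance` and the data are made up for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each observation for an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])      # design matrix
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
    resid = y - A @ beta
    # Leverage values: diagonal of the hat matrix A (A^T A)^-1 A^T.
    h = np.einsum("ij,jk,ik->i", A, np.linalg.pinv(A.T @ A), A)
    s2 = resid @ resid / (n - p)                   # residual variance
    return (resid**2 / (p * s2)) * h / (1 - h) ** 2

# Hypothetical data: a line y = 2x + 1 with one corrupted last value.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[-1] = 60.0

d = cooks_distance(x, y)
influential = d > 4 * d.mean()       # the "4 times the mean" rule of thumb
flagged = np.where(influential)[0].tolist()
print(flagged)  # [9]
```

Other cut-offs, such as 4/n, are also in common use, so treat the threshold as a prompt to inspect a point rather than a verdict.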

Simplifying Approach Selection for Outlier Detection

Figure 5 below is a general guideline on selecting an approach for Outlier Detection.

General Guiding Principle to Outlier Detection Approach Selection

Summary & Conclusion:

The contentious decision to consider or discard an Outlier needs to be taken at the time of building the model. Outliers can drastically bias/change the fit estimates and predictions. It is left to the best judgement of the analyst to decide whether treating outliers is necessary and how to go about it.
