Outliers make us go MAD: Univariate Outlier Detection

João Rodrigues
May 17, 2018 · 6 min read


Outliers are observations that deviate markedly from other observations of the same sample.

They can be the result of natural variability, or of errors in the data; in either case, detecting and dealing with outliers is an essential part of building a credit risk model.

To showcase the problems outliers can cause, let’s take a look at a very simple lending dataset without outliers; this will serve as our baseline.

In this dataset, we have only two fields for each debtor: “debt_to_income” and “default_status”.

default_status:

  • Tells us whether the debtor ended up defaulting on their loan or not. If it equals 0, they paid back the loan; if it equals 1, they did not pay back the loan.

debt_to_income (DTI):

  • The value of the monthly debt payments of the debtor, divided by their gross monthly income.
  • This means a client with a DTI of 0.2 spends 20% of their monthly income on monthly debt payments.
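
To make the examples below concrete, here is a minimal sketch of how a toy dataset with these two fields could be simulated. The sample size of 94 matches the small sample mentioned later in the post, but the coefficients and the random seed are illustrative assumptions, not the data behind the original plots:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n = 94  # the small-sample size used later in the post
debt_to_income = rng.uniform(0.0, 0.6, size=n)

# Higher DTI -> higher probability of default (illustrative coefficients)
logits = 12 * (debt_to_income - 0.3)
default_prob = 1 / (1 + np.exp(-logits))
default_status = rng.binomial(1, default_prob)

df = pd.DataFrame({"debt_to_income": debt_to_income,
                   "default_status": default_status})
print(df.head())
```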

We want to build a predictive model that tells us how likely it is for a debtor to default on a loan, given their DTI.

Intuitively, we can already guess that if this ratio is very high, the debtor is likely to default on the loan.

Let’s take a look at the data:

Here we can see that debtors with higher DTI default more often.

In other words, as the DTI increases, so does the probability that the debtor will default on the loan.

We can train a Logistic Regression model to predict the probability of default, given a certain DTI (a sketch of the fit follows the list below):

  • The x-axis shows us the DTI.
  • The y-axis shows us the predicted probability of default.
  • The line shows us the predicted probability of default for each DTI value, according to the Logistic Regression model.
  • The blue points correspond to debtors who did not default, and the red points correspond to debtors who defaulted.
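
Here is a minimal sketch of fitting such a model with scikit-learn, reusing the toy df from the snippet above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df[["debt_to_income"]].values
y = df["default_status"].values

model = LogisticRegression()
model.fit(X, y)

# Probability of default along a grid of DTI values (the fitted curve)
grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
prob_default = model.predict_proba(grid)[:, 1]  # column 1 = P(default = 1)
```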

While the Logistic Regression misclassifies points with a DTI around 0.3, it mostly works well; these misclassifications are to be expected, as there is a natural overlap between defaulters and non-defaulters in that region.

But what if our data was contaminated by a single outlier, with a DTI of 6?

  • If that outlier is a defaulter, then generally that wouldn’t be a problem; it follows the observed trend of higher DTI values resulting in a higher probability of default.
  • If that outlier is a non-defaulter, it would greatly affect the model. Let’s see:
  • Before, our model considered a debtor with a DTI of 0.4 extremely likely to default.
  • Now, the model considers a debtor with a DTI of 0.4 only barely more likely to default than not.

This happens because this single outlier runs strongly against the general trend of increasing DTI leading to a higher probability of default, degrading the quality of the model.
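
We can reproduce the effect by appending a single hypothetical non-defaulter with a DTI of 6 and refitting; a sketch, reusing X, y, and model from the previous snippets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One non-defaulting debtor with an extreme DTI of 6
X_out = np.vstack([X, [[6.0]]])
y_out = np.append(y, 0)

model_out = LogisticRegression().fit(X_out, y_out)

before = model.predict_proba([[0.4]])[0, 1]
after = model_out.predict_proba([[0.4]])[0, 1]
print(f"P(default | DTI=0.4) before: {before:.2f}, after: {after:.2f}")
```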

How do we detect outliers?

A common practice, described in several Basel Committee publications, is to detect outliers using percentiles, and then truncate or remove them; for example, we can consider the top 1% and bottom 1% of values to be outliers.
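
A sketch of this percentile rule with NumPy, flagging anything outside the 1st and 99th percentiles (the function name is our own):

```python
import numpy as np

def percentile_outliers(values, lower_pct=1.0, upper_pct=99.0):
    """Flag values below the lower or above the upper percentile."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return (values < lo) | (values > hi)

dti = df["debt_to_income"].values
mask = percentile_outliers(dti)
print(dti[mask])  # the values flagged as outliers
```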

Marked in red are the points considered outliers; as you can see, this detects the outlier that was affecting our model!

However, it also flags the lowest value as an outlier; this is a consequence of using percentile-based methods.

What happens when we have more data? What if instead of having 94 observations, we had 1000 observations?

Using the same percentile-based method, the value of 6 is still considered an outlier, but we are also rejecting points that are otherwise perfectly acceptable!

This is a limitation of percentile-based methods: as the number of observations increases, so does the number of observations flagged as outliers; after all, a percentile-based method will always flat-out reject a fixed percentage of our observations.
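
We can verify this fixed-fraction behaviour directly: with cut-offs at 1% and 99%, roughly 2% of points are always flagged, no matter how clean the data is. A quick check on hypothetical outlier-free normal data, reusing percentile_outliers from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (94, 1000, 100_000):
    clean = rng.normal(loc=0.3, scale=0.1, size=n)  # no outliers here
    flagged = percentile_outliers(clean)
    print(n, flagged.sum())  # grows with n: ~2, ~20, ~2000
```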

An alternative: Modified Z-score method

Let’s take a look at an outlier detection method that does not depend on the number of observations.

Knowing that our data is roughly normally distributed, we could use the Z-score method, by which we would consider points to be outliers based on how much they deviate from the mean; however, the mean is not a robust statistic: it is heavily influenced by outliers, meaning that the very outliers we are trying to detect would distort the method itself.

What if we take the same approach but, instead of using the mean and standard deviation, we use the median and the deviation from the median? The median is a robust statistic, meaning it will not be greatly affected by outliers. This is called the modified Z-score method, and instead of the standard deviation it uses the MAD (Median Absolute Deviation). Yes, the title of this post was a terrible pun.

We will need:

  • x̃, which is just the median of the sample
  • MAD, which is calculated by taking the absolute difference between each point and the median, and then calculating the median of those differences.

We can calculate the modified Z-score Mᵢ of each point xᵢ like this:

Mᵢ = 0.6745 · (xᵢ − x̃) / MAD

0.6745 is the 75th percentile (the third quartile) of the standard normal distribution; for normally distributed data, the MAD converges to 0.6745 times the standard deviation, so this factor puts the score on the same scale as a regular Z-score.

Now we can calculate the score for each point of our sample! As a rule of thumb, we’ll use 3.5 as our cut-off value; this means that every point with an absolute score above 3.5 will be considered an outlier.
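
Putting the pieces together, here is a minimal sketch of the method; the function name and the reuse of the toy df are assumptions of this post’s running example:

```python
import numpy as np

def modified_z_score_outliers(values, threshold=3.5):
    """Flag points whose absolute modified Z-score exceeds the cut-off."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    m = 0.6745 * (values - median) / mad      # assumes MAD > 0
    return np.abs(m) > threshold

# The contaminated sample: our toy DTI values plus the outlier at 6
dti_contaminated = np.append(df["debt_to_income"].values, 6.0)
print(dti_contaminated[modified_z_score_outliers(dti_contaminated)])
```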

Let’s see it in action:

It works correctly for a small sample size …

And it also works for large sample sizes, unlike percentile based methods!

Let’s compare the two methods again, with a few more outliers:

With a small sample size …

And a larger sample size!

While the method described above assumes our data is normally distributed, there are some adaptations we can make if our data is described by another symmetric distribution, or even asymmetric distributions!
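
One such adaptation, sometimes called the “double MAD”, computes a separate MAD for the points below and above the median, giving each tail of an asymmetric distribution its own scale. A rough sketch, under the assumption that this is the adaptation we would reach for:

```python
import numpy as np

def double_mad_outliers(values, threshold=3.5):
    """Modified Z-scores with a separate MAD for each side of the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad_low = np.median(np.abs(values[values <= median] - median))
    mad_high = np.median(np.abs(values[values >= median] - median))
    # Scale each point by the MAD of its own side of the median
    mad = np.where(values <= median, mad_low, mad_high)
    m = 0.6745 * (values - median) / mad
    return np.abs(m) > threshold
```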

In conclusion:

In this post we discussed the limitations of percentile-based methods for univariate outlier detection, and presented an alternative that is better suited to larger datasets.

But what if we had a client with 25 years of work experience, but only 20 years of age? Univariate outlier detection methods would not detect this very clear outlier, because they consider only one variable at a time.

In an upcoming blog post we will discuss the detection of multivariate outliers, using not only classical statistical techniques, but also more recent machine learning approaches!

References / Further reading

  1. https://stackoverflow.com/questions/22354094/pythonic-way-of-detecting-outliers-in-one-dimensional-observation-data
  2. https://www.pdf-archive.com/2016/07/29/outlier-methods-external/outlier-methods-external.pdf
  3. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
  4. https://www.bis.org/bcbs/publications.htm?m=3%7C14%7C566
  5. Boris Iglewicz and David Hoaglin (1993), “Volume 16: How to Detect and Handle Outliers”, The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
