Outlier Detection in Data Preprocessing

Wojtek Fulmyk, Data Scientist
4 min read · Aug 2, 2023

Article level: Intermediate

My clients often ask me about the specifics of certain data preprocessing methods, why they're needed, and when to use them. I will discuss a few common (and not-so-common) preprocessing methods in a series of articles on the topic.

In this preprocessing series:

Data Standardization — A Brief Explanation — Beginner
Data Normalization — A Brief Explanation — Beginner
One-hot Encoding — A Brief Explanation — Beginner
Ordinal Encoding — A Brief Explanation — Beginner
Missing Values in Dataset Preprocessing — Intermediate
Text Tokenization and Vectorization in NLP — Intermediate
Outlier Detection in Dataset Preprocessing — Intermediate
Feature Selection in Data Preprocessing — Advanced

In this short writeup I will explain how to find outliers in your dataset. Some understanding of a few technical terms will be helpful, so I have included short explanations of the more complicated terminology. Give it a go, and if you need more info, just ask in the comments section!

outliers — Data points far from other observations, differing significantly.

abnormal data points — Values markedly distinct from the dataset’s typical values.

skewing ML analysis — Distorting machine learning model performance and accuracy.

Statistical methods — Mathematical techniques to describe, analyze, and interpret data.

z-scores — Standardized values measuring a data point's distance from the mean in standard deviations.

IQR — Measure of dataset spread using upper and lower quartiles.

isolation forest — Algorithm isolating anomalies by randomly partitioning data.

autoencoder — Neural network compressing and reconstructing data to find anomalies.

Gaussian — Normal distribution; a bell-shaped curve symmetric around the mean.

least squares — Optimization method minimizing sum of squared residuals.

Outlier Detection

Identifying and dealing with outliers is a key part of data analysis. Outlier detection refers to identifying data points that differ significantly from the majority of your data. These outliers can be abnormal data points, fraudulent transactions, faulty sensor readings, etc. Detecting outliers is an important part of data cleaning, as it helps avoid skewing ML analysis.

There are various statistical and ML techniques for detecting outliers. Statistical methods rely on things like mean, standard deviation, quantiles, etc. to identify outliers. Machine learning methods use things like isolation forests, one-class SVMs, autoencoders, etc. With so many choices, the technique depends on factors like data size, type of anomaly, how the anomalies will be treated, etc.

Statistical Methods

For smaller datasets, simple statistical methods like z-scores and quantile ranges can be used to identify outliers. For example, the z-score measures how many standard deviations an observation lies from the mean; a threshold like |z| > 3 can be used to flag potential outliers.
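Here is a minimal sketch of the z-score approach (the injected value of 8.0 and the sample size are just for illustration):

import numpy as np

rng = np.random.default_rng(0)

# 200 roughly standard-normal points plus one injected outlier
data = np.append(rng.normal(0, 1, 200), 8.0)

# z-score: signed number of standard deviations from the mean
z_scores = (data - data.mean()) / data.std()

# flag observations beyond the z=3 threshold; the injected 8.0
# (and any chance extremes) will be printed
print(data[np.abs(z_scores) > 3])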

Another commonly used method employs the interquartile range: the IQR is the spread between the 1st and 3rd quartiles, and any observation more than 1.5 × IQR below the 1st quartile or above the 3rd quartile can be considered an outlier.
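A minimal sketch of the IQR rule follows (the injected outlier of -9.0 is illustrative):

import numpy as np

rng = np.random.default_rng(1)
data = np.append(rng.normal(0, 1, 200), -9.0)

# quartiles and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's fences: flag anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])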

Machine Learning Methods

Machine learning models like isolation forests, one-class SVMs, or autoencoders are excellent at outlier detection. Isolation forests isolate anomalies rather than simply profiling normal data: they build decision trees that partition the data recursively, and because outliers are few and different, they get isolated in fewer partitions. One-class SVMs learn a boundary around normal data points; new samples outside the boundary are flagged as anomalies. Autoencoders learn compressed representations of data, so samples with high reconstruction error are potential outliers.
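To make the isolation-forest idea concrete, here is a short sketch using scikit-learn's IsolationForest (the planted anomaly coordinates and the contamination value are illustrative guesses):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)

# a 2-D normal cluster plus two planted anomalies
X = np.vstack([rng.normal(0, 1, (200, 2)),
               [[6.0, 6.0], [-7.0, 5.0]]])

# contamination is our guess at the fraction of outliers
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)  # -1 = outlier, 1 = inlier

print(X[labels == -1])  # should recover the planted points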

Useful Python Code

I will show you how to find outliers using scikit-learn. I will use the EllipticEnvelope estimator, which is often used on Gaussian data. For demonstrative purposes, I will only find one outlier per column in the generated sample df. Finally, just to clearly show you where the outliers lie, I will recreate the df with only the outliers visible.

import pandas as pd
import numpy as np
from sklearn.covariance import EllipticEnvelope

# define model
outlier_detector = EllipticEnvelope(contamination=0.01)

# sample dataframe; some values will naturally become outliers
df = pd.DataFrame(np.random.randn(6, 3))

# identify outliers
outlier_values = pd.DataFrame(columns=['column', 'value'])
for col in df.columns:
    # fit on the column directly
    outlier_detector.fit(df[col].values.reshape(-1, 1))
    # predict the outliers (-1 = outlier, 1 = inlier)
    y_pred = outlier_detector.predict(df[col].values.reshape(-1, 1))
    # store outlier rows and values
    if -1 in y_pred:
        outlier_row = df[y_pred == -1].index[0]
        outlier_values = pd.concat([outlier_values,
            pd.DataFrame({'column': col, 'value': df.loc[outlier_row, col]},
                         index=[0])])

# create empty mask
mask = pd.DataFrame(False, index=df.index, columns=df.columns)

# set mask to True where values match outliers
for row in outlier_values.itertuples():
    mask.loc[df[row.column] == row.value, row.column] = True

# mask all values except outliers
df_outliers = df.mask(~mask)

# show results
print(outlier_values)
print(df_outliers)

This will output the following (values will vary):

  column     value
0      0  1.523088
0      1 -0.611936
0      2 -0.187589

          0         1         2
0       NaN       NaN       NaN
1       NaN       NaN       NaN
2       NaN       NaN       NaN
3       NaN       NaN -0.187589
4  1.523088       NaN       NaN
5       NaN -0.611936       NaN

And that’s all! I will leave you with some “fun” trivia 😊

Trivia

  • One early use of outlier detection happened in 1897, when astronomer Friedrich Winnecke used least squares and outlier rejection to determine the orbital paths of many celestial objects. By removing outlier measurements that deviated from orbital calculations, he was able to reduce errors in estimating the orbit parameters. This helped to accurately determine the trajectories of, among others, the comets Swift and Barnard.
  • Outlier detection is a primary tool used in fraud detection — spotting strange credit card transactions or insurance claims that don’t fit expected patterns. For credit cards, things like purchase amounts, location, times, etc. are analyzed to detect outliers differing from normal user patterns.

Wojtek Fulmyk, Data Scientist

Data Scientist, University Instructor, and Chess enthusiast. ML specialist.