Improving Data Quality with Outlier Detection Techniques: A CRISP-DM Approach to Data Preparation

Data Mastery Series — Episode 6: Outlier Detection

Donato_TH
Donato Story
7 min read · Feb 23, 2023


If you are interested in articles related to my experience, please feel free to contact me on LinkedIn: linkedin.com/in/nattapong-thanngam

CRISP-DM framework (Image by Author)

Outlier detection is the process of identifying unusual observations in a dataset that deviate significantly from the expected values. Outliers can occur for various reasons, such as measurement errors, data entry errors, or genuine extreme events. Detecting outliers is important in many fields, including finance, healthcare, and manufacturing, where anomalies can have significant consequences. In this article, we will discuss twelve outlier detection methods: Z-Score, Modified Z-Score, Standard Deviation, Box Plot, Histogram-based, PCA, DBSCAN, Isolation Forest, LOF, Mahalanobis Distance, Cook’s Distance, and One-class SVM.

Causes of Outliers:

Before discussing the outlier detection methods, it is essential to understand the causes of outliers. Outliers can occur due to the following reasons:

  • Measurement errors: These are errors that occur during the data collection process. For example, a sensor malfunctioning can result in an extreme value being recorded.
  • Data entry errors: These are errors that occur during the data entry process. For example, a human operator can mistakenly enter a value that is several orders of magnitude higher or lower than the actual value.
  • Genuine extreme events: These are events that are rare but have a significant impact on the data. For example, a natural disaster can result in a large spike in the number of insurance claims.
  • Fraudulent activity: These are intentional attempts to manipulate data for personal gain. For example, a fraudulent financial transaction may result in an extreme value that needs to be identified and flagged.

Benefits of Outlier Detection:

  • Helps identify and flag unusual data points that may have a negative impact on statistical analyses, machine learning models, and other data-driven applications.
  • Enables the detection of anomalies and errors in datasets that may otherwise go unnoticed, leading to more accurate and reliable results.
  • Helps to uncover valuable insights and opportunities that may be hidden in the data, such as rare events, unique patterns, and unexpected correlations.
  • Improves the quality of decision making and risk management by identifying outliers that may indicate fraudulent activity, safety hazards, or other high-risk situations.
  • Increases the efficiency of data analysis by reducing the amount of irrelevant or misleading data that needs to be processed.
  • Enables better data visualization and communication by highlighting important features and trends in the data that may be hidden by outliers.

Examples of applications:

  • Fraud detection in financial transactions
  • Anomaly detection in system logs and network traffic
  • Quality control in manufacturing processes
  • Medical diagnosis based on patient data
  • Cybersecurity threat detection
  • Outlier removal in data cleaning and preprocessing
  • Environmental monitoring for pollution or contamination
  • Detection of rare events in scientific research
  • Predictive maintenance in industrial equipment and machinery
  • Anomaly detection in social media and web analytics.

Data Set:

  • Assume the following dataset:
Example Dataset (Image by Author)
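Since the example dataset appears here only as an image, the code sketches in the rest of this article use a small hypothetical stand-in: a univariate sample `data` with one obvious extreme value, and a two-feature sample `X` for the multivariate methods. These values are illustrative assumptions, not the article’s actual dataset.

```python
import numpy as np

# Hypothetical univariate sample: 19 typical values plus one extreme value (45)
data = np.array([12, 13, 12, 14, 13, 12, 13, 12, 14, 13,
                 12, 13, 14, 12, 13, 12, 14, 13, 12, 45], dtype=float)

# Hypothetical two-feature sample for the multivariate methods, with one extreme row
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)
```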

12 Outlier Detection Methods:

12 Outlier Detection Methods (Image by Author)
  • Z-Score:
    The Z-Score method is a simple and widely used method for detecting outliers. It standardizes each observation by subtracting the mean and dividing by the standard deviation, and flags observations that fall too many standard deviations from the mean. Typically, a data point with an absolute Z-score greater than a threshold (e.g., 3) is considered an outlier. A Z-score threshold of 3 is assumed for the following outlier detection.
Outlier detection by Z-Score (Image by Author)
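A minimal sketch of the Z-Score rule, using the hypothetical sample from the Data Set section and SciPy’s `zscore` helper; the threshold of 3 follows the text.

```python
import numpy as np
from scipy.stats import zscore

# Same hypothetical sample as in the Data Set section
data = np.array([12, 13, 12, 14, 13, 12, 13, 12, 14, 13,
                 12, 13, 14, 12, 13, 12, 14, 13, 12, 45], dtype=float)

z = np.abs(zscore(data))   # |(x - mean) / std| for each point
outliers = data[z > 3]     # Z-score threshold of 3, as in the text
print(outliers)
```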
  • Modified Z-Score:
    The Modified Z-Score method is a variant of the Z-Score method that uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation. The MAD is less sensitive to outliers than the standard deviation and provides a robust estimate of the spread of the data. A threshold of 3 is assumed for the following outlier detection.
Outlier detection by Modified Z-Score (Image by Author)
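A sketch of the Modified Z-Score; the 0.6745 scaling constant comes from the standard Iglewicz–Hoaglin formulation and is an assumption, not a value from the article.

```python
import numpy as np

# Same hypothetical sample as in the Data Set section
data = np.array([12, 13, 12, 14, 13, 12, 13, 12, 14, 13,
                 12, 13, 14, 12, 13, 12, 14, 13, 12, 45], dtype=float)

median = np.median(data)
mad = np.median(np.abs(data - median))       # Median Absolute Deviation
modified_z = 0.6745 * (data - median) / mad  # 0.6745 scales MAD to match the normal std
outliers = data[np.abs(modified_z) > 3]      # threshold of 3, as in the text
print(outliers)
```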
  • Standard Deviation:
    The Standard Deviation method is another popular method for outlier detection. A data point whose distance from the mean is greater than a threshold (e.g., 3 times the standard deviation) is considered an outlier. A threshold of 3 standard deviations is assumed for the following outlier detection.
Outlier detection by Standard Deviation (Image by Author)
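A sketch of the standard-deviation cut-off; numerically it mirrors the Z-Score rule, just expressed as bounds around the mean.

```python
import numpy as np

# Same hypothetical sample as in the Data Set section
data = np.array([12, 13, 12, 14, 13, 12, 13, 12, 14, 13,
                 12, 13, 14, 12, 13, 12, 14, 13, 12, 45], dtype=float)

mean, std = data.mean(), data.std()
lower, upper = mean - 3 * std, mean + 3 * std    # 3-standard-deviation band, as in the text
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```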
  • Box Plot:
    The Box Plot method is a graphical method for outlier detection that uses the interquartile range (IQR) to identify outliers. A box plot represents the distribution of the data using a box and whiskers. The box represents the IQR (the range between the 25th and 75th percentiles), while the whiskers typically extend to 1.5 times the IQR beyond the quartiles. A data point outside the whiskers is considered an outlier.
Outlier detection by Box Plot (Image by Author)
Box Plot Visualization (Image by Author)
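A sketch of the IQR rule behind the box plot; the 1.5 × IQR whisker multiplier is the common convention and an assumption here.

```python
import numpy as np

# Same hypothetical sample as in the Data Set section
data = np.array([12, 13, 12, 14, 13, 12, 13, 12, 14, 13,
                 12, 13, 14, 12, 13, 12, 14, 13, 12, 45], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # conventional whisker bounds
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```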
  • Histogram-based:
    The Histogram-based method is a non-parametric method that uses the frequency distribution of the data to identify outliers. The method divides the data into bins and counts the number of data points in each bin. A data point that falls in a bin with a low frequency is considered an outlier. Upper and lower thresholds of 2 are assumed for the following outlier detection.
Outlier detection by Histogram-based (Image by Author)
Histogram Visualization (Image by Author)
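A sketch of a histogram-based check; the bin count (10) and the bin-frequency threshold of 2 are assumptions for illustration.

```python
import numpy as np

# Same hypothetical sample as in the Data Set section
data = np.array([12, 13, 12, 14, 13, 12, 13, 12, 14, 13,
                 12, 13, 14, 12, 13, 12, 14, 13, 12, 45], dtype=float)

counts, edges = np.histogram(data, bins=10)
bin_idx = np.digitize(data, edges[1:-1])    # histogram bin index of each point (0..9)
sparse_bins = np.where(counts <= 2)[0]      # bins containing 2 or fewer points
outliers = data[np.isin(bin_idx, sparse_bins)]
print(outliers)
```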
  • Principal Component Analysis (PCA):
    PCA is a dimensionality reduction method that can also be used for outlier detection. PCA transforms the data into a new set of uncorrelated variables (principal components) that capture the most significant variation in the data. Outliers can then be identified by examining the principal-component scores: a data point whose scores lie far from the mean of the scores is considered an outlier. A percentile threshold of 95 is assumed for the following outlier detection.
Outlier detection by PCA (Image by Author)
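A sketch using scikit-learn’s PCA on the hypothetical two-feature sample; the 95th-percentile cut-off on the distance of each score vector from the mean score follows the text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same hypothetical two-feature sample as in the Data Set section
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)

scores = PCA(n_components=2).fit_transform(X)                # principal-component scores
dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)  # distance from the mean score
threshold = np.percentile(dist, 95)                          # 95th-percentile cut-off, as in the text
outliers = X[dist > threshold]
print(outliers)
```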
  • DBSCAN:
    DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method that can also be used for outlier detection. The method identifies dense regions of the data and labels the data points that are not part of any dense region as outliers. DBSCAN requires two parameters: the minimum number of points required to form a dense region (minPts) and the maximum distance between neighboring points in a dense region (epsilon). minPts and epsilon are assumed to be 3 and 0.8 for the following outlier detection.
Outlier detection by DBSCAN (Image by Author)
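A sketch with scikit-learn’s DBSCAN; eps = 0.8 and min_samples = 3 follow the text, and the features are standardized first (an assumption, since distance-based parameters depend on scale).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Same hypothetical two-feature sample as in the Data Set section
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)             # scale so eps is comparable across features
labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X_scaled)
outliers = X[labels == -1]                               # DBSCAN labels noise points as -1
print(outliers)
```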
  • Isolation Forest:
    The Isolation Forest method is an ensemble method for outlier detection that uses decision trees to isolate outliers. It works by randomly selecting a feature and a split value at each node of each tree; observations that require fewer splits to be isolated are more likely to be outliers.
Outlier detection by Isolation Forest (Image by Author)
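A sketch with scikit-learn’s IsolationForest; the contamination rate of 0.1 and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same hypothetical two-feature sample as in the Data Set section
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)

iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(X)    # -1 = outlier, 1 = inlier
outliers = X[labels == -1]
print(outliers)
```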
  • LOF:
    LOF (Local Outlier Factor) is a density-based method that measures the local density of a data point relative to that of its neighbors. A data point with a significantly lower density than its neighbors is considered an outlier. A percentile threshold of 10 is assumed for the following outlier detection.
Outlier detection by LOF (Image by Author)
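A sketch with scikit-learn’s LocalOutlierFactor; contamination = 0.1 mirrors the 10th-percentile cut-off mentioned above, and n_neighbors = 5 is an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Same hypothetical two-feature sample as in the Data Set section
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
labels = lof.fit_predict(X)    # -1 = outlier, 1 = inlier
outliers = X[labels == -1]
print(outliers)
```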
  • Mahalanobis Distance:
    The Mahalanobis Distance method is a multivariate method for outlier detection that takes into account the correlations between features. It measures the distance of a data point from the mean of the data in units of standard deviation, using the covariance matrix of the data. Observations with a high Mahalanobis distance are considered outliers.
Outlier detection by Mahalanobis Distance (Image by Author)
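A sketch of the Mahalanobis distance check; the chi-square cut-off at the 97.5% quantile is one common choice and an assumption here, not a value from the article.

```python
import numpy as np
from scipy.stats import chi2

# Same hypothetical two-feature sample as in the Data Set section
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)

diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
md_sq = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distance per row
threshold = chi2.ppf(0.975, df=X.shape[1])             # chi-square cut-off (common convention)
outliers = X[md_sq > threshold]
print(outliers)
```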
  • Cook’s Distance:
    Cook’s Distance is a regression-based method for outlier detection that measures the influence of a data point on a regression model. Conceptually, it compares the model fitted with and without each observation and measures the resulting change in the parameter estimates. Observations with a large Cook’s distance are considered influential and potentially problematic.
Outlier detection by Cook’s distance (Image by Author)
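A sketch with statsmodels; the regression sample is hypothetical and the 4/n cut-off is a common rule of thumb, not a value from the article.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical regression sample with one influential point at the end
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9, 12.2, 14.1, 16.0, 17.9, 35.0])

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]  # Cook's distance per observation
influential = np.where(cooks_d > 4 / len(x))[0]    # 4/n rule of thumb
print(influential)
```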
  • One-class SVM:
    The One-Class SVM method is a support vector machine-based method for outlier detection that learns a decision boundary separating the training data from the origin in the feature space. Observations that fall on the origin side of the boundary are considered outliers.
Outlier detection by One-class SVM (Image by Author)
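A sketch with scikit-learn’s OneClassSVM; the RBF kernel and nu = 0.1 (roughly the expected outlier fraction) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Same hypothetical two-feature sample as in the Data Set section
X = np.array([[12, 7], [13, 8], [12, 7], [14, 9], [13, 8],
              [12, 7], [13, 9], [14, 8], [12, 8], [45, 30]], dtype=float)

oc_svm = OneClassSVM(kernel='rbf', nu=0.1, gamma='scale')
labels = oc_svm.fit_predict(X)    # -1 = outlier, 1 = inlier
outliers = X[labels == -1]
print(outliers)
```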

Outlier detection is an essential step in data preprocessing and analysis, as outliers can have a significant impact on data analysis and modeling. In this article, we have reviewed several outlier detection methods, including Z-score, Modified Z-score, Standard deviation, Box plot, Histogram-based, PCA, DBSCAN, Isolation Forest, LOF, Mahalanobis distance, One-class SVM, and Cook’s distance. Each method has its strengths and weaknesses and should be selected based on the characteristics of the data and the application. By using appropriate outlier detection methods, data analysts and researchers can obtain more accurate and reliable results from their data.

Please feel free to contact me; I am happy to share and exchange ideas on topics related to Data Science and Supply Chain.
Facebook:
facebook.com/nattapong.thanngam
Linkedin:
linkedin.com/in/nattapong-thanngam

Donato_TH
Donato Story

Data Science Team Lead at Data Cafe, Project Manager (PMP #3563199), Black Belt-Lean Six Sigma certificate