Unearthing Outliers: Methods for Detection and Handling

Akash · Nov 1, 2023

Outliers in data science refer to data points that deviate significantly from the majority of the observations in a dataset. These are data values that are unusually distant from the rest of the data points, either higher or lower, and can have a substantial impact on the analysis and interpretation of the data.

Identifying outliers is crucial in data analysis as they can distort statistical measures and lead to erroneous conclusions. Outliers may arise due to various reasons, including measurement errors, experimental anomalies, or genuine extreme values in the underlying population.

Techniques commonly used for detecting outliers in data:

1. Z-Score Method:

  • The Z-score measures how many standard deviations a data point lies from the mean of the dataset. Points with a high absolute Z-score (typically greater than 2 or 3, depending on how strict you want to be) are considered potential outliers.
  • It is one of the most widely used techniques for identifying outliers, but because the mean and standard deviation are themselves pulled around by extreme values, it works best on roughly normally distributed data.
import numpy as np

def z_score_outliers(data, threshold=2):
    # Z-score: distance from the mean expressed in standard deviations.
    data = np.asarray(data)
    mean = np.mean(data)
    std_dev = np.std(data)
    z_scores = (data - mean) / std_dev
    # A threshold of 3 is common; 2 is used here so this small example flags a point.
    return np.where(np.abs(z_scores) > threshold)[0]

data = np.array([1, 2, 3, 4, 5, 20])
outliers = z_score_outliers(data)
print("Outliers:", data[outliers])

# Output
# Outliers: [20]
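
If SciPy is available, the standard scores can also be obtained from its built-in scipy.stats.zscore helper instead of being computed by hand. A minimal sketch on the same toy data, keeping the threshold of 2 used above:

import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 20])
z_scores = stats.zscore(data)                    # standard score of every point
outlier_idx = np.where(np.abs(z_scores) > 2)[0]  # indices beyond the threshold
print("Outliers:", data[outlier_idx])            # Outliers: [20]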

2. IQR (Interquartile Range) Method:

  • The IQR is the range between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) of the data. Points falling more than a certain multiple of the IQR (typically 1.5) below Q1 or above Q3 are flagged as outliers.
  • Because it is built on quartiles rather than the mean and standard deviation, the IQR method is more robust to extreme values than the Z-score method.
import numpy as np

def iqr_outliers(data, k=1.5):
    data = np.asarray(data)
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    # Points more than k * IQR below Q1 or above Q3 are flagged.
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return np.where((data < lower_bound) | (data > upper_bound))[0]

data = np.array([1, 2, 3, 4, 5, 20])
outliers = iqr_outliers(data)
print("Outliers:", data[outliers])

# Output
# Outliers: [20]
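
If the data already lives in a pandas Series or DataFrame, the same fences can be computed with the quantile method. A minimal sketch, assuming pandas is installed:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 20])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Keep only the points outside Tukey's fences.
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [20]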

3. Box Plot (Tukey’s Fences):

  • A box plot visually displays the distribution of a dataset, showing the median, the interquartile range (IQR), and potential outliers. Tukey’s fences (Q1 − 1.5 × IQR and Q3 + 1.5 × IQR) define the whiskers, and points outside these fences are drawn individually as potential outliers.
  • It is essentially the graphical counterpart of the IQR method above.
import matplotlib.pyplot as plt

def box_plot(data):
    plt.boxplot(data)
    plt.show()

data = [1, 2, 3, 4, 5, 20]
box_plot(data)
(Output: a box plot in which 20 appears as an individual point beyond the upper whisker.)
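
Because plt.boxplot returns the plot elements as a dictionary, the flagged points can also be read back programmatically rather than only inspected by eye. A small sketch of that idea:

import matplotlib.pyplot as plt

data = [1, 2, 3, 4, 5, 20]
result = plt.boxplot(data)
# The "fliers" entry holds the points drawn outside Tukey's fences.
flier_values = result["fliers"][0].get_ydata()
print("Outliers:", list(flier_values))
plt.close()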

4. Scatter Plot:

  • A scatter plot is a useful graphical tool for identifying outliers, especially in bivariate data: points that deviate substantially from the general pattern of the plot stand out visually.
  • Plotting the data in a two-dimensional space can reveal observations that are unusual in the joint distribution even when neither coordinate looks extreme on its own.
import matplotlib.pyplot as plt

def scatter_plot(x, y):
    plt.scatter(x, y)
    plt.show()

x = [1, 2, 3, 4, 5, 20]
y = [2, 3, 4, 5, 6, 30]
scatter_plot(x, y)
(Output: a scatter plot in which the point (20, 30) sits far from the cluster formed by the remaining points.)

5. Histograms and Density Plots:

  • Histograms and density plots provide a visual representation of the distribution of data. Isolated bars or bumps far from the bulk of the distribution, or unusually long tails, may indicate the presence of outliers.
import matplotlib.pyplot as plt

def histogram(x):
    plt.hist(x)
    plt.show()

x = [1, 2, 3, 4, 5, 20]
histogram(x)
(Output: a histogram in which the value 20 appears as an isolated bar far to the right of the bulk of the data.)
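
Since the bullet above also mentions density plots, here is a minimal sketch of one using scipy.stats.gaussian_kde (any kernel density estimator would do); an isolated bump far from the main mass of the curve again hints at an outlier:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = [1, 2, 3, 4, 5, 20]
kde = gaussian_kde(x)                            # kernel density estimate of the sample
grid = np.linspace(min(x) - 2, max(x) + 2, 200)
plt.plot(grid, kde(grid))                        # smooth density; note the small bump near 20
plt.show()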

Handling outliers is an essential step in data preprocessing to ensure that they do not unduly influence the results of your analysis.

1. Data Transformation:

  • Logarithmic Transformation: Take the logarithm of the data (which must be strictly positive) to compress the scale and reduce the impact of extreme values.
  • Square Root Transformation: Similar to the logarithmic transformation, this can help stabilize the variance of right-skewed data.
  • Winsorization: Replace extreme values with a specified percentile value (e.g., the 1st and 99th percentiles).
# Logarithmic Transformation

import numpy as np

def log_transform(data):
    # np.log expects strictly positive values.
    return np.log(data)

data = [1, 2, 3, 4, 5, 20]
transformed_data = log_transform(data)
print("Transformed Data:", transformed_data)
# Square Root Transformation

import numpy as np

def sqrt_transform(data):
    return np.sqrt(data)

data = [1, 2, 3, 4, 5, 20]
transformed_data = sqrt_transform(data)
print("Transformed Data:", transformed_data)
# Winsorization

import numpy as np

def winsorize(data, lower_percentile=1, upper_percentile=99):
    data = np.asarray(data, dtype=float)
    lower_bound = np.percentile(data, lower_percentile)
    upper_bound = np.percentile(data, upper_percentile)
    # Clip extreme values to the chosen percentile bounds.
    data[data < lower_bound] = lower_bound
    data[data > upper_bound] = upper_bound
    return data

x = [1, 2, 3, 4, 5, 20]
transformed_x = winsorize(x)
print("Transformed Data:", transformed_x)

2. Data Imputation:

  • For some datasets, imputation might be appropriate. This involves replacing outliers with a statistically derived estimate (e.g., mean, median) of the non-outlier data.
import numpy as np

def impute_outliers(data):
    mean = np.mean(data)
    std_dev = np.std(data)
    # Replace any value more than 2 standard deviations from the mean with the mean.
    return [x if mean - 2 * std_dev <= x <= mean + 2 * std_dev else mean for x in data]

data = [1, 2, 3, 4, 5, 20]
imputed_data = impute_outliers(data)
print("Imputed Data:", imputed_data)

3. Capping:

  • Capping involves setting specific lower and upper thresholds and replacing any values beyond them with the threshold itself.
def cap_outliers(data, lower_threshold, upper_threshold):
    # Values outside [lower_threshold, upper_threshold] are pulled back to the nearest threshold.
    return [max(min(x, upper_threshold), lower_threshold) for x in data]

data = [1, 2, 3, 4, 5, 20]
capped_data = cap_outliers(data, lower_threshold=1, upper_threshold=10)
print("Capped Data:", capped_data)

Conclusion

Detecting and handling outliers is a crucial step in the data preprocessing pipeline. Outliers, by their very nature, can significantly skew statistical measures and lead to misleading conclusions. Therefore, it’s imperative to employ effective techniques to identify and appropriately address outliers in our datasets.
