OUTLIERS

kalla Srikanth
5 min readSep 18, 2022

--

outlier is an extremely high or extremely low data points relative to the nearest data point and the rest of the neighboring co-existing values in a data graph or dataset you’re working with. Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.

How the Outliers Arise

Outliers arise due to changes in system behavior, human error, instrument error or simply through natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined.

Effects of outliers

An outlier is an unusually large or small observation. Outliers can have a disproportionate effect on statistical results, such as the mean, which can result in misleading interpretations.

An outlier is a reason to variance, which are misleads the data and the model will not performs well.

Detecting of Outliers

Visualization Method

In visualization the boxplots are helpful to detecting the outliers, the box plots are very efficient in finding the outliers of data.

  • Median (Q2/50th percentile): The middle value of the data set
  • First Quartile (Q1/25th percentile): The middle number between the smallest number (not the “minimum”) and the median of the data set
  • Third Quartile (Q3/75th percentile): The middle value between the median and the highest value (not the “maximum”) of the dataset
  • Interquartile Range (IQR): 25th to the 75th percentile
  • Outliers: A value that “lies outside” (is much smaller or larger than) most of the other values in a set of data
  • “maximum”: Q3 + 1.5*IQR
  • “minimum”: Q1 -1.5*IQR

Example on a working dataset

The out side black dots are outliers of data.

Statistical Test ( Z scores ) Method

Z score is an important concept in statistics. Z score is also called standard score. This score helps to understand if a data value is greater or smaller than mean and how far away it is from the mean. More specifically, Z score tells how many standard deviations away a data point is from the mean.Z

score = (x -mean) / std. deviation

A normal distribution is shown below and it is estimated that
68% of the data points lie between +/- 1 standard deviation.
95% of the data points lie between +/- 2 standard deviation
99.7% of the data points lie between +/- 3 standard deviation

Z score and Outliers:
If the z score of a data point is more than 3 & less than -3 , it indicates that the data points is quite different from the other data points. Such a data points are called as outliers.

Working Example :

The total outliers followed as

Inter Quartile Range (IQR) Method

One common technique to detect outliers is using IQR (interquartile range). In specific, IQR is the middle 50% of data, which is Q3-Q1. Q1 is the first quartile, Q3 is the third quartile, and quartile divides an ordered dataset into 4 equal-sized groups. In Python, we can use percentile function in NumPy package to find Q1 and Q3

example on working data

From the image the data points which are above the value 73.77 and below the value 62.85 are the outliers of the particular data.

These are the three ways to detecting the outliers .

Treating the outliers

Arbitrary Outliers Method

The Arbitrary Outlier Capper() caps the maximum or minimum values of a variable at an arbitrary value indicated by the user.

The user must provide the maximum or minimum values that will be used to cap each variable in a dictionary {feature: capping value}

Example for worked data

In this example the previously shown outliers are capped to maximum and minimum values of data.

Winsorizing method ( inter-quantile range proximity rule)

IQR limits:

  • right tail: 75th quantile + 3* IQR
  • left tail: 25th quantile — 3* IQR

where IQR is the inter-quartile range: 75th quantile — 25th quantile.

The method will Automatically calculates the IQR ranges , and the maximum minimum values of particular given data, and caps automatically the outliers to max ,min values or according to the parameter fold.

fold gives the value to multiply the IQR.

Example on worked data set

Winsorizing method ( Gaussian approximation)

Gaussian limits:

  • right tail: mean + 3* std
  • left tail: mean — 3* std

The method will Automatically calculates the standard deviation , and the maximum minimum values of particular given data, and caps automatically the outliers to max ,min values or according to the parameter fold.

fold gives the value to multiply the std.

Example on worked data set

Winsorizing method (percentiles)

percentiles or quantiles:

  • right tail: 95th percentile
  • left tail: 5th percentile

The method will Automatically calculates the percentiles, and the maximum minimum values of particular given data, and caps automatically the outliers to max ,min values or according to the parameter fold.

fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.

Example on worked data set

--

--