# Unboxing Outliers In Machine Learning

Outlier is a term used by data analysts and data scientists. Outliers need close attention because, if they are not handled properly, they can produce misleading results: they can skew measurements so that the results are no longer representative of the actual numbers.

**What is an Outlier?**

In statistics, an **outlier** is an observation point that is distant from other observations. Put simply, it is an abnormal observation among the normal observations in a sample drawn from a population.

**It is also defined as:**

*An outlier is an observation that appears far away and diverges from the overall pattern in a sample.*

**Understanding Outlier with 2 Examples:**

1) Suppose you have a sample of 1000 people, and each of them has to choose one colour, Red or Blue.

If 999 choose Red and only one person chooses Blue, I would say that person is an outlier for that sample.

2) Suppose we do customer profiling and find that the average annual income of customers is $0.8 million, but two customers have annual incomes of $4 million and $4.2 million. These two customers’ annual incomes are much higher than those of the rest of the population, so these two observations will be seen as outliers.

**Types of Outliers:**

**Univariate Outlier:** A univariate outlier is a data point that has an extreme value on one variable.

**Multivariate Outlier:** A multivariate outlier is a combination of unusual scores on at least two variables.

Let’s try to understand this with an example from **Analytics Vidhya**.

Let us say we are studying the relationship between height and weight. Below, we have the univariate and bivariate distributions for height and weight. Take a look at the box plot: we do not have any outliers (no value lies more than 1.5*IQR below the first quartile or above the third quartile, which is the most common method). Now look at the scatter plot: here, we have two values below and one above the average in a specific segment of weight and height.
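To make the univariate versus multivariate distinction concrete, here is a minimal sketch in Python. The height and weight pairs are made up for illustration (they are not the Analytics Vidhya data), and scoring the pairs with the squared Mahalanobis distance is an assumption of this sketch: it is one common way to measure how unusual a combination of values is, not a method the example above prescribes.

```python
from statistics import mean

# Hypothetical (height in cm, weight in kg) pairs; the last one is tall but light.
points = [(150, 50), (155, 56), (160, 59), (165, 66), (170, 70),
          (175, 74), (180, 81), (185, 84), (190, 90), (185, 55)]

def mahalanobis_sq(pairs):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = mean(xs), mean(ys)
    n = len(pairs)
    # Sample variances and covariance (divide by n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in pairs:
        dx, dy = x - mx, y - my
        # Quadratic form using the closed-form inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

d2 = mahalanobis_sq(points)
suspect = points[d2.index(max(d2))]
print(suspect)
```

The pair (185, 55) has the largest distance even though 185 cm and 55 kg each fall inside the range of the other heights and weights. That is what makes it a multivariate outlier: only the combination is unusual.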

**Causes of Outliers:**

The ideal way to tackle outliers is to find out why they are present in a sample population. The method for dealing with them then depends on the reason for their occurrence. The most common causes of outliers in a data set are:

**· Data entry errors** (human errors)

*Example: The annual income of a customer is $100,000. Accidentally, the data entry operator adds an extra zero to the figure. Now the income becomes $1,000,000, which is 10 times higher. Evidently, this will be an outlier value when compared with the rest of the population.*

**· Measurement errors** (instrument errors)

*Example: There are 10 weighing machines. 9 of them are correct, 1 is faulty. Weights measured on the faulty machine will be higher or lower than those measured for the rest of the people in the group, and can lead to outliers.*

**· Experimental errors** (data extraction or experiment planning/executing errors)

*Example: In a 100m sprint of 7 runners, one runner was not concentrating on the ‘Go’ call and started late. As a result, his run time was longer than that of the other runners, and his total run time can be an outlier.*

**· Intentional** (deliberately misreported values, or dummy outliers made to test detection methods)

*Example: Teens typically under-report the amount of alcohol that they consume. Only a fraction of them report the actual value. Here the actual values might look like outliers because the rest of the teens are under-reporting their consumption.*

**· Data processing errors** (data manipulation errors)

*Example: While extracting data from multiple sources, a manipulation error can occur and cause outliers in the dataset.*

**· Sampling errors** (extracting or mixing data from wrong or various sources)

*Example: We have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.*

**· Natural** (not an error, novelties in data)

When an outlier is not artificial (that is, not due to an error), it is a natural outlier.

*Example: A very hard test is given to a small class of 5 students. The test scores are 50, 50, 50, 50, and 100. The average test score is 60, although almost everyone scored less than that. The score of 100 is an outlier. Either the student is very smart and did well on the test (natural variability) or they cheated (a problem with the test).*
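The arithmetic in this example is easy to check. The short sketch below uses Python's standard `statistics` module to compare the mean with the median. The comparison with the median is an addition for illustration: it shows how a single score of 100 pulls the mean up while the median stays where most of the class is.

```python
from statistics import mean, median

# Test scores from the example above.
scores = [50, 50, 50, 50, 100]

avg = mean(scores)    # pulled upward by the single score of 100
mid = median(scores)  # the middle score, unaffected by the extreme value

print(avg, mid)
```

Here the mean is 60 and the median is 50, so four of the five students scored below the "average".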

# Now the question arises: **how to detect outliers?**

We can use Visualization to detect Outliers.

Various visualization methods, like **Box-plot**, **Histogram** and **Scatter Plot**, can be used.

Some analysts also use various rules of thumb to detect outliers. Some of them are:

*· Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1*

*· Use capping methods. Any value outside the range of the 5th and 95th percentiles can be considered an outlier*

*· Data points three or more standard deviations away from the mean are considered outliers*

*· Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on the business understanding.*
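The three numeric rules of thumb can be sketched in a few lines of Python using only the standard library (`statistics.quantiles` needs Python 3.8 or later). The function names and the sample values are hypothetical, chosen for illustration. The IQR rule here measures 1.5 x IQR outward from the first and third quartiles, and the standard deviation rule uses the population standard deviation; both are assumptions of this sketch.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def percentile_outliers(values, lower=5, upper=95):
    """Values outside the band between the lower and upper percentiles."""
    cuts = quantiles(values, n=100)  # 99 cut points; cuts[p - 1] is the p-th percentile
    low, high = cuts[lower - 1], cuts[upper - 1]
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values at least `threshold` standard deviations away from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) >= threshold * sigma]

# Hypothetical sample: the integers 1 to 20 plus one extreme value.
values = list(range(1, 21)) + [100]

print(iqr_outliers(values))
print(percentile_outliers(values))
print(zscore_outliers(values))
```

Note that the percentile rule always flags the tails of the data, even when nothing is unusual, so it tends to report more points than the other two rules. On very small samples the three-standard-deviation rule may flag nothing at all, because a single extreme value inflates the standard deviation it is measured against.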

**Now let us understand the methods to deal with outliers.** **Kdnuggets.com** explains them in:

**[3 Methods of Dealing Outliers](https://www.kdnuggets.com/2017/01/3-methods-deal-outliers.html)**

I hope the above link on methods of dealing with outliers was useful. Moving further, we should now understand **how to remove outliers**.

The common techniques used to deal with outliers are:

**1) Deleting observations:** We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number. We can also use trimming at both ends to remove outliers.

**2) Transforming and binning values:** Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The Decision Tree algorithm deals with outliers well because it bins variables. We can also assign weights to different observations.

**3) Imputing:** We can also impute outliers, using mean, median or mode imputation. Before imputing values, we should analyse whether the outlier is natural or artificial. If it is artificial, we can go ahead and impute values. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.

**4) Treat outliers separately:** If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining observations as two different groups, build an individual model for each group, and then combine the output.
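The techniques above can be sketched with Python's standard library. The income figures are hypothetical, loosely modelled on the data entry example earlier in which an extra zero is typed. Using the 1.5 x IQR fences to decide which points count as outliers is an assumption of this sketch, not a requirement of the techniques themselves.

```python
import math
from statistics import median, quantiles

# Hypothetical annual incomes; the last value looks like a data entry error.
incomes = [42_000, 48_000, 51_000, 55_000, 58_000,
           61_000, 64_000, 70_000, 1_000_000]

def iqr_fences(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_fences(incomes)

# 1) Deleting observations: keep only the points inside the fences.
trimmed = [v for v in incomes if low <= v <= high]

# 2) Transforming: the natural log compresses the gap to the extreme value.
logged = [math.log(v) for v in incomes]

# 3) Imputing: replace each outlier with the median of the remaining points.
fill = median(trimmed)
imputed = [v if low <= v <= high else fill for v in incomes]

# 4) Treating separately: split the data so each group can be modelled on its own.
outlier_group = [v for v in incomes if not low <= v <= high]

print(trimmed)
print(imputed)
print(outlier_group)
```

The median is used for imputation here because, unlike the mean, it is not pulled toward the extreme value it is meant to replace.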

Let’s take an example of an outlier box plot to understand outliers.

The data set of N = 90 ordered observations as shown below is examined for outliers:

30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322, 336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441

**The computations are as follows:**

**Median** = .5(N+1)th ordered point = 45.5th ordered point = the average of the 45th and 46th ordered points = (559 + 560)/2 = 559.5

**Lower quartile** = .25(N+1)th ordered point = 22.75th ordered point = 411 + .75(436 - 411) = 429.75

**Upper quartile** = .75(N+1)th ordered point = 68.25th ordered point = 739 + .25(752 - 739) = 742.25

**Interquartile range** = 742.25 - 429.75 = 312.5

**Lower inner fence** = 429.75 - 1.5 (312.5) = -39.0

**Upper inner fence** = 742.25 + 1.5 (312.5) = 1211.0

**Lower outer fence** = 429.75 - 3.0 (312.5) = -507.75

**Upper outer fence** = 742.25 + 3.0 (312.5) = 1679.75

**Histogram representation with box plot**

The outlier is identified as the largest value in the data set, **1441**, and appears as the circle to the right of the box plot.
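The computations above can be reproduced with a short Python sketch. It relies on `statistics.quantiles` with `method="exclusive"`, which interpolates at (N+1)-based positions and so matches the quartile definition used in this example. The variable names are illustrative only.

```python
from statistics import quantiles

# The 90 ordered observations from the example above.
data = [
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289,
    305, 306, 322, 322, 336, 346, 351, 370, 390, 404,
    409, 411, 436, 437, 439, 441, 444, 448, 451, 453,
    470, 480, 482, 487, 494, 495, 499, 503, 514, 521,
    522, 527, 548, 550, 559, 560, 570, 572, 574, 578,
    585, 592, 592, 607, 616, 618, 621, 629, 637, 638,
    640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
    766, 792, 792, 794, 802, 818, 830, 832, 843, 858,
    860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441,
]

# Quartiles at the .25(N+1), .5(N+1) and .75(N+1) ordered positions.
q1, med, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fences

# Points beyond the inner fences, and the subset also beyond the outer fences.
beyond_inner = [x for x in data if not inner[0] <= x <= inner[1]]
beyond_outer = [x for x in data if not outer[0] <= x <= outer[1]]

print(q1, med, q3, iqr)
print(inner, outer)
print(beyond_inner, beyond_outer)
```

Only 1441 lies beyond the inner fences, and no point lies beyond the outer fences. The smallest value, 30, stays inside the lower inner fence of -39.0, so it is not flagged.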

**Basic takeaways from above example on Outliers:**

Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points.

References:

https://www.kdnuggets.com/2017/01/3-methods-deal-outliers.html

Thank you for reading … !!