Outliers in data and ways to detect them.
What actually is an outlier?
In plain English, an outlier is the odd one out in a series of data: an observation that differs markedly from most of the other data points in the sample. It can be either a very large or a very small value. Because of their extreme nature, outliers can bias the summary statistics of the data, which in turn affects any statistical or ML models built on it.
Detecting and dealing with outliers is one of the most important phases of data cleansing.
For example, consider the table below. Since the table is very small, a single look tells us that 10000 is an outlier. Real-world datasets, however, are usually far too large for outliers to be spotted at a glance.
The mean of the above observations is 1307, which is higher than most of the values in the table. The mean is the arithmetic average and is generally taken to represent the centre of the data. Here, however, 1307 is nowhere near the centre, and the culprit is the single extreme observation 10000. Hence, 10000 can be termed an outlier that distorts the actual structure of the data.
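To see how a single extreme value drags the mean while leaving the median almost untouched, here is a minimal sketch. The eight values are hypothetical and chosen only for illustration; they are not the values from the table above.

```python
from statistics import mean, median

# Hypothetical values for illustration only (not the article's table).
values = [50, 55, 60, 62, 65, 70, 75, 10000]

mean_with = mean(values)      # pulled far upwards by the single value 10000
median_with = median(values)  # stays in the middle of the bulk of the data

# Drop the extreme value and recompute the mean.
trimmed = [v for v in values if v != 10000]
mean_without = mean(trimmed)

print(f"mean with outlier:    {mean_with:.1f}")
print(f"median with outlier:  {median_with:.1f}")
print(f"mean without outlier: {mean_without:.1f}")
```

With these made-up numbers the mean is about 1304.6, larger than every value except the outlier itself, while the median is 63.5.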
Outliers can be univariate or multivariate.
Ø Univariate outliers are extreme values of a single variable, for example 10000 in the example above.
Ø Multivariate outliers are unusual combinations of values across two or more variables. Scatter plots, which show the relationship between the response variable and one or more predictor variables, are the most common tool in multivariate settings. A point may fall within the expected range of both the predictor variable (x-axis) and the response variable (y-axis) and still be an outlier because it does not fit the model, i.e. it lies far from the regression line. Unlike univariate outliers, multivariate outliers are therefore not necessarily extreme data points.
In this article, we will focus on a few of the ways to detect univariate outliers.
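As a brief illustration of the multivariate case, the sketch below fits a straight line with NumPy and looks at the residuals. The data are hypothetical: ten points on the line y = 2x, with one point moved off the line while staying inside the range of both variables.

```python
import numpy as np

# Hypothetical data: y = 2x, except for one point that is moved off the line.
x = np.arange(1, 11, dtype=float)
y = 2 * x
y[4] = 17.0  # at x = 5 the line would give 10; 17 is still within the range of y

# Fit a simple regression line and measure how far each point lies from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# The point with the largest absolute residual is the one that fits the line worst.
suspect = int(np.argmax(np.abs(residuals)))
print(f"suspect point: x={x[suspect]}, y={y[suspect]}, residual={residuals[suspect]:.2f}")
```

The flagged point (x = 5, y = 17) is not extreme on either axis on its own; it only stands out relative to the fitted line.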
Now that we know what outliers are, let us look at the various ways to identify them.
1. The simplest way to detect an outlier is to graph the features or the data points. Visualization is one of the best and easiest ways to get a feel for the overall data and its outliers. Scatter plots and box plots are the preferred visualization tools for detecting outliers.
· Scatter plots: A scatter plot makes it easy to see whether a dataset or a particular feature contains outliers.
Ø In the image below, I have used a dataset called “House Price prediction” as an example. The source of the dataset is Kaggle.
Ø As is clearly visible in the graph, the dependent variable “SalePrice” is concentrated where the feature LotArea lies roughly between 0 and 55000. The points with LotArea above 150000 are clear outliers, as they can distort the statistics that describe the overall structure of the data.
Ø Hence, we can draw scatter plots for every feature of the dataset that we suspect may contain outliers.
· Box plots: A box plot is another very simple visualization tool for detecting outliers. It is based on the interquartile range (IQR) technique.
Ø In the graph below, a few of the outliers are highlighted with red circles.
Here we have plotted SalePrice against the lot configuration (LotConfig) for each year.
Example outlier in the graph below: the “Corner” LotConfig has an outlier with SalePrice greater than 700000 for the year 2007.
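Plots like the two described above can be sketched with matplotlib. The data below are synthetic stand-ins generated with NumPy, not the Kaggle dataset; the column names LotArea and SalePrice are borrowed only for the axis labels, and the two very large lots are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (NOT the Kaggle dataset): 200 ordinary lots plus
# two hypothetical, very large lots.
rng = np.random.default_rng(0)
lot_area = rng.normal(10_000, 2_500, size=200)
sale_price = 15 * lot_area + rng.normal(0, 20_000, size=200)
lot_area = np.append(lot_area, [160_000, 215_000])
sale_price = np.append(sale_price, [280_000, 375_000])

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: the two large lots sit far to the right of the main cloud.
ax_scatter.scatter(lot_area, sale_price, s=10)
ax_scatter.set_xlabel("LotArea")
ax_scatter.set_ylabel("SalePrice")

# Box plot: points beyond 1.5 * IQR from the box are drawn as "fliers".
box = ax_box.boxplot(lot_area)
ax_box.set_ylabel("LotArea")

# The flier coordinates can be read back from the returned artists.
flier_values = box["fliers"][0].get_ydata()
print(sorted(flier_values)[-2:])

# Call fig.savefig(...) or plt.show() (with an interactive backend) to view the figure.
```

Reading the fliers back from the box plot gives the same points that the IQR technique would flag, which is useful when the figure itself is too crowded to inspect by eye.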
· Histograms: Histograms can also be used to identify outliers. In a histogram, outliers show up as isolated bars.
Ø If we take our initial example of eight numbers, we can clearly see that the outlier 10000 sits in an isolated bar on the far right, while the remaining seven data points fall together in the bar on the left.
Ø Histograms are generally used in univariate settings, where we graph the distribution of a single variable (the numbers, in our case) and identify the outlier (10000) that falls outside the bulk of the distribution, as shown below.
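The "isolated bar" idea can also be checked numerically. The sketch below bins eight hypothetical values (again, not the article's table) with NumPy and counts the empty bins that separate the occupied ones.

```python
import numpy as np

# Hypothetical values for illustration only.
values = np.array([50, 55, 60, 62, 65, 70, 75, 10000])

# Ten equal-width bins spanning the range of the data.
counts, edges = np.histogram(values, bins=10)

# Indices of the bins that actually contain data.
nonempty = np.flatnonzero(counts)

# Number of empty bins between consecutive occupied bins; a long run of
# empty bins is what makes a bar look isolated in the plot.
gaps = np.diff(nonempty) - 1

print("counts per bin:", counts.tolist())
print("empty bins between occupied bins:", gaps.tolist())
```

Here seven values land in the first bin, one value lands in the last bin, and the eight bins in between are empty.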
2. Interquartile range (IQR) technique: This method finds the lower and upper boundaries beyond which data points are treated as outliers.
The IQR covers the middle 50% of the dataset, i.e. the values between the first quartile (Q1) and the third quartile (Q3), and is calculated as:
IQR = Q3 - Q1
The median is the midpoint of the values under observation: once the data are sorted, half of the values lie below it and half above it.
With the data sorted in ascending order, the lower half of the dataset is the set of all values below the median, and the upper half is the set of all values above the median.
The first quartile, denoted by Q1, is the median of the lower half of the dataset. Also known as the 25th percentile, it indicates that about 25% of the values in the dataset lie below Q1 and about 75% lie above Q1.
Similarly, the third quartile, denoted by Q3, is the median of the upper half of the dataset. Also known as the 75th percentile, it indicates that about 75% of the values in the dataset lie below Q3 and about 25% lie above Q3.
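The "median of each half" definition can be written out directly. This is a minimal sketch using only the standard library; the function name quartiles_by_halves and the sample values are hypothetical. For an odd number of observations it leaves the median itself out of both halves, which is one common convention among several.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using the median-of-halves definition."""
    ordered = sorted(data)  # ascending order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle element so it belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

# Hypothetical sample values.
q1, q2, q3 = quartiles_by_halves([3, 7, 8, 5, 12, 14, 21, 13, 18])
print(q1, q2, q3, "IQR =", q3 - q1)
```

Library routines such as the pandas quantile method interpolate between neighbouring values by default, so they can return slightly different quartiles from this hand calculation on small samples.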
The same can be easily visualized using a box plot:
The steps below show how to calculate the quartiles and the IQR in Python and, ultimately, detect the outliers.
Suppose I want to find out whether a variable Y from the dataset “df” has any outliers.
Step 1: First, we import the relevant Python libraries such as pandas, NumPy, scikit-learn and SciPy.
Step 2: Import the dataset ‘df’ with the values below:
Step 3: Calculate the first and third quartiles for variable Y.
Q1_Y = df['Y'].quantile(0.25)  # first quartile (25th percentile)
Q3_Y = df['Y'].quantile(0.75)  # third quartile (75th percentile)
Step 4: Calculate the interquartile range (IQR):
IQR_Y = Q3_Y - Q1_Y
Step 5: The code below returns the records whose ‘Y’ value is an outlier.
df[np.logical_or(df['Y'] < (Q1_Y - 1.5 * IQR_Y), df['Y'] > (Q3_Y + 1.5 * IQR_Y))]
# Q1_Y - 1.5 * IQR_Y = 700 - (1.5 * 798) = 700 - 1197 = -497: the lower boundary; values below it are outliers on the minimum side.
# Q3_Y + 1.5 * IQR_Y = 1498 + (1.5 * 798) = 1498 + 1197 = 2695: the upper boundary; values above it are outliers on the maximum side.
As the screenshot above shows, and as the box plot below visualizes:
We have two low outliers, -599 and -978, below the lower boundary of -497.
And we have three high outliers, 20000, 34000 and 55000, above the upper boundary of 2695.
If outliers like these, lying below the lower boundary or above the upper boundary, go undetected, they can distort statistical inference about the overall data and thereby compromise the accuracy of the predictive model.
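Steps 1 to 5 can be pulled together into one small, runnable sketch. The DataFrame below holds hypothetical values and is not the dataset from the screenshot, so its quartiles and boundaries differ from the 700, 1498, -497 and 2695 quoted above. The function name iqr_outliers and the parameter k are likewise illustrative.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return (lower, upper, mask) where mask marks values outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = (series < lower) | (series > upper)
    return lower, upper, mask

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"Y": [-900, 650, 700, 820, 990, 1100, 1250, 1400, 1500, 1600, 30000]}
)

lower, upper, mask = iqr_outliers(df["Y"])
outliers = df[mask]

print(f"lower boundary: {lower}, upper boundary: {upper}")
print(outliers)
```

For this made-up column, Q1 is 760 and Q3 is 1450, giving boundaries of -275 and 2485, so the rows with -900 and 30000 are returned. Keeping the multiplier as a parameter makes it easy to try a stricter or looser fence than the conventional 1.5.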
There are many ways to deal with outliers. One of them is data capping, where outliers below the lower boundary are replaced by the first-quartile value (Q1) and those above the upper boundary are replaced by the third-quartile value (Q3).
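A minimal sketch of capping, assuming pandas and hypothetical data. The first function follows the description above and replaces outliers with Q1 and Q3. The second shows a common variant, named plainly here as a different choice, that clips values to the boundaries themselves. Both function names are illustrative.

```python
import pandas as pd

def _fences(series, k=1.5):
    """Return (q1, q3, lower, upper) for the IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr

def cap_with_quartiles(series, k=1.5):
    """Replace low outliers with Q1 and high outliers with Q3."""
    s = series.astype(float)
    q1, q3, lower, upper = _fences(s, k)
    return s.mask(s < lower, q1).mask(s > upper, q3)

def clip_to_fences(series, k=1.5):
    """Variant: clip outliers to the lower and upper boundaries instead."""
    s = series.astype(float)
    _, _, lower, upper = _fences(s, k)
    return s.clip(lower=lower, upper=upper)

# Hypothetical data for illustration only.
y = pd.Series([-900, 650, 700, 820, 990, 1100, 1250, 1400, 1500, 1600, 30000])

capped = cap_with_quartiles(y)
clipped = clip_to_fences(y)
print(capped.tolist())
print(clipped.tolist())
```

Both versions leave the non-outlying values untouched. Which replacement value is appropriate depends on the data and the model, so treat either as a starting point rather than a rule.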
3. Hypothesis testing: There are various statistical tests that can be performed to detect outliers, and many of them rely on hypothesis testing.
Ø Hypothesis tests are generally used to draw conclusions about an entire population from a random sample.
Ø A hypothesis test weighs two mutually exclusive statements about a population to determine which one is better supported by the sample data:
o Null hypothesis, e.g. a variable has no outliers.
o Alternative hypothesis, e.g. a variable has outliers.
Ø The following three statistical tests use hypothesis testing to identify outliers:
o Grubbs’ test
o Chi-square test
o Dixon’s Q test
Ø Grubbs’ test and Dixon’s Q test assume that the data being checked for outliers is normally distributed.
Ø The chi-square test serves the same purpose but bases its test statistic on the chi-square distribution.
Ø Dixon’s Q test is generally applied to datasets or samples containing very few observations and is hence rarely used in data science.
Ø All three tests use the following null and alternative hypotheses to detect outliers:
H0: There are no outliers.
H1: There is at least one outlier.
Ø The conclusion about outliers is based on the p-value and the significance level.
Ø With a significance level of 0.05, if the p-value is less than the significance level, the null hypothesis (H0: There are no outliers) is rejected in favour of the alternative hypothesis (H1: There is at least one outlier in the dataset).
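As an illustration of one of these tests, here is a minimal sketch of the two-sided Grubbs' test, assuming NumPy and SciPy are available. Instead of computing a p-value, it compares the test statistic G with its critical value at the chosen significance level, which leads to the same reject or do-not-reject decision. The function name grubbs_test and the sample measurements are hypothetical, and the test assumes approximately normal data.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Returns (G, G_critical, suspect_value, reject_H0).
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    sd = x.std(ddof=1)  # sample standard deviation

    # Test statistic: largest absolute deviation from the mean, in units of sd.
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd

    # Critical value from the t distribution with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return g, g_crit, x[idx], bool(g > g_crit)

# Hypothetical measurements for illustration only.
sample = [199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57]
g, g_crit, suspect, reject = grubbs_test(sample)
print(f"G = {g:.4f}, critical value = {g_crit:.4f}, suspect = {suspect}, reject H0: {reject}")
```

For this sample G exceeds the critical value, so H0 (there are no outliers) is rejected and 245.57 is flagged. Grubbs' test checks one suspect value at a time, so applying it repeatedly after removing a point calls for some caution.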