Outliers and How to Handle Them

Eliud Nduati
AfriTech Blurbs
8 min read · Aug 30, 2021


This is the next post in our data cleaning journey: dealing with outliers. The outliers here are not the people Malcolm Gladwell writes about in “Outliers”, though the definition is much the same.

In this post, we will discuss what outliers are, where they come from, how they affect our machine learning models, how to identify them and how to deal with them.

To understand what outliers are, we will answer three questions.

What are outliers in Data?

Statistically, outliers are observations that are distant from other observations.

Let’s try that again: An outlier is an observation that deviates significantly from the rest of the observations or the population.

Ok, let’s try that once more but using examples 😆

  • Given the ages of students in a class as 16, 17, 18, 21, 126, 22; 126 stands out as an outlier since it lies at an abnormal distance from the rest of the ages.
  • Suppose you are given a list of cities: London, Edinburgh, Los Angeles, Congo, Glasgow and Manchester. Congo stands out as an outlier because it is the name of a country, not a city. It deviates from the rest of the observations.
  • Last one 😁: Can you see the outlier in the following list of animals? Wolf, cat, dog, dogfish, elephant?

Where do outliers come from?

Since we now know what outliers are, let’s try to see where they come from.

  • Data collection & recording errors
  • Variance in the data

It is important to understand the cause of the outliers in our dataset before deciding on the best way to handle them.

What are the possible effects of outliers in ML models?

Outliers in our data cause problems for ML models, including longer training times, less accurate models and poorer results. It is therefore important to identify and deal with outliers before training a model. So how do we identify outliers in our dataset?
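
Before answering that, here is a minimal sketch of the "inaccurate models" point. It uses made-up data (an assumption for illustration, not part of the salaries example) and fits a straight line with and without a single corrupted value:

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 exactly.
x = np.arange(10, dtype=float)
y_clean = 2 * x + 1

# Copy the data and corrupt one observation, as a recording error might.
y_dirty = y_clean.copy()
y_dirty[-1] = 100.0

# Fit a straight line by least squares; polyfit returns [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_dirty, intercept_dirty = np.polyfit(x, y_dirty, 1)

print("slope without the outlier:", round(slope_clean, 3))
print("slope with one outlier:", round(slope_dirty, 3))
```

One bad value out of ten is enough to pull the fitted slope well away from the true value of 2, which is why it pays to find such points first.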

Identifying Outliers

Let’s create a DataFrame to use in this activity:

# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# creating our dataframe
salaries = pd.DataFrame()
salaries['Names'] = ['King', 'Josh', 'Rachel', 'Greg', 'Judy', 'Stacie', 'Jenny', 'Michel', 'Jude', 'Monica']
salaries['Salary'] = [20000, 18000, 21000, 150000, 13000, 21450, 500, 15000, 23000, 17500]
salaries['Ages'] = [23, 22, 21, 23, 65, 24, 25, 23, 20, 25]
salaries['City'] = ['Nairobi', 'Nakuru', 'Thika', 'Kisumu', 'Nairobi', 'New Delhi', 'Mombasa',
                    'Nairobi', 'Nakuru', 'Thika']
print(salaries)
image by Author

In this small dataset it is easy to spot the outliers by eye. However, this will not always be the case. There are several methods of checking for outliers: visualization (such as a boxplot or scatterplot), the IQR score (also called the mid-spread or H-spread) and the Z-score.

a. Visualization method

There are two visualizations that are popular and effective when it comes to identifying outliers.

  • boxplot — according to Wikipedia, a boxplot is a graphical method of depicting groups of numerical data through their quartiles. Whiskers on a boxplot indicate variability. The outliers will appear as individual points on the graph. Let’s see how this is displayed from our dataset above.
sns.boxplot(x=salaries['Salary'])
image by Author

From the chart above, two points lie outside the whiskers, away from the rest of the observations. From our data, we can pinpoint these points as Jenny’s salary of 500 and Greg’s salary of 150,000. These are the outliers, as they are nowhere near the quartiles.

The analysis above is a univariate outlier analysis: we have only looked at outliers in one variable, the Salary column. When we want to look at two variables together and identify outliers, we use a scatterplot.

  • scatterplot — Wikipedia defines a scatterplot as a type of plot or mathematical diagram using cartesian coordinates to display values for two variables for a data set. The scatterplot will show values for two variables displayed as a collection of points. The points that appear furthest from the group are most likely outliers.
# scatterplot for ages and salaries
fig, ax = plt.subplots(figsize=(15, 8))
ax.scatter(salaries['Ages'], salaries['Salary'])
ax.set_xlabel('Ages')
ax.set_ylabel('Salaries')
plt.show()
image by Author

Most of the data in the scatterplot is clustered at the bottom left. However, a few points lie far from the “crowd”; these are our outliers.

b. Z-score method

The Z-score is the signed number of standard deviations by which an observation lies above or below the mean of what is being observed or measured. It describes any data point in terms of its relationship to the mean and the standard deviation of the group of data points.

Calculating the Z-score centers and rescales the data, so the data points that are too far from 0 can be identified and treated as outliers. In most cases a threshold of 3 is used: values with a Z-score above 3 or below -3 are identified as outliers.
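
Before moving to a larger dataset, the definition can be checked by hand on the Salary column from earlier. The sketch below computes z = (x - mean) / std directly with NumPy. The variable names are illustrative, and it assumes the population standard deviation (ddof=0), which is also the default in scipy.stats.zscore:

```python
import numpy as np

# Salary column from the salaries DataFrame created earlier.
salary = np.array([20000, 18000, 21000, 150000, 13000,
                   21450, 500, 15000, 23000, 17500], dtype=float)

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z_salary = (salary - salary.mean()) / salary.std()
print(np.round(z_salary, 2))

# Flag values whose absolute z-score exceeds the chosen threshold.
threshold = 3
print("flagged at |z| > 3:", salary[np.abs(z_salary) > threshold])
print("largest |z|:", round(np.abs(z_salary).max(), 2))
```

On this ten-row sample nothing passes the threshold of 3, although Greg's salary comes close. With the population standard deviation, |z| cannot exceed sqrt(n - 1), which is exactly 3 for n = 10. For very small samples, a lower threshold or the IQR method may be more useful.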

To do this activity, we are going to use the Boston housing dataset from sklearn. This is shown below:

from sklearn.datasets import load_boston

# note: load_boston was removed in scikit-learn 1.2, so this needs an older version
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

# creating the dataframe
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.head()

The above code loads the data into the boston_df DataFrame, which we use below:

# checking the z-score of the data in the dataset
z = np.abs(stats.zscore(boston_df))
print(z)
image by Author

One thing noticeable in the output above is that it is hard to tell which data points are outliers. To identify them, we use a threshold as shown below:

# filtering using a threshold of 3
print(np.where(z > 3))
image by Author

The first array in the result is a list of row indices, while the second is a list of column indices. This means, for example, that z[55][1] has a z-score above 3 and is therefore an outlier. Changing the threshold will give different results.
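
To make the two-array result easier to read, here is a small sketch on a made-up matrix of absolute z-scores (the values and the name z_demo are illustrative, not taken from the Boston data). It pairs each row index with its column index:

```python
import numpy as np

# Toy matrix of absolute z-scores (made-up values for illustration).
z_demo = np.array([[0.5, 3.2, 1.0],
                   [0.1, 0.4, 0.9],
                   [4.1, 0.2, 3.5]])

# np.where on a 2-D condition returns a tuple: (row indices, column indices).
rows, cols = np.where(z_demo > 3)

# Pair the two arrays to list each flagged cell with its value.
for r, c in zip(rows, cols):
    print(f"row {r}, column {c}: z = {z_demo[r, c]}")

# Unique row indices are the observations with at least one flagged value.
print("rows with outliers:", np.unique(rows))
```

Reading the arrays in pairs like this avoids mistaking the second array for a second list of rows.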

c. IQR method

The IQR (Interquartile range) is a measure of statistical dispersion. It is equal to the difference between the 75th and 25th percentiles (upper and lower quartiles).

IQR = Q3 - Q1

Similar to the Z-score method, the IQR method measures the spread of the data and then filters it using a threshold to identify the outliers.

# calculating IQR
Q1 = boston_df.quantile(0.25)
Q3 = boston_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
image by Author

The output is the IQR for each column in the dataset. (Try the same with the salaries dataset we had earlier and check out the results.)
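
For the salaries dataset, that exercise might look like the sketch below. It rebuilds only the numeric columns so that it runs on its own, and it also computes the common 1.5 * IQR fences. The variable names are illustrative:

```python
import pandas as pd

# Numeric columns of the salaries DataFrame created earlier.
salaries_num = pd.DataFrame({
    'Salary': [20000, 18000, 21000, 150000, 13000, 21450, 500, 15000, 23000, 17500],
    'Ages': [23, 22, 21, 23, 65, 24, 25, 23, 20, 25],
})

# Quartiles and IQR per column (pandas interpolates linearly by default).
q1 = salaries_num.quantile(0.25)
q3 = salaries_num.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR fences: values outside them are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(pd.DataFrame({'Q1': q1, 'Q3': q3, 'IQR': iqr, 'lower': lower, 'upper': upper}))
```

Any salary or age outside its lower and upper fence would be flagged by this rule.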

The code below gives a Boolean output, with True where a value is an outlier and False where it is valid. It checks each data point against the outlier conditions, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and prints the first 30 rows.

((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR))).head(30)
image by Author

So far so good!


Handling outliers

Once you have detected the outliers, the most important decision is what to do with them. The first option is usually to drop them; the alternatives are to mark them or to correct them by rescaling.

1. Dropping them

  • removing outliers using Z-score

We already calculated the z-score above, so here we just create a new dataset without the outliers.

boston_df_cleaned = boston_df[(z < 3).all(axis=1)]
print('Original dataset shape', boston_df.shape)
print("Dataset without outliers' shape", boston_df_cleaned.shape)
image by Author

As we can see above, about 91 rows with outliers have been dropped.

Dropping records is the simplest way of handling problem rows, but remember that every dropped record is data the model can no longer learn from. Dropping values or records therefore denies us the chance to train our models on all the data we have.

  • removing outliers using IQR score

We use the same approach as with the z-score, filtering out every row that has an outlier in any column:

boston_df_cleaned_out = boston_df[~((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR))).any(axis=1)]
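
The same filter can be tried on the small salaries dataset, where the dropped rows are easy to check by eye. This is a sketch with illustrative variable names; the fences are computed on the numeric columns only:

```python
import pandas as pd

salaries_demo = pd.DataFrame({
    'Names': ['King', 'Josh', 'Rachel', 'Greg', 'Judy',
              'Stacie', 'Jenny', 'Michel', 'Jude', 'Monica'],
    'Salary': [20000, 18000, 21000, 150000, 13000, 21450, 500, 15000, 23000, 17500],
    'Ages': [23, 22, 21, 23, 65, 24, 25, 23, 20, 25],
})

# Compute the quartiles and IQR on the numeric columns only.
numeric = salaries_demo[['Salary', 'Ages']]
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1

# True wherever a value falls outside the 1.5 * IQR fences.
outlier_mask = (numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))

# Keep only the rows with no flagged value in any column.
salaries_cleaned = salaries_demo[~outlier_mask.any(axis=1)]

print(salaries_cleaned)
print('rows dropped:', len(salaries_demo) - len(salaries_cleaned))
```

Here a row is removed if either its salary or its age is flagged, so three of the ten rows go. On a dataset this small, that is a large share of the data.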

2. Marking the outliers

Let’s go back to our salaries dataset and assume that the standard salary is 30,000. Anyone paid that much or more is treated as an outlier data point.

salaries['Outlier'] = np.where(salaries['Salary'] < 30000, 0, 1)
salaries

The code above will mark the row with Greg being paid 150,000 as an outlier.
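
One reason to mark rather than drop is that every row stays available, so results can be compared with and without the marked rows. A minimal sketch, rebuilding the two relevant columns so it runs on its own (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

salaries_demo = pd.DataFrame({
    'Names': ['King', 'Josh', 'Rachel', 'Greg', 'Judy',
              'Stacie', 'Jenny', 'Michel', 'Jude', 'Monica'],
    'Salary': [20000, 18000, 21000, 150000, 13000, 21450, 500, 15000, 23000, 17500],
})

# Same rule as above: 1 when the salary is 30,000 or more, otherwise 0.
salaries_demo['Outlier'] = np.where(salaries_demo['Salary'] < 30000, 0, 1)

# The flag keeps every row, so we can compare statistics both ways.
mean_all = salaries_demo['Salary'].mean()
mean_unmarked = salaries_demo.loc[salaries_demo['Outlier'] == 0, 'Salary'].mean()

print(salaries_demo[salaries_demo['Outlier'] == 1])
print('mean salary, all rows:', mean_all)
print('mean salary, unmarked rows:', round(mean_unmarked, 2))
```

Note that a one-sided rule like this does not mark Jenny's unusually low salary of 500. A lower bound would be needed for that.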

3. Rescaling the data

When we cannot afford to drop the outliers, or to mark them and leave them out of model training, we need a way to correct these values so they can still be used. This is where rescaling comes in: it pulls extreme values closer to the rest of the data so that they can be used in the model.

salaries['Log of Salaries'] = np.log(salaries['Salary'])
salaries

The output in this case will be

image by Author

In the new column (Log of Salaries), you can see that the values are closer together: they have been rescaled and the variance has been reduced.
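
One way to check the effect of the log transform is to compare z-scores before and after. The sketch below does this for the Salary column; the zscores helper is a hypothetical convenience function written for this example, using the population standard deviation:

```python
import numpy as np

# Salary column from the salaries DataFrame, in the same row order.
salary = np.array([20000, 18000, 21000, 150000, 13000,
                   21450, 500, 15000, 23000, 17500], dtype=float)
log_salary = np.log(salary)

def zscores(values):
    """Population z-scores: (x - mean) / std."""
    return (values - values.mean()) / values.std()

z_raw = zscores(salary)
z_log = zscores(log_salary)

# Print each salary with its z-score before and after the transform.
for s, before, after in zip(salary, z_raw, z_log):
    print(f"{s:>9.0f}  z before: {before:6.2f}  z after log: {after:6.2f}")
```

On this data the log brings Greg's 150,000 much closer to the rest. It also makes Jenny's 500 the most extreme value in relative terms, so it is worth checking both tails after rescaling.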

Conclusion

We have seen how to identify outliers in our data and what problems can result from the outliers. We have then looked at different methods of handling the outliers. Sometimes we might need to try all the methods learnt to see which one works best for our case.
