Identifying, Cleaning and replacing outliers | Titanic Dataset

Alamin Musa Magaga
Published in Analytics Vidhya
7 min read · Apr 5, 2021

Applying a robust concept to treat outliers

Outliers are values that differ extremely from the bulk of the data. Their presence can significantly reduce the performance and accuracy of a predictive model.

How good a machine learning model is depends on how clean the data is. Outliers may result from errors during data collection, but some of these extreme values may be valid and legitimate. For example, compare the goals scored by Ronaldo or Messi with those of average players, or the earnings of top actors like Dwayne Johnson and Ryan Reynolds with those of other actors: the figures are clearly not comparable, and the margin is very large.

So during data analysis, such scores and earnings may appear as outliers. That is why a broader, more extensive analysis of the data is needed to tell legitimate extreme values apart from erroneous outliers.

We are going to use the Titanic dataset to identify, clean, and replace outliers. Now, let's explore our data and do some basic data preprocessing.

Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read the data set

df=pd.read_csv('titanic.csv')
df.head()
df.dtypes

View the statistical details

df.describe()

Percentage of missing values

missing_values=df.isnull().sum()
missing_values[missing_values>0]/len(df)*100

Visualizing the missing values

sns.heatmap(df.isnull(),yticklabels=False,cbar=False)

Dropping the irrelevant columns

df.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
df.head()

Filling of Missing values

df['Age']=df['Age'].fillna(df['Age'].mode()[0])
df['Embarked']=df['Embarked'].fillna(df['Embarked'].mode()[0])

In the code above, we checked the data types and missing values, dropped some irrelevant columns, and filled the missing values in the Age and Embarked columns with their most frequent value (the mode).

Outliers Identification

There are different ways and methods of identifying outliers, but we are only going to use some of the most popular techniques:

  • Visualization
  • Skewness
  • Interquartile Range
  • Standard Deviation

Visualization

Outliers can be detected with different visualization methods. We are going to use:

  • Boxplot
  • Histogram

Boxplot

A boxplot is a visualization tool for identifying outliers. It displays the distribution of the observations through five summary values: the minimum and maximum (the ends of the whiskers), the 25th percentile (first quartile, Q1), the median (50th percentile), and the 75th percentile (third quartile, Q3).

Outliers appear as individual points below the minimum or above the maximum of the boxplot, that is, beyond the whiskers.

The line of code below plots the boxplot of the ‘Fare’ variable.

sns.boxplot(x=df['Fare'])

In the boxplot above, the black circular points (marked by an arrow) show the presence of extreme values in the variable.

Histogram

A histogram visualizes the distribution of a numerical variable and shows the direction in which its values are spread. Outliers appear outside the overall distribution of the data. If the histogram is right-skewed or left-skewed, it may indicate the presence of extreme values or outliers.

The code below plots the histogram of the ‘Fare’ variable.

df['Fare'].hist()

In the histogram above, most values are concentrated on the left with a long tail stretching to the right (a right-skewed distribution). This also indicates the presence of outliers.

Skewness

A perfectly symmetric distribution, such as the normal distribution, has a skewness of 0. As a rule of thumb, a skewness value between -1 and 1 is considered acceptable, and a value well outside this range may indicate the presence of outliers.

The code below prints the skewness values of the ‘Age’ and ‘Fare’ variables.

print('skewness value of Age: ',df['Age'].skew())
print('skewness value of Fare: ',df['Fare'].skew())

Out[ ]:

skewness value of Age: 0.6577529069911331
skewness value of Fare: 4.787316519674893

From the output above, the ‘Fare’ skewness value of 4.79 shows that the variable is strongly right-skewed, indicating the presence of outliers.
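To see how a single extreme value drives this statistic, here is a minimal sketch on made-up numbers. The fares below are hypothetical and are not taken from the Titanic data; the sketch only compares the skewness of a small series with and without one very large value.

```python
import pandas as pd

# Hypothetical fares, invented for illustration only
fares = pd.Series([7.0, 8.0, 8.5, 9.0, 10.0, 12.0, 13.0, 15.0])

# The same values plus one extreme fare
fares_with_extreme = pd.concat([fares, pd.Series([500.0])], ignore_index=True)

skew_without = fares.skew()
skew_with = fares_with_extreme.skew()

print('skewness without the extreme value:', skew_without)
print('skewness with the extreme value:', skew_with)
```

The first value stays inside the -1 to 1 range, while the second moves well above 1, which is the same pattern we see for ‘Fare’.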

Interquartile Range(IQR)

The interquartile range is a measure of statistical dispersion, calculated as the difference between the 75th and 25th percentiles. The quartiles divide the data set into four equal parts; the values that separate the parts are called the first, second, and third quartiles.

This code computes the interquartile range of the ‘Fare’ variable.

Q1=df['Fare'].quantile(0.25)
Q3=df['Fare'].quantile(0.75)
IQR=Q3-Q1
print(IQR)

Out[]:

23.0896

The code below sets the 25th and 75th percentiles of the ‘Fare’ variable and displays the outliers, that is, the rows whose fare lies more than 1.5 times the IQR below Q1 or above Q3. The same bounds will also be used for flooring and capping in the outlier treatment process.

Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
whisker_width = 1.5
Fare_outliers = df[(df['Fare'] < Q1 - whisker_width*IQR) | (df['Fare'] > Q3 + whisker_width*IQR)]
Fare_outliers.head()
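Because the same bounds come up again during treatment, it can help to wrap the calculation in a small function. The sketch below is one possible way to do that; the helper name iqr_bounds and the sample fares are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(values, whisker_width=1.5):
    """Return the (lower, upper) bounds built from the quartiles of `values`.

    Hypothetical helper, not from the original notebook.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - whisker_width * iqr, q3 + whisker_width * iqr

# Hypothetical fares, invented for illustration only
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 250.0])

lower, upper = iqr_bounds(fares)

# Keep only the values that fall outside the bounds
outliers = fares[(fares < lower) | (fares > upper)]

print('lower bound:', lower, 'upper bound:', upper)
print('outliers:', outliers.tolist())
```

On the Titanic data, the same function could be called as iqr_bounds(df['Fare']).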

Standard Deviation

Standard deviation measures the amount of variation or dispersion of a set of values relative to their mean; it shows how widely the data is spread.

A high standard deviation indicates that the values are highly dispersed while a low standard deviation indicates that the variation or dispersion of the values is low.

The code below displays the outliers, defined here as fares that lie more than three standard deviations away from the mean.

fare_mean = df['Fare'].mean()
fare_std = df['Fare'].std()
low= fare_mean -(3 * fare_std)
high= fare_mean + (3 * fare_std)
fare_outliers = df[(df['Fare'] < low) | (df['Fare'] > high)]
fare_outliers.head()
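The same three-standard-deviation rule can be expressed with z-scores, which measure how many standard deviations each value lies from the mean. The sketch below uses hypothetical fares rather than the Titanic data. It deliberately uses twenty values: with ten or fewer points, no value can lie more than three sample standard deviations from the mean, so the rule would never flag anything.

```python
import pandas as pd

# Hypothetical fares: 19 ordinary values and one extreme value
fares = pd.Series([8.0, 9.0, 10.0, 11.0, 12.0] * 3 + [7.0, 13.0, 9.5, 10.5, 300.0])

# z-score: distance from the mean, measured in standard deviations
z_scores = (fares - fares.mean()) / fares.std()

# Flag values more than 3 standard deviations away from the mean
outliers = fares[z_scores.abs() > 3]

print(outliers.tolist())
```

Keep in mind that the mean and standard deviation are themselves pulled by extreme values, so this rule can miss outliers that the IQR method catches.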

Outliers Treatment

  • Flooring and Capping.
  • Trimming.
  • Replacing outliers with the mean, median, mode, or other values.

Flooring And Capping

In this quantile-based technique, values below a lower bound are raised to that bound (flooring) and values above an upper bound are lowered to it (capping). The bounds are derived from the 25th and 75th percentiles.

The code below does not drop any rows. It replaces every ‘Fare’ value below the lower whisker (Q1 - 1.5 * IQR) with the lower whisker, and every value above the upper whisker (Q3 + 1.5 * IQR) with the upper whisker.

Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
whisker_width = 1.5
lower_whisker = Q1 -(whisker_width*IQR)
upper_whisker = Q3 + (whisker_width*IQR)
df['Fare']=np.where(df['Fare']>upper_whisker,upper_whisker,np.where(df['Fare']<lower_whisker,lower_whisker,df['Fare']))
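The same flooring and capping can also be sketched with the pandas Series.clip method, which limits values to a lower and an upper bound in one call. This is an alternative to the nested np.where, not the approach used in the original notebook, and the fares below are hypothetical.

```python
import pandas as pd

# Hypothetical fares, invented for illustration only
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 250.0])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Values below lower_whisker become lower_whisker,
# values above upper_whisker become upper_whisker
capped = fares.clip(lower=lower_whisker, upper=upper_whisker)

print(capped.tolist())
```

No rows are removed: the series keeps its length, and only the extreme value is pulled back to the upper whisker.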

We can now use the boxplot or another outlier identification method to check whether any outliers remain.

The boxplot below shows no outliers.

sns.boxplot(x=df['Fare'])

We now compare the two boxplots, one before and one after the treatment of the outliers.

Trimming

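In this method, we remove the outliers completely. The code below collects the index of every row whose ‘Fare’ falls outside the whiskers and drops those rows. Note that the bounds here are built from the 10th and 90th percentiles rather than the quartiles, although the variable names Q1, Q3, and IQR are kept.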

Q1 = df['Fare'].quantile(0.10)
Q3 = df['Fare'].quantile(0.90)
IQR = Q3 - Q1
whisker_width = 1.5
lower_whisker = Q1 - (whisker_width*IQR)
upper_whisker = Q3 + (whisker_width*IQR)
index=df['Fare'][(df['Fare']>upper_whisker)|(df['Fare']<lower_whisker)].index
df.drop(index,inplace=True)
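Trimming can also be written with a boolean mask, which returns a new DataFrame and leaves the original untouched. The sketch below is a minimal illustration on a hypothetical two-column frame; unlike the code above, it assumes the usual quartile-based whiskers.

```python
import pandas as pd

# Hypothetical passengers, invented for illustration only
df_demo = pd.DataFrame({
    'Fare': [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 250.0],
    'Survived': [0, 1, 0, 1, 0, 1, 0, 1],
})

q1 = df_demo['Fare'].quantile(0.25)
q3 = df_demo['Fare'].quantile(0.75)
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# True for rows whose fare lies within the whiskers (bounds included)
inside = df_demo['Fare'].between(lower_whisker, upper_whisker)

# Keep only those rows; df_demo itself is not modified
trimmed = df_demo[inside].reset_index(drop=True)

print(len(df_demo), 'rows before,', len(trimmed), 'rows after')
```

Remember that trimming removes whole rows, so the other columns of the dropped passengers are lost as well.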

Comparing the boxplots before and after the treatment, we still observe a few extreme values. These may be newly generated, because the quartiles and whiskers are recomputed on the trimmed data.

Replacing Outliers With The Mean, Median, Mode, or other Values

In this technique, we replace the extreme values with the mode. You can also use the median or the mean, but the mean is not advised because it is itself highly susceptible to outliers.
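A minimal sketch of this replacement is shown below. It is not the original notebook code: it assumes the outliers are flagged with the quartile-based whiskers, uses hypothetical fares, and replaces the flagged values with the median of the remaining values.

```python
import pandas as pd

# Hypothetical fares, invented for illustration only
fares = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, 11.0, 12.0, 250.0])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# True where a value lies outside the whiskers
is_outlier = (fares < lower_whisker) | (fares > upper_whisker)

# Median of the values that are not flagged as outliers
median_value = fares[~is_outlier].median()

# Replace the flagged values and keep everything else unchanged
replaced = fares.mask(is_outlier, median_value)

print(replaced.tolist())
```

To use the mode instead, as described above, median_value could be swapped for fares.mode()[0].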

We again compare the two boxplots, one before and one after the treatment of the outliers.

Conclusion

In this blog, you have seen some simple concepts and methods for identifying outliers (visualization, skewness, interquartile range (IQR), and standard deviation) and for treating them (trimming, flooring and capping, and replacement with the mean, median, or mode) that you can apply to successfully identify and treat outliers.
