A Comprehensive guide on handling Missing Values
Most of the real world data contains missing values. They occur due to many reasons like some observations were not recorded and corruption of data.
Handling missing values is important because most of the machine learning algorithms don’t support data with missing values. These missing values in the data are to be handled properly. If not, it leads to drawing inaccurate inference about the data.
We will be working on Titanic dataset, to explore various methods of handling missing values. Let us get started.
First we will import all necessary libraries. Where Numpy package is for performing Mathematical operations and Pandas package is for manipulating dataset.
# Importing all necessary librariesimport numpy as np
import pandas as pd
Now we will be loading our downloaded titanic dataset using pandas read_csv() into a dataframe. And also we will be creating a copy of that dataframe, so that we will handle the Missing Values using Pandas in one dataframe and using SciKit-Learn using another.
# Loading the datasetdf = pd.read_csv('datasets/titanic.csv')# Copying the dataframenew_df = df.copy()
Let us view first few rows of data to get insights about the data using pandas head().
# First few rows of datadf.head(10)
From the first few rows, we noticed that the dataset comprises of columns like PassengerId, Survived, Pclass, Cabin ..etc, total 12 columns. Missing values (NaN ) are visible in Age and Cabin columns, by viewing first 10 rows of data. We will look into more deep about columns having missing values and no of missing values each column has later. Lets do some Data Exploration to get some insights from it.
Let’s see the dimensions of data. Dimensions mean the number of columns and the number of observations of data, the dataset comprised of. To get the dimesions of data we will be using pandas shape().
# Get the dimensions of the datadf.shape
By executing the pandas.shape, it returned (891, 12). It indicates that the whole dataset comprised of 891 observations and 12 columns in it.
Now let’s check the summary statistics of data using pandas describe() method.
# Summary Statistics of the datadf.describe()
It returned descriptive statistics of all Numerical type column’s. We can view count, mean, median, max ..etc, of each numerical data type column in the dataset. By this summary statistics, we can conclude whether there are outliers present or not which effects mean. For titanic dataset there are no outliers present.
Now check the datatype of each column in the dataset using pandas dtypes.
# View the datatype of each column in the datasetdf.dtypes
A basic exploration of data is done. Now quickly jump into the various techniques to handle the missing values.
Number of Missing Values in each column
To get the number of missing values in each column, we can use pandas isnull() or notnull() methods. isnull() method returns an boolean value TRUE when null value is present whereas notnull() method returns an boolean value TRUE, if there is no null value for every observation in the data. We will add them so that it gives the no of null or non-null values in both cases.
# Number of missing values in the datadf.isnull().sum()
# Number of non - null values in the datadf.notnull().sum()
So from the above observations, null-values are present in Age, Cabin and Embarked columns. In Age, there are about 177 null values, 687 in Cabin and 2 in Embarked. Even you get the same when we subract number of non-null values from total number of observations.
We can also get the number of non-null values of each column with it’s datatype using pandas info() method.
# Number of non - null values in the datadf.info()
Missing values are present in Age, Cabin and Embarked columns. Age is of Numerical Datatype and Cabin, Embarked are of Object Datatype which may be string or a character. Now we are going to look how to handle Missing Values in colums which are of Numerical Datatype.
Either a column is Numerical or Categorical, we can delete the observations having null values in the dataset or the column that is having more number of null values.
Observations that are having null values can be deleted using pandas dropna() method.
# Deleting rowsmod_df = df.dropna()# No of non - null values in the modified datamod_df.info()
So there are only 183 observations in the modified dataset out of 891 observation after dropping rows having null values.
Now let’s see how to drop columns that are having maximum number of null values in them.
# Set the limit
# Drop columns using that limitlimit = len(df_n) * 0.7
new=df.dropna(axis=1, thresh=limit)# View columns in the datasetnew.columns
We have set a limit to 70 percent, so that the columns having null values more than or equal to 70 percent are dropped from the dataframe. Viewing columns to see which column in the dataset has been dropped.
So from the above, we noticed that the Cabin column has been dropped. Since Cabin column consists of nearly 687 null values out of 891, which is 77.12 percent which meets our limit criteria. So Cabin columns has been dropped.
This method of deleting rows/columns is advised only when there are enough samples in the dataset. Dropping of column having more number of null values is to be done only when column deleted doesn’t effect our analysis or that column is out of scope for our model. Removing data leads to loss of information which inturn we will not the desired output while predicting.
Imputing with Mean/Median/Mode/Backward fill/ Forward fill
Null values are replaced with mean/median.mode in this method. This is the statistical method of handling Null values. This method yields good results when compared to removing null values.
The mean of the numerical column data is used to replace null values when the data is normally distributed. Median is used if the data comprised of outliers. Mode is used when the data having more occurences of a particular value or more frequent value.
Let’s replace null values in the Age column by pandas.
# Mean of Age columndf['Age'].mean()
# Replace Null Values (np.nan) with meandf['Age'] = df['Age'].replace(np.nan, df['Age'].mean())# Alternate way to fill null values with meandf['Age'] = df['Age'].fillna(df['Age'].mean())
Let’s check the total number of null values in the Age column
# Checking for null values in Age columndf['Age'].isnull().sum()
We’ve got number of null values in the Age column as 0. Hence null values in Age are replaced by its mean. Wanna have a glimpse, whether they are replaced with mean or any other value start comparing any observation having null value before and after this step. We will notice that mean value is replaced in its position. Similar to that, we can also replace null values with median and mode.
# In the same way we can impute using median and modedf['Age'] = df['Age'].replace(np.nan, df['Age'].median())df['Age'] = df['Age'].replace(np.nan, df['Age'].mode())# Alternate ways to impute null values with median and modedf['Age'] = df['Age'].fillna(df['Age'].median())df['Age'] = df['Age'].fillna(df['Age'].mode())
In addition to this statistical method of filling null values, we can also fill null values by it’s previous value in the column which is called Backward fill or next occurring value in the column which is called Forward fill.
# Backward fill or Forward fill can be used to impute the previous or next values# Backward filldf['Age'] = df['Age'].fillna(method='bfill')# Forward filldf['Age'] = df['Age'].fillna(method='ffill')
Now let’s replace null values in Age column by SciKit Learn.
# Replacing the null values in the Age column with Meanfrom sklearn.impute import SimpleImputerimputer = SimpleImputer(missing_values=np.nan, strategy='mean')# Fit and transform to the parametersnew_df['Age'] = imputer.fit_transform(new_df[['Age']])# Checking for any null valuesnew_df['Age'].isnull().sum()
Hence, the null values are replaced with mean and no null values are present in the Age column. We can also replace them with median as follows
# Alternatively, we can replace null values with median, most frequent value and also with an constant# Replace with Medianimputer = SimpleImputer(missing_values=np.nan, strategy='median')new_df['Age'] = imputer.fit_transform(new_df[['Age']])
Hence the null values in the columns which are of Numerical type are identified and handled. Now lets see different methods of handling categorical data.
Cabin and Embarked columns are of Categorical datatype. Now let’s see how to handle null values which are of Categorical type. View first few rows of these two columns to know which type of values these two columns comprised of in the dataframe.
# Handling Missing values in Categorical datadf[['Cabin','Embarked']].head()
Number of null values both Cabin and Embarked column has in the titanic dataset.
# Number of Missing values in both the columnsdf[['Cabin', 'Embarked']].isnull().sum()
From the above exploration of Cabin and Embarked columns, Embarked column may contain more frequent values in it. Since the values are a type of class. So replacing those null values with most frequent value among them.
# Most frequent values in the Embarked column datadf['Embarked'].value_counts()
From the value_counts(), the most frequent value in the Embarked column is
S. So replacing all null values in the Embarked column with most frequent value
# Replacing the null values with the most frequent valuedf['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().index)
Let’s check the total number of null values in the Embarked column and also see the value_counts() to check whether they are replaced with
S or not.
# Checking for null values in the Embarked Columndf['Embarked'].isnull().sum()# Value counts for Embarked columndf['Embarked'].value_counts()
We can observe that the number of null values in Embarked column is as 0, and also value_counts() for
S had increased by 2. So null values in the Embarked column are replaced by most frequent value.
Another way of handling null values in the column which is of categorical type, is to add an Unknown class and replace those null values with that class.
# Replacing null values with Unknown Classdf['Cabin'] = df['Cabin'].fillna('Unknown')# Value counts for Cabin Columndf['Cabin'].value_counts()# Checking for null values in Cabin columndf['Cabin'].isnull().sum()
Null values in Cabin column are replaced by an Unknown class.
Now let’s see how to handle missing values in the columns which are of categorical type using SciKit Learn.
Replacing null values in Embarked column with most frequent value using SciKit Learn’s SimpleImputer class.
# Replacing null values in Embarked with most frequent valueimputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')new_df['Embarked'] = imputer.fit_transform(new_df[['Embarked']])# Value counts for Embarked columnnew_df['Embarked'].value_counts()
Null values in Embarked column are replaced by most frequent value using SimpleImputer.
Using SciKit Learn SimpleImputer class we can also replace a null value with a custom value. Replacing all null values in Cabin column with an Unknown class.
# Replacing null values in Cabin with Unknown classimputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='Unknown')new_df['Cabin'] = imputer.fit_transform(new_df[['Cabin']])# Checking for null values in the Cabin columnnew_df['Cabin'].isnull().sum()# Value counts for Cabin columnnew_df['Cabin'].value_counts().head(5)
# No of non - null values in the modified datasetdf.info()
The null values in the columns which are of Categorical type are identified and replaced.
Saving the modified dataset to an .csv file using pandas to_csv() method.
# Save to an csv filedf.to_csv('datasets/data_titanic.csv')
Predict the Missing Values
By using the columns or features that doesn’t have missing values, we can predict the null values in other columns using Machine Learning Algorithms like Classification algorithms, Decission Trees, Regression algorithms for both Numerical and Categorical type of data. This method may result in better accuracy.
Most of the dataset’s we will come across in real world contains null values. Those missing values are to be handled before training the model. In this story, we had gone through number of ways in which null values can be replaced.
Let me know if you have anything to ask. Do share the story if you liked it.