# A Comprehensive Guide on Handling Missing Values

## Most real-world data contains missing values. They occur for many reasons, such as observations not being recorded or data corruption.

Handling missing values is important because most machine learning algorithms don’t support data with missing values. These missing values have to be handled properly; if not, we may draw inaccurate inferences from the data.

Let us look at different ways of handling missing values using both **Pandas** and **scikit-learn**.

We will be working on the Titanic dataset to explore various methods of handling missing values. Let us get started.

*Download the Titanic dataset before following along.*

First, we will import all the necessary libraries: the NumPy package for performing mathematical operations and the Pandas package for manipulating the dataset.

```python
# Importing all necessary libraries
import numpy as np
import pandas as pd
```

Now we will load the downloaded Titanic dataset into a DataFrame using pandas *read_csv()*. We will also create a copy of that DataFrame, so that we can handle the missing values using Pandas in one DataFrame and using scikit-learn in the other.

```python
# Loading the dataset
df = pd.read_csv('datasets/titanic.csv')

# Copying the dataframe
new_df = df.copy()
```

Let us view the first few rows of the data to get some initial insights, using pandas **head()**.

```python
# First few rows of data
df.head(10)
```

From the first few rows, we notice that the dataset comprises 12 columns, including PassengerId, Survived, Pclass, Cabin, and so on. Missing values (NaN) are already visible in the Age and Cabin columns in these first 10 rows. We will look more closely at which columns have missing values, and how many, later. First, let's do some data exploration to get some insights.

Let’s see the dimensions of the data, that is, the number of observations and the number of columns the dataset comprises. To get the dimensions of the data we use the pandas *shape* attribute.

```python
# Get the dimensions of the data
df.shape
```

Executing df.shape returns (891, 12), indicating that the whole dataset comprises 891 observations and 12 columns.

Now let’s check the summary statistics of the data using the pandas **describe()** method.

```python
# Summary statistics of the data
df.describe()
```

It returns descriptive statistics for all numerical columns: we can view the count, mean, quartiles, min, max, etc., of each numerical column in the dataset. From these summary statistics we can judge whether outliers are present, since outliers affect the mean. For the Age column of the Titanic dataset, there are no extreme outliers.
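To see how summary statistics reveal an outlier, here is a small illustration on a made-up series (the values are hypothetical, not from the Titanic data): a max far above the 75th percentile is a strong hint.

```python
import pandas as pd

# Hypothetical fares (not the Titanic data) with one extreme value
fares = pd.Series([7.2, 8.0, 9.5, 10.1, 512.3])
stats = fares.describe()

# A max far above the 75th percentile hints at an outlier
print(stats['75%'])  # 10.1
print(stats['max'])  # 512.3
```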

Now check the datatype of each column in the dataset using the pandas **dtypes** attribute.

```python
# View the datatype of each column in the dataset
df.dtypes
```

A basic exploration of the data is done. Now let's jump into the various techniques for handling the missing values.

## Number of Missing Values in each column

To get the number of missing values in each column, we can use the pandas **isnull()** or **notnull()** methods. **isnull()** returns the boolean value **True** wherever a null value is present, whereas **notnull()** returns **True** wherever a value is not null. We then sum these booleans, which gives the number of null or non-null values in each column.

```python
# Number of missing values in the data
df.isnull().sum()

# Number of non-null values in the data
df.notnull().sum()
```

So from the above output, null values are present in the Age, Cabin, and Embarked columns: about 177 in Age, 687 in Cabin, and 2 in Embarked. You get the same numbers when you subtract the number of non-null values from the total number of observations.
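This complement relationship can be sanity-checked on a tiny hypothetical DataFrame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A tiny hypothetical frame with some missing values
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
})

print(toy.isnull().sum().tolist())   # [2, 1]
print(toy.notnull().sum().tolist())  # [2, 3]

# For every column: nulls + non-nulls == total number of rows
print(bool((toy.isnull().sum() + toy.notnull().sum() == len(toy)).all()))  # True
```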

We can also get the number of non-null values in each column, along with its datatype, using the pandas **info()** method.

```python
# Number of non-null values in the data
df.info()
```

**Numerical Data**


Missing values are present in the **Age**, **Cabin**, and **Embarked** columns. **Age** is of **numerical** datatype, while **Cabin** and **Embarked** are of **object** datatype, which may hold strings or characters. Now let's look at how to handle missing values in columns that are of numerical datatype.

Whether a column is numerical or categorical, we can delete the observations having null values, or drop a column that has a large number of null values.

Observations that have null values can be deleted using the pandas **dropna()** method.

```python
# Deleting rows
mod_df = df.dropna()

# Number of non-null values in the modified data
mod_df.info()
```

So only 183 observations remain in the modified dataset, out of the original 891, after dropping rows having null values.

Now let’s see how to drop columns that have a large number of null values.

```python
# Set the limit: the minimum number of non-null values a column must have
limit = len(df) * 0.7

# Drop columns using that limit
new = df.dropna(axis=1, thresh=limit)

# View columns in the dataset
new.columns
```

We have set the threshold to 70 percent of the number of rows. The thresh parameter is the minimum number of non-null values a column must have to be kept, so any column with fewer than 70 percent non-null values (that is, more than 30 percent null values) is dropped from the DataFrame. We then view the columns to see which one has been dropped.
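The effect of thresh can be sketched on a small hypothetical frame (column names made up for illustration); remember that thresh counts non-null values, not nulls:

```python
import math
import numpy as np
import pandas as pd

# Hypothetical frame: one column mostly full, one mostly empty
toy = pd.DataFrame({
    'mostly_full':  [1.0, 2.0, 3.0, np.nan],        # 75% non-null
    'mostly_empty': [np.nan, np.nan, np.nan, 4.0],  # 25% non-null
})

# thresh is the minimum number of non-null values a column needs to survive
limit = math.ceil(len(toy) * 0.7)  # ceil(2.8) = 3
kept = toy.dropna(axis=1, thresh=limit)

print(list(kept.columns))  # ['mostly_full']
```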

So from the above, we notice that the **Cabin** column has been dropped: it consists of nearly 687 null values out of 891, which is 77.12 percent null, so it falls well below our threshold of 70 percent non-null values.

This method of deleting rows/columns is advised only when there are enough samples in the dataset. Dropping a column with many null values should be done only when the dropped column doesn’t affect our analysis, or is out of scope for our model. Removing data leads to loss of information, which in turn means we may not get the desired output while predicting.

**Imputing with Mean/Median/Mode/Backward Fill/Forward Fill**

In this method, null values are replaced with the mean, median, or mode. It is a statistical method of handling null values, and it generally yields better results than removing them.

The mean of a numerical column is used to replace null values when the data is normally distributed. The median is used if the data contains outliers. The mode is used when the data has a particular value occurring much more frequently than the rest.
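Why the choice matters can be sketched on toy numbers (hypothetical, not the Titanic data): with an outlier present, mean imputation fills in an unrepresentative value while median imputation does not.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed values with one outlier and one missing entry
s = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0, np.nan])

# Mean imputation is dragged upward by the outlier
filled_mean = s.fillna(s.mean())
print(filled_mean.iloc[-1])    # 109.2

# Median imputation stays close to the typical value
filled_median = s.fillna(s.median())
print(filled_median.iloc[-1])  # 12.0
```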

Let’s replace the null values in the **Age** column using **Pandas**.

```python
# Mean of Age column
df['Age'].mean()

# Median of Age column
df['Age'].median()

# Mode of Age column
df['Age'].mode()
```

We get a **mean** of **29.69**, a **median** of **28.0**, and a **mode** of **24.0** for the **Age** column. Since the Age column has no outliers, we replace the null values with the mean, using the pandas **replace()** or **fillna()** methods.

```python
# Replace null values (np.nan) with the mean
df['Age'] = df['Age'].replace(np.nan, df['Age'].mean())

# Alternate way to fill null values with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
```
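The two approaches are interchangeable for this purpose; a quick sketch on a toy series (hypothetical numbers) shows they produce identical results:

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values
s = pd.Series([20.0, np.nan, 30.0, np.nan, 40.0])
m = s.mean()  # 30.0 (NaNs are skipped by mean())

# Both calls fill the NaNs with the mean
via_replace = s.replace(np.nan, m)
via_fillna = s.fillna(m)

print(via_replace.equals(via_fillna))  # True
print(via_fillna.tolist())             # [20.0, 30.0, 30.0, 30.0, 40.0]
```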

Let’s check the total number of null values in the **Age** column.

```python
# Checking for null values in Age column
df['Age'].isnull().sum()
```

The number of null values in the **Age** column is now 0; the null values have been replaced by the mean. If you want to verify that they were replaced with the mean and not some other value, compare an observation that had a null value before and after this step: you will see the mean value in its place. Similarly, we can replace null values with the median or mode.

```python
# In the same way we can impute using the median and mode
# (mode() returns a Series, so take its first value)
df['Age'] = df['Age'].replace(np.nan, df['Age'].median())
df['Age'] = df['Age'].replace(np.nan, df['Age'].mode()[0])

# Alternate ways to impute null values with the median and mode
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Age'] = df['Age'].fillna(df['Age'].mode()[0])
```

In addition to these statistical methods of filling null values, we can also fill a null value with the previous value in the column, which is called **forward fill**, or with the next occurring value in the column, which is called **backward fill**.

```python
# Backward fill or forward fill can be used to impute from neighbouring values
# Backward fill: use the next valid value
df['Age'] = df['Age'].fillna(method='bfill')

# Forward fill: use the previous valid value
df['Age'] = df['Age'].fillna(method='ffill')
```
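On a toy series (hypothetical values) the difference between the two directions is easy to see; the `ffill()`/`bfill()` methods are shorthand for `fillna(method='ffill'/'bfill')`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a run of missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: propagate the previous valid value forward
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]

# Backward fill: pull the next valid value backward
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```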

Now let’s replace the null values in the **Age** column using **scikit-learn**.

We replace the null values with the mean using scikit-learn’s **SimpleImputer** class. We can change the strategy parameter to 'median' when we want to replace those null values with the median instead.

```python
# Replacing the null values in the Age column with the mean
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform
new_df['Age'] = imputer.fit_transform(new_df[['Age']])

# Checking for any null values
new_df['Age'].isnull().sum()
```

Hence the null values are replaced with the mean, and no null values remain in the **Age** column. We can also replace them with the median as follows.

```python
# Alternatively, we can replace null values with the median,
# the most frequent value, or a constant

# Replace with median
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
new_df['Age'] = imputer.fit_transform(new_df[['Age']])
```

Hence the null values in the numerical columns are identified and handled. Now let's see different methods of handling categorical data.

**Categorical Data**

The **Cabin** and **Embarked** columns are of **categorical** datatype. Now let’s see how to handle null values that are of categorical type. View the first few rows of these two columns to see which kinds of values they contain.

```python
# Handling missing values in categorical data
df[['Cabin', 'Embarked']].head()
```

Let's check the number of null values the **Cabin** and **Embarked** columns have in the Titanic dataset.

```python
# Number of missing values in both the columns
df[['Cabin', 'Embarked']].isnull().sum()
```

From the above exploration of the **Cabin** and **Embarked** columns, the Embarked column may contain one value that is much more frequent than the others, since its values are a small set of classes. So we replace its null values with the most frequent value among them.

```python
# Most frequent values in the Embarked column data
df['Embarked'].value_counts()
```

From **value_counts()**, the most frequent value in the **Embarked** column is `S`. So we replace all null values in the **Embarked** column with the most frequent value `S`.

```python
# Replacing the null values with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().index[0])
```

Let’s check the total number of null values in the **Embarked** column, and also look at value_counts() to check whether they were replaced with `S` or not.

```python
# Checking for null values in the Embarked column
df['Embarked'].isnull().sum()

# Value counts for Embarked column
df['Embarked'].value_counts()
```

We can observe that the number of null values in the **Embarked** column is now 0, and the count for `S` has increased by 2. So the null values in the **Embarked** column have been replaced by the most frequent value.
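The same trick can be sketched on a toy column (values hypothetical): `value_counts()` sorts by frequency, so `.index[0]` picks out the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation-style column with missing entries
port = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# value_counts() sorts by frequency, so .index[0] is the most frequent value
most_frequent = port.value_counts().index[0]
print(most_frequent)  # S

filled = port.fillna(most_frequent)
print(int(filled.isnull().sum()))  # 0
print(int((filled == 'S').sum()))  # 5
```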

Another way of handling null values in a categorical column is to add an **Unknown** class and replace the null values with that class.

```python
# Replacing null values with an Unknown class
df['Cabin'] = df['Cabin'].fillna('Unknown')

# Value counts for Cabin column
df['Cabin'].value_counts()

# Checking for null values in Cabin column
df['Cabin'].isnull().sum()
```

Null values in the **Cabin** column are replaced by an **Unknown** class.

Now let’s see how to handle missing values in categorical columns using **scikit-learn**.

Replacing the null values in the **Embarked** column with the most frequent value using scikit-learn’s **SimpleImputer** class.

```python
# Replacing null values in Embarked with the most frequent value
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
new_df['Embarked'] = imputer.fit_transform(new_df[['Embarked']])

# Value counts for Embarked column
new_df['Embarked'].value_counts()
```

Null values in the **Embarked** column are replaced by the most frequent value using **SimpleImputer**.

Using scikit-learn's **SimpleImputer** class, we can also replace null values with a custom value. Here we replace all null values in the **Cabin** column with an **Unknown** class.

```python
# Replacing null values in Cabin with an Unknown class
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='Unknown')
new_df['Cabin'] = imputer.fit_transform(new_df[['Cabin']])

# Checking for null values in the Cabin column
new_df['Cabin'].isnull().sum()

# Value counts for Cabin column
new_df['Cabin'].value_counts().head(5)
```

```python
# Number of non-null values in the modified dataset
df.info()
```

The null values in the categorical columns are identified and replaced.

Finally, we save the modified datasets to *.csv* files using the pandas **to_csv()** method.

```python
# Save to csv files
df.to_csv('datasets/data_titanic.csv')
new_df.to_csv('datasets/data_titanic-new.csv')
```

## Predict the Missing Values

Using the columns or features that don’t have missing values, we can predict the null values in other columns with machine learning algorithms, such as classification algorithms and decision trees for categorical data, and regression algorithms for numerical data. This method may result in better accuracy.
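As a minimal sketch of this idea on hypothetical data (the toy values and the use of plain least squares are my own illustration; in practice you would use a scikit-learn estimator and more features), we can fit a line on the rows where the target column is known and use it to fill the rows where it is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Age is missing in some rows, Fare is always known
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0, 50.0],
    'Age':  [20.0, 30.0, np.nan, 50.0, np.nan],
})

known = toy[toy['Age'].notnull()]
missing = toy[toy['Age'].isnull()]

# Fit Age ~ a * Fare + b on the known rows via least squares
X = np.c_[known['Fare'], np.ones(len(known))]
a, b = np.linalg.lstsq(X, known['Age'], rcond=None)[0]

# Predict Age for the rows where it is missing
toy.loc[toy['Age'].isnull(), 'Age'] = a * missing['Fare'] + b
print(toy['Age'].round(2).tolist())  # [20.0, 30.0, 40.0, 50.0, 60.0]
```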

Most of the datasets we come across in the real world contain null values, and those missing values have to be handled before training the model. In this story, we went through a number of ways in which null values can be replaced.

GitHub Link- https://github.com/itzzmeakhi/Medium/tree/master/Handling-MissingValues

Let me know if you have anything to ask. Do share the story if you liked it.