Missing Value Imputation (Basics to Advanced) Part 2

Banarajay
5 min read · Nov 9, 2022


Introduction:

Hello folks, in the previous article we saw an introduction to the basic concepts of missing values. In this article we will look at how to detect and visualize missing values, and at the first family of techniques for handling them.

Part 1 link: https://medium.com/@banarajay/missing-value-imputation-basics-to-advance-595c92e35e94

Topics to be covered:

  1. Check for missing values
  2. Visualization techniques for missing values
  3. Different methods/techniques to handle the missing values

Check for missing values:

  1. The isnull() function checks whether a dataset contains missing values. Chaining sum() after isnull() gives the number of missing values in each column.
  2. We can also use the isna() function; in pandas it is an alias of isnull().
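As a minimal sketch of this check, assuming the data sits in a pandas DataFrame (the frame below is made up for illustration and is not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for a real dataset
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() returns a boolean frame; sum() counts the True values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# isna() gives the same counts as isnull()
same_counts = bool((df.isna().sum() == missing_per_column).all())
print(same_counts)
```

Both None and NaN are counted as missing here, so object columns such as Cabin are handled the same way as numeric ones.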

Visualization techniques for missing values:

To visualize missing values, we use the “missingno” Python package.

With missingno, we can view the missing values in the form of

1. heat maps

2. bar charts and

3. matrices.

These plots also give hints about which category the missing values fall into: MCAR, MAR, or NMAR.

For example, let's take the Titanic dataset and use the missingno bar, heatmap, and matrix plots to visualize its missing data.
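The three plots can be produced roughly as follows. This is only a sketch: the frame is a made-up stand-in that borrows Titanic column names (the values are invented), plot_missingness is a hypothetical helper name, and the missingno import is placed inside the helper so the rest of the snippet still runs when the package is not installed.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (real column names, invented values)
titanic_like = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Cabin": [None, "C85", None, "C123", None, "E46"],
    "Embarked": ["S", "C", "S", None, "Q", "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
})


def plot_missingness(df):
    """Draw the three missingno plots discussed in this article."""
    # Imported lazily: both packages are third-party and optional here
    import matplotlib.pyplot as plt
    import missingno as msno

    msno.matrix(df)   # one row per record, blank where a value is missing
    msno.heatmap(df)  # correlation of missingness between columns
    msno.bar(df)      # number of non-missing values per column
    plt.show()


# Columns that contain at least one missing value
columns_with_gaps = [
    col for col in titanic_like.columns if titanic_like[col].isnull().any()
]
print(columns_with_gaps)
```

Calling plot_missingness(titanic_like) opens the three figures; with the real dataset you would pass the DataFrame you loaded instead.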

Matrix Plot and Heatmap:

The matrix plot draws one row per record and leaves a blank line wherever a value is missing.

Matrix plot

The matrix shows how the missing data is distributed through the dataset: whether it is localized or evenly spread, and whether any pattern occurs.

OK, let's come to the Titanic dataset. The “Embarked” column has just two missing values, shown as two blank lines in the matrix, and they do not follow any pattern. They were probably lost due to human error (such errors occur only in small numbers), so they can be classified as Missing Completely at Random (MCAR).

The Age and Cabin columns could possibly be MAR, because the number of missing values in these two variables is quite high, so the missingness is probably related to other variables.

But the question that arises is: “How can we confirm that Age and Cabin are of MAR type and Embarked is of MCAR type?”

For supporting evidence we use the heatmap.

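The values in the heatmap are correlations between the missingness of the variables, ranging from -1 to 1. A value close to 1 means two columns tend to be missing together, a value close to -1 means that when one is missing the other tends to be present, and a value near 0 means the missingness of one column says little about the other. We read a strong correlation as a hint that the missing values in one variable are related to another variable (MAR) and a weak correlation as a hint of MCAR. Keep in mind that this is evidence rather than proof: correlation does not show that one variable causes the missingness in the other.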

Heatmap Plot
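To make the heatmap values concrete, here is a sketch of how this kind of missingness correlation can be computed with pandas alone. It is my understanding that missingno's heatmap is built on the same idea, but treat that as an assumption; the data below is invented.

```python
import numpy as np
import pandas as pd

# Invented data: Age and Cabin are often missing in the same rows
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", None, "E46", None, None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
})

# True where a value is missing
is_missing = df.isnull()

# Columns that are never (or always) missing have no variance,
# so a correlation cannot be computed for them; drop them first
is_missing = is_missing.loc[:, is_missing.nunique() > 1]

# Pearson correlation between the missingness indicators
nullity_corr = is_missing.corr()
print(nullity_corr.round(2))
```

Here Fare drops out because it has no missing values, and the Age/Cabin entry is positive because the two columns are mostly missing in the same rows.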

Bar plot:

The bar plot shows how many non-missing values each column has.
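The numbers behind such a bar plot can also be read directly with pandas. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# count() returns the number of non-missing values per column,
# which is the height of each bar in the plot
non_missing = df.count()
print(non_missing)
```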

Different methods to handle the missing values:

1. Deletion methods

2. Imputation methods

1. Deletion methods:

As the name suggests, these methods delete the row or column that contains the missing value.

There are three approaches to deletion:

1. Listwise deletion or Complete Case Analysis

2. Pairwise deletion or Available Case Analysis

3. Dropping Features

1. Listwise deletion or Complete Case Analysis:

Simply delete every row in which one or more values are missing.
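A minimal sketch of listwise deletion, assuming the data is in a pandas DataFrame (the frame is made up for illustration). With its default arguments, dropna() removes every row that has at least one missing value:

```python
import numpy as np
import pandas as pd

# Invented data: rows 1 and 3 each have a missing value
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Listwise deletion: keep only the complete rows
complete_cases = df.dropna()
print(complete_cases)
print(f"Rows before: {len(df)}, rows after: {len(complete_cases)}")
```

dropna() returns a new DataFrame and leaves df untouched, which makes it easy to compare the data before and after deletion.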


2. Pairwise deletion or Available Case Analysis:

Delete only the rows that have missing values in the columns used for the analysis. Suppose we have three variables but use only the first two in the analysis. We drop a row if it has a missing value in either of the first two features, and we keep a row whose only missing value is in the third feature, because that feature is not part of the analysis.

If we use all three features, this becomes the same as Complete Case Analysis.

This method is only recommended if the missing data are MCAR.
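The three-variable example can be sketched with the subset argument of dropna(). The column names A, B, and C and their values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Invented data with three features
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, np.nan],
    "C": [np.nan, 200.0, 300.0, np.nan],
})

# The analysis uses only A and B, so only gaps in A or B cost us a row
available_cases = df.dropna(subset=["A", "B"])
print(available_cases)

# Using all three features reduces to Complete Case Analysis
complete_cases = df.dropna(subset=["A", "B", "C"])
print(complete_cases)
```

Row 0 survives the first call even though C is missing there, because C is not part of the analysis.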

3. Dropping Features:

Drop an entire column if its fraction of missing values is greater than a threshold (a user-supplied value between 0 and 1).
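A sketch of threshold-based column dropping. The threshold of 0.5 and the data are assumptions chosen only for illustration; pick a value that suits your dataset:

```python
import numpy as np
import pandas as pd

# Invented data: Cabin is missing in 3 of 4 rows
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

threshold = 0.5  # assumed value: drop columns with more than 50% missing

# Fraction of missing values per column (between 0 and 1)
missing_fraction = df.isnull().mean()

# Columns whose missing fraction exceeds the threshold
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = df.drop(columns=cols_to_drop)
print(cols_to_drop)
print(reduced.columns.tolist())
```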

When to use these deletion methods:

1. Large dataset:

When the dataset is large, we can afford to delete the rows with missing values. Some information is still lost with each deleted row, but the loss is small relative to the size of the dataset. In a small dataset the same deletion costs proportionally much more.

2. When the data is MCAR (Missing Completely At Random). In that case the rows with missing values are effectively a random sample of the dataset, so deleting them reduces the sample size but does not bias the result. With MAR or NMAR data the missingness is related to other values, so the deleted rows differ systematically from the remaining ones and deletion can introduce bias.

3. They can be used for numerical, categorical, and mixed data.

4. They are suitable when missing data makes up no more than about 5% to 6% of the dataset. With a higher percentage (greater than 6%), deletion removes many data points, which means lost information and a much smaller dataset.
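Before deleting, it helps to measure how much data would be lost. The sketch below assumes the percentage refers to rows with at least one missing value (the text could also be read as the share of missing cells), and MAX_ROW_LOSS_PCT is a hypothetical name for the rule-of-thumb limit above:

```python
import numpy as np
import pandas as pd

MAX_ROW_LOSS_PCT = 6.0  # hypothetical constant for the 5% to 6% rule of thumb

# Invented data: 20 rows, exactly one of them has a missing value
df = pd.DataFrame({
    "x": np.arange(20, dtype=float),
    "y": np.arange(20, dtype=float),
})
df.loc[3, "y"] = np.nan

# Percentage of rows that listwise deletion would remove
pct_rows_lost = df.isnull().any(axis=1).mean() * 100

deletion_ok = bool(pct_rows_lost <= MAX_ROW_LOSS_PCT)
print(f"{pct_rows_lost:.1f}% of rows would be dropped; deletion ok: {deletion_ok}")
```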

For which types of missing data can we use these methods?

They are safest for the MCAR type. With MAR or NMAR data, deletion can bias the results.

Advantages:

1. Simple

2. Fast

Disadvantages:

1. They do not work well for small datasets. If a small dataset has missing values and we apply a deletion method, only a handful of data points remain. A model trained on so few data points cannot learn the underlying pattern reliably and ends up fitting the data poorly.

2. They lose information. A row may be very important for the analysis, but it is deleted anyway because one of its values is missing.

3. They can create a bias in the dataset. In classification, for example, if a large share of one particular class is deleted, the results become biased.
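One way to spot this kind of bias is to compare the class distribution before and after deletion. A sketch on invented data, where the hypothetical label column holds the class and most of the missing values fall in class "b":

```python
import numpy as np
import pandas as pd

# Invented data: missing values are concentrated in class "b"
df = pd.DataFrame({
    "label": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0],
})

# Class shares before and after listwise deletion
before = df["label"].value_counts(normalize=True)
after = df.dropna()["label"].value_counts(normalize=True)

print(before)
print(after)
```

The classes start out balanced, but after deletion class "a" makes up most of the remaining rows, which is exactly the bias described above.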

OK folks, in the next article we will look at the imputation methods in detail. Please share your feedback, and if anything here is wrong, please correct me. Thank you.
