Missing Value Treatment

Ritesh Jain · Published in Analytics Vidhya · Jan 25, 2020

This article covers the following aspects of missing values:

1) What are missing values

2) Understanding the importance of missing value treatment

3) Ways to treat missing values in Python

The data collection process is never foolproof, especially when physical forms are involved. At times we forget, or intentionally choose not, to collect a few of the attributes/fields, perhaps because of time constraints or because we feel they are not very important. Attributes that are not collected on a regular basis therefore tend to have missing values or unavailable data.

Example: Out of 100 people who filled out a survey, only 75 gave their marital status. So marital status is unavailable for the remaining 25 people.

In order to analyse the data, we either skip the records that have missing values or we impute them.

Missing value treatment is one of the most important steps, not only in model building but in any data analysis used for decision making.

If missing values are not properly interpreted, they can lead to poor decisions and severe loss of business. For example, suppose ‘Marital Status’ is critical for sending personalized communication to customers, and it is unavailable for 20% of the records. We must either leave those customers out or try to predict the value of the variable; if the prediction method is not good enough, we may end up sending incorrect communication to customers. So correct interpretation really matters if the resulting decisions are to work as expected.

With the advent of data analytics, many organizations have started paying attention to data availability, and it has become a critical part of their data governance systems.

Organizations that have just started their operations have an edge over others, as they are well placed to take on this challenge from day one.

Data unavailability is not always a problem. Some fields in an application form or survey are not mandatory because the organization feels the attribute will not add much value and can be ignored. This hypothesis can easily be verified by checking the trend of the variable (using the available data) against the required outcome. If there is no trend, data unavailability for such variables is not really a problem. But if a trend is visible, the business should take the wise decision of making it a mandatory field and start collecting it.
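As a quick illustration, here is a minimal sketch of such a trend check in pandas. The DataFrame and column names (df, Marital_Status, Outcome) are illustrative, not from a real dataset:

# Hedged sketch: check whether a non-mandatory field shows any trend
# against the outcome, using only the rows where it is available.
import pandas as pd

df = pd.DataFrame({
    'Marital_Status': ['Single', 'Married', None, 'Married', 'Single', None],
    'Outcome':        ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
})

# Row-normalized crosstab over the available values: if the outcome mix
# is flat across categories, the field adds little and can stay optional.
pd.crosstab(df['Marital_Status'], df['Outcome'], normalize='index')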

Sometimes an extra field can help in developing personalized campaigns for customers, thereby improving the intensity of engagement or leading to better conversion rates.

There are many methods for treating missing values; which one to use depends on how much time you want to spend on it. Complex methods take time not only the first time around but on every run, depending on how frequently the analysis or model is executed, so keep this in mind while finalizing the methodology. Here are various techniques that can be explored. If I have missed anything, do write about it in the comments section.

  1. Dropping records with at least one missing value
  2. Dropping columns that are least significant and have a majority of missing values
  3. Replacing missing values with the mean, median or mode, OR treating missing values as a separate category
  4. Correlation: depending on the strength of correlation, missing values can be imputed
  5. Predicting missing values using a regression technique

Before we get started with the above techniques, let me briefly describe the dataset used. It is a housing loan dataset with the customer details a company collects in order to decide whether a loan should be given or not. It is taken from an Analytics Vidhya practice problem and can be found here: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

# Importing the dataset
import pandas as pd

loan = pd.read_csv('D:\\Hackathon\\Loan Prediction\\train.csv')
loan.head(6)

Let us get started with the above techniques in detail, one by one:

  1. Removing records with at least one missing value: This simply means removing every record that has at least one missing value.

# Drop rows with even one missing value
loan_drop = loan.dropna()
loan_drop.shape

Output - (480, 13)

When we drop records with even one missing value, the number of records in the dataset is reduced by 21.8% (from 614 to 480). This amounts to a significant information loss of 21.8%.

This is the easiest way to get rid of missing values, but it comes with high information loss, so confidence in decisions taken after removing these records will not be high.
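A middle ground worth knowing (a sketch, not part of the original list): dropna accepts a thresh argument, so instead of dropping a row for a single gap you can keep every row that still has a minimum number of non-missing values. The threshold of 12 below is an illustrative choice for this 13-column dataset, not a rule:

# Keep rows with at least 12 non-missing values out of 13 columns,
# i.e. drop only rows that are missing two or more fields.
loan_thresh = loan.dropna(thresh=12)
loan_thresh.shape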

2. Drop columns that are least significant and have a majority of missing values: With some prior business/domain knowledge you can easily identify the least significant columns and drop them if they have a high percentage of missing values.

# Getting the number and percentage of missing values in each column
loan_missing_values = pd.DataFrame(loan.isnull().sum())
loan_missing_values['var'] = loan_missing_values.index
loan_missing_values = loan_missing_values.rename(columns={0: 'missing_nos'})
loan_missing_values['Percentage_Missing'] = loan_missing_values['missing_nos'] * 100 / len(loan)
loan_missing_values

Output - (a table with the missing count and percentage for each column)

## Assuming Self_Employed is least significant or insignificant, deleting the column
loan_drop_Self_Employed = loan.drop("Self_Employed", axis=1)
# Checking the number of records left after dropping rows with missing values
loan_drop_Self_Employed.dropna().shape

Output - (505, 12)

By removing the column of least significance before dropping rows, the number of records is reduced by only 17.8% (from 614 to 505), so the information loss is 17.8%. This is still not the best way to treat missing values, but it is better than the first technique.
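If you would rather pick such columns programmatically, here is a minimal sketch that drops every column whose missing percentage crosses a cutoff. The 40% cutoff is an arbitrary illustration (no column in this particular dataset crosses it), and domain knowledge should still drive the final call:

# Hedged sketch: drop columns whose missing percentage exceeds a cutoff.
missing_pct = loan.isnull().mean() * 100
cols_to_drop = missing_pct[missing_pct > 40].index  # empty for this dataset
loan_reduced = loan.drop(columns=cols_to_drop)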

3. Replace missing values with mean, median or mode, OR consider missing values as a different category: The first step is to calculate the mean, median or mode of the available (non-missing) values of the variable. The second and final step is to replace the missing values with the calculated mean, median or mode. Another technique is to treat missing values as a separate category altogether. For example, if the marital status variable has 10% missing values, we can hard-code those 10% of records as ‘Others’.

## Considering LoanAmount as the variable
## First checking the mean of the LoanAmount variable
loan["LoanAmount"].mean()

Output - 146.41216216216216

# Replace missing values in LoanAmount with the mean of the column
# (a dict restricts fillna to that column instead of filling every column)
loan_amount_mean = loan.fillna({"LoanAmount": loan["LoanAmount"].mean()})
# Checking the first few rows of the table
loan_amount_mean[["Loan_ID", "LoanAmount"]].head()

Output - (the first rows of Loan_ID and LoanAmount)

Similarly, we can replace missing values with the median or the mode.

# Replace missing values with the median of the column
loan_amount_median = loan.fillna({"LoanAmount": loan["LoanAmount"].median()})
# Replace missing values with the mode of the column
loan_amount_mode = loan.fillna({"LoanAmount": loan["LoanAmount"].mode()[0]})
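For the ‘different category’ approach, here is a minimal sketch on a categorical column. Using Married as the example column and ‘Others’ as the label are my illustrative choices:

# Treat missing values of a categorical variable as their own category
loan_married_cat = loan.copy()
loan_married_cat['Married'] = loan_married_cat['Married'].fillna('Others')
loan_married_cat['Married'].value_counts()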

This is the best of the three techniques we have discussed so far. There is no information loss, as all records are kept for the data analysis or model building exercise. The only drawback is that all missing values are treated with the same value. For example, the loan amount required by a 45-year-old customer will differ from that of a 25-year-old; the 45-year-old may ask for a smaller loan amount than the 25-year-old.
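One common refinement that addresses this drawback (my addition, not one of the five techniques above) is group-wise imputation: compute the mean within segments so that each segment fills in its own typical value. This dataset has no age column, so the sketch below groups by Education purely for illustration:

# Hedged sketch: group-wise mean imputation. Grouping by 'Education' is
# illustrative; with an age field you would bucket ages and group by that.
group_means = loan.groupby('Education')['LoanAmount'].transform('mean')
loan_group_mean = loan.copy()
loan_group_mean['LoanAmount'] = loan_group_mean['LoanAmount'].fillna(group_means)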

4. Correlation: Depending on the strength of correlation, missing values can be imputed. Correlation ranges from -1 to 1 depending on its direction; -1 and 1 indicate perfect correlation, where one variable can be replaced by the other because each explains the other completely. Here is a common interpretation of correlation values (judged on the absolute value):

1.0–0.8 → Very strong
0.8–0.6 → Strong
0.6–0.4 → Moderate
0.4–0.2 → Weak
0.2–0.0 → Very weak
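For convenience, the same bands expressed as a small helper function (my own sketch, not a standard library function; strength is judged on the absolute value, so -0.7 is also ‘Strong’):

def correlation_strength(r):
    """Map a correlation coefficient to the strength bands above."""
    r = abs(r)  # the sign only indicates direction
    if r >= 0.8:
        return 'Very strong'
    if r >= 0.6:
        return 'Strong'
    if r >= 0.4:
        return 'Moderate'
    if r >= 0.2:
        return 'Weak'
    return 'Very weak'

correlation_strength(0.57)  # 'Moderate'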

## Importing required libraries
import seaborn as sns
import matplotlib.pyplot as plt

## Correlation matrix (numeric_only keeps just the numeric columns;
## the argument is required in newer pandas versions)
corrmat = loan.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(corrmat, ax=ax, cmap="YlGnBu", linewidths=0.01)

From the above plot it is very clear that ApplicantIncome and LoanAmount are correlated with each other. Let us print the correlation table to understand the degree of correlation:

loan.corr(method='pearson', numeric_only=True)

The correlation is 0.57: moderate, but at the higher end of the moderate band. We can also draw a scatter plot, which is a visually appealing and better way to represent the correlation between two continuous variables.

# Scatter plot between LoanAmount and ApplicantIncome
sns.set(color_codes=True)
sns.regplot(x="LoanAmount", y="ApplicantIncome", data=loan, color='brown');
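To make the imputation concrete, here is a minimal sketch of filling LoanAmount from its correlated partner ApplicantIncome with a simple straight-line fit on the complete rows. This is my illustration of the correlation-based idea, not the article’s prescribed method:

# Hedged sketch: impute LoanAmount from ApplicantIncome via a linear fit
import numpy as np

complete = loan.dropna(subset=['LoanAmount', 'ApplicantIncome'])
slope, intercept = np.polyfit(complete['ApplicantIncome'], complete['LoanAmount'], deg=1)

loan_corr_fill = loan.copy()
mask = loan_corr_fill['LoanAmount'].isnull()
loan_corr_fill.loc[mask, 'LoanAmount'] = (
    slope * loan_corr_fill.loc[mask, 'ApplicantIncome'] + intercept
)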

This is by far the best of all the techniques we have discussed, though it too has drawbacks. You will be in a tricky situation if the dataset has no highly correlated variables, and there will also be instances where the values of both correlated variables are missing.

But this can be overcome by running a regression model.

5. Predicting missing values using a regression technique: This is the best way to impute missing values, but also the most complex and time-consuming, so it should be chosen based on the time available and the problem at hand. Here you statistically predict the missing values, which can in turn be used for predicting other variables. I won’t discuss the Python code here, as it needs a fresh article, which should come by the end of Feb or Mar 2020.
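Until then, here is only a minimal sketch of the idea using scikit-learn; the choice of LinearRegression and of the predictor columns is my assumption, not the approach of the follow-up article:

# Hedged sketch: predict missing LoanAmount values from numeric predictors
from sklearn.linear_model import LinearRegression

predictors = ['ApplicantIncome', 'CoapplicantIncome']  # assumed complete here
train = loan[loan['LoanAmount'].notnull()]
to_fill = loan[loan['LoanAmount'].isnull()]

model = LinearRegression().fit(train[predictors], train['LoanAmount'])
loan_reg = loan.copy()
loan_reg.loc[loan_reg['LoanAmount'].isnull(), 'LoanAmount'] = model.predict(to_fill[predictors])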

Recommendation:

Depending on the time at hand and the complexity you can handle, evaluate these techniques and pick the one you are comfortable with. One important aspect you should always consider is applying business/domain knowledge along with these techniques. Sometimes the mean may look better than the median, but once you apply domain knowledge you will be in a better position to decide which one to use.
