Missing Value Imputation (Basics to Advanced)

Nov 8, 2022

Introduction:

Hello folks, in this article I explain missing value imputation, from basic concepts to advanced concepts.

Topics to be covered:

Why is it important to handle missing data?
Different ways missing values are represented in the dataset
Possible ways missing values are generated
Types of missing values:

Why is it important to handle missing data?

Real-world data usually contains missing values. A value can be missing for different reasons, such as human error during data collection, data corruption, or some other specific cause.

Missing data can result in:

1. Decreased predictive power of your model.

2. Incompatibility with many Python libraries used in machine learning. Some algorithms in sklearn, such as Linear Regression, cannot handle missing data automatically, so passing them missing values can lead to errors.

3. Distortion of the dataset.
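
To make the problem concrete, here is a minimal sketch in plain Python. The sample values are made up for illustration. It shows how a single unhandled NaN silently turns an ordinary average into NaN, which is one way missing data distorts a result.

```python
import math

# Hypothetical measurements; one value is missing and stored as NaN.
values = [10.0, 12.0, float("nan"), 14.0]

# A plain average is poisoned by the single NaN.
naive_mean = sum(values) / len(values)

# Averaging only the observed values gives a usable number.
observed = [v for v in values if not math.isnan(v)]
observed_mean = sum(observed) / len(observed)

print(naive_mean)     # nan
print(observed_mean)  # 12.0
```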

Different ways missing values are represented in the dataset:

There are many ways missing values are represented in a dataset. Some of them are:

1. NaN

2. ?

3. -999 (or any run of 9s)

4. n/a

5. NA

6. —

etc.
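
Before imputing anything, it usually helps to map all of these placeholders to one missing marker. The sketch below does this with Python's standard `csv` module. The token set, the `normalize` helper, and the sample CSV text are assumptions for illustration, not part of any particular dataset.

```python
import csv
import io

# Assumed set of placeholder tokens (lowercased); extend it for your own data.
# "\u2014" is the long dash placeholder from the list above.
MISSING_TOKENS = {"", "nan", "?", "-999", "n/a", "na", "\u2014"}

def normalize(cell):
    """Return None for a missing-value placeholder, else the stripped text."""
    text = cell.strip()
    return None if text.lower() in MISSING_TOKENS else text

# Hypothetical CSV content that mixes several placeholders.
raw = "age,salary\n25,50000\n?,n/a\n40,-999\nNaN,62000\n"

rows = [
    {key: normalize(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

for row in rows:
    print(row)
```

Be careful with numeric sentinels such as -999: in some columns that number could be a legitimate value, so the token set should be chosen per dataset. If you load data with pandas, `read_csv` accepts an `na_values` argument that serves the same purpose.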

Possible ways missing values are generated:

1. People do not answer certain questions in a data collection survey.

For example, some may not be comfortable sharing information about their salary, drinking, or smoking habits, which results in missing values.

2. In some cases, data is gathered from past records rather than collected directly. Here data corruption is a major issue: when records are poorly maintained, some parts of the data get corrupted, giving rise to missing data.

3. Inaccuracies during the data collection process also contribute to missing data.

For example, in manual data entry, it is difficult to completely avoid human errors.

4. Equipment inconsistencies cause measurements to be missed, which results in missing values.

Types of missing values:

There are two types of missing values:

1. Unit Non-response

2. Item Non-response

1. Unit Non-response:

It refers to an entire row of data being missing.

For example, people who choose not to fill out the census.

But this type tends to be rare in most datasets.

Imputation methods:

Weighting-class adjustment (we will see this later)

2. Item Non-response:

It refers to some cells of a column being missing.

This is the type that occurs most often in the real world.

It is further divided into three types:

1. Missing Completely at Random (MCAR).

2. Missing At Random (MAR).

3. Not Missing at Random (NMAR).

1. Missing Completely at Random (MCAR):
Here the missing data does not follow any particular pattern; it is simply random. That means a value missing in one variable is not caused by the other variables. In other words, the missingness is unrelated to, or independent of, the remaining variables.

For example, during data collection a particular sample gets lost due to carelessness, or a person declines to answer, for reasons that have nothing to do with any other variable or question in the survey.
In other words, when we ask people questions picked at random from a predefined list (with no correlation among the questions) and some of them do not answer, this may lead to MCAR.
True MCAR is rare in practice. When data is MCAR, the statistical analysis is not biased, because the rows that remain are still a random sample of the data; the main cost is a smaller sample size.

Possible methods for MCAR:

1. Because the missing values are a random subset of the data, we can use deletion methods.

2. Because the missingness does not depend on the other variables, we can also use simple imputation such as the mean or median.

Why do we use the mean or median?

Because the mean and median do not use other variables: they are calculated only from the observed values of that particular variable. That matches MCAR, where the other variables carry no information about the missingness.
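
Both options can be sketched with Python's standard `statistics` module. The `ages` column below is hypothetical, and `None` is assumed to mark a missing entry. Note how the fill value is computed from this one column only.

```python
from statistics import mean, median

# Hypothetical column with missing entries stored as None.
ages = [22, 25, None, 31, 90, None]

observed = [a for a in ages if a is not None]

# Option 1: deletion, keep only the observed entries.
deleted = observed

# Option 2: simple imputation using only this column's own values.
mean_value = mean(observed)      # 42, pulled upward by the outlier 90
median_value = median(observed)  # 28.0, less affected by the outlier

mean_imputed = [a if a is not None else mean_value for a in ages]
median_imputed = [a if a is not None else median_value for a in ages]

print(mean_imputed)
print(median_imputed)
```

When a column has outliers, as in this made-up example, the median is usually the safer fill value.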

2. Missing At Random (MAR):

Here, whether a value is missing is related to (dependent on) the other observed variables, so the missingness is not completely random.

For example, let us consider a survey about time spent on the internet, which has a section about time spent on platforms like Netflix and Amazon Prime. It is observed that older people (above 45 years) are less likely to fill it in than younger people. This is an example of MAR. Here, the ‘Age’ variable influences whether the data will be missing or not.

In the real world, MAR occurs much more commonly than MCAR.
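
One rough way to spot a pattern like the survey example is to compare the missing rate across groups of an observed variable. The sketch below uses made-up survey rows and a hypothetical `missing_rate` helper. A large gap between the groups hints that the missingness depends on age, although it does not prove MAR.

```python
# Hypothetical survey rows: (age, hours_on_streaming); None means unanswered.
responses = [
    (22, 3.0), (29, 2.5), (35, None), (41, 1.5),
    (48, None), (53, None), (60, 0.5), (67, None),
]

def missing_rate(rows):
    """Fraction of rows whose answer is missing."""
    return sum(1 for _, hours in rows if hours is None) / len(rows)

# Split on the fully observed variable, using the 45-year cutoff from the text.
younger = [r for r in responses if r[0] <= 45]
older = [r for r in responses if r[0] > 45]

print(missing_rate(younger))  # 0.25
print(missing_rate(older))    # 0.75
```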

Possible methods for MAR:

Here we can use model-based missing value imputation, such as Linear Regression or a Decision Tree, because such an algorithm can use the other variables to predict the missing values.
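
As a minimal sketch of the idea, the code below fits a one-variable least-squares line by hand and uses it to fill a missing value. The data is hypothetical: `ages` is assumed to be fully observed and `hours` has one gap. In practice you would more likely use a library model, but the logic is the same.

```python
# Hypothetical data: age is fully observed, hours has one missing value.
ages = [20, 30, 40, 50]
hours = [5.0, 4.0, None, 2.0]

# Fit hours = intercept + slope * age on the complete rows only.
pairs = [(a, h) for a, h in zip(ages, hours) if h is not None]
mean_a = sum(a for a, _ in pairs) / len(pairs)
mean_h = sum(h for _, h in pairs) / len(pairs)
slope = (
    sum((a - mean_a) * (h - mean_h) for a, h in pairs)
    / sum((a - mean_a) ** 2 for a, _ in pairs)
)
intercept = mean_h - slope * mean_a

# Replace each missing value with the model's prediction for that row.
imputed = [
    h if h is not None else intercept + slope * a
    for a, h in zip(ages, hours)
]

print(imputed)  # the filled-in third value is close to 3.0
```

The observed points here lie on a straight line, so the prediction lands almost exactly on it. Real data is noisier, and the imputed values will then be smoother than the true ones.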

3. Not Missing at Random (NMAR):

This is a serious and tricky situation. Here, whether a value is missing depends on the missing value itself.

For example, let’s say the purpose of the survey is to measure overuse of, or addiction to, social media. If people who use social media excessively intentionally do not fill in the survey, then we have a case of NMAR.

So this will most probably lead to bias in the results.

Possible methods for NMAR:

1. The usual methods, like dropping rows/columns or simple imputation, will not work. Handling this case requires in-depth knowledge of the domain.
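
A small made-up example shows why. Suppose we knew everyone's true usage (in reality we would not) and the heaviest users skipped the survey. Both deletion and mean imputation then give the same estimate, which is too low. The numbers and the 8-hour cutoff below are assumptions for illustration.

```python
from statistics import mean

# Hypothetical true daily hours on social media for five people.
true_hours = [1.0, 2.0, 3.0, 8.0, 9.0]

# NMAR: the heaviest users (8 hours or more) do not answer.
observed = [h if h < 8 else None for h in true_hours]

# Deletion: average only the people who answered.
answered = [h for h in observed if h is not None]
deletion_estimate = mean(answered)

# Mean imputation: fill the gaps with that same average.
mean_imputed = [h if h is not None else deletion_estimate for h in observed]
imputation_estimate = mean(mean_imputed)

print(mean(true_hours))     # 4.6
print(deletion_estimate)    # 2.0
print(imputation_estimate)  # 2.0
```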

OK folks, I end this introductory article here. In the next article I will explain techniques to deal with missing values. Please provide feedback, and if anything is wrong, please correct me. Thank you.

Part 2 link: https://medium.com/@banarajay/missing-value-imputation-basics-to-advance-part-2-3eefededa19