Missing Value Imputation (Basics to Advance) Part 3

Banarajay
11 min read · Nov 11, 2022


Introduction:

Hello folks! In this article I continue with the remaining missing value imputation techniques.

Part 1 link: https://medium.com/@banarajay/missing-value-imputation-basics-to-advance-595c92e35e94

Part 2 link: https://medium.com/@banarajay/missing-value-imputation-basics-to-advance-part-2-3eefededa19

Please read the above two articles to get a better understanding.

Imputation:

Imputation is a technique for replacing missing data with some substitute value, so as to retain most of the data/information of the dataset.

Through imputation we can avoid the disadvantages of the deletion methods, that means:

1. We avoid the reduction in the size of the dataset.

2. We also avoid the loss of information.

There are two types of imputation methods:

1. Single imputation

2. Multiple imputation

1. Single Imputation:

In single imputation, a single imputed value is generated for each missing observation.

There are different types of single imputation, based on the type of variable:

1. For imputing the Numerical variable:

Existing value based methods:

1. Minimum imputation

2. Maximum imputation

Statistical value based methods:

1. Mean imputation

2. Median imputation

3. End of tail distribution

4. Interpolation

Model based methods:

1. Regression imputation

2. For imputing the Categorical variable:

1. Use a classification algorithm to impute the categorical values.

3. For imputing both categorical and numerical variables at the same time:

Existing value based methods:

1. Previous value imputation

2. Next value imputation

3. Fixed value imputation (arbitrary value or constant imputation)

Statistical value based methods:

1. Random sampling imputation

2. Most frequent value imputation

Model based methods:

1. Regression and classification algorithms

2. KNN

4. Other methods:

1. Adding a "Missing" indicator

1. For imputing the Numerical variable:

1. Existing value based methods:

As the name suggests, these methods use an existing value of the variable to fill in the missing places.

1. Minimum imputation:

As the name suggests, it uses the minimum value of the variable to do the imputation.

2. Maximum imputation:

As the name suggests, it uses the maximum value of the variable to do the imputation.
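
As a minimal sketch (assuming a pandas DataFrame with a hypothetical numeric column "age"), both methods are one-liners with fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical example column: "age" with two missing entries.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan, 58.0]})

# Minimum imputation: fill missing entries with the observed minimum (25).
df["age_min"] = df["age"].fillna(df["age"].min())

# Maximum imputation: fill missing entries with the observed maximum (58).
df["age_max"] = df["age"].fillna(df["age"].max())

print(df)
```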

Advantages (for both maximum and minimum):

1. Easy to implement.

2. Fast way of obtaining a complete dataset.

Disadvantages (for both maximum and minimum):

1. It distorts the original variable distribution and changes its variance.

2. Due to the change in variance, it distorts the covariance with the remaining variables.

3. So the higher the percentage of missing values, the higher the distortion.

4. It may create artificial outliers.

2. Statistical value based methods:

1. Mean imputation:

As the name suggests, it uses the mean value of the variable to do the imputation.

Assumptions:

1. Data is MAR (my intuition is MCAR and NMAR).

2. Data should follow a normal distribution, because the mean is a valid summary for a normal distribution; for a skewed distribution the mean is not valid.

Advantages:

1. Easy to implement.

2. Fast way of obtaining a complete dataset.

3. Prevents loss of information.

Disadvantages:

1. It distorts the original variable distribution and changes its variance.

2. Due to the change in variance, it distorts the covariance with the remaining variables.

3. So the higher the percentage of missing values, the higher the distortion.

2. Median imputation:

As the name suggests, it uses the median value of the variable to do the imputation.

Assumptions:

1. Data is MAR (my intuition is MCAR and NMAR).

2. If the data follow a skewed distribution we can use the median, because the mean is only valid for a normal distribution; for a skewed distribution the median is the better summary.

Advantages:

1. Easy to implement.

2. Fast way of obtaining a complete dataset.

3. Prevents loss of information.

Disadvantages:

1. It distorts the original variable distribution and changes its variance.

2. Due to the change in variance, it distorts the covariance with the remaining variables.

3. So the higher the percentage of missing values, the higher the distortion.
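
A minimal sketch of both strategies with scikit-learn's SimpleImputer (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[20.0], [np.nan], [35.0], [40.0], [np.nan]])

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

print(mean_imputer.fit_transform(X).ravel())    # nans -> 31.67 (mean of 20, 35, 40)
print(median_imputer.fit_transform(X).ravel())  # nans -> 35.0 (median of 20, 35, 40)
```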

3. End of tail distribution:

As the name suggests, in the end of tail distribution technique missing values are replaced by a value that lies at the end of the variable's distribution.

It is equivalent to arbitrary value imputation; the only difference is that in arbitrary value imputation we have to supply the value ourselves, whereas end of tail imputation automatically selects an arbitrary value at the end of the variable distribution.

How to get the tail value?

There are two ways of getting the tail value of the variable:

1. If the variable follows a normal distribution:

tail value = upper limit = mean of variable + 3 × (std of variable)

2. If the variable follows a skewed distribution, use the IQR proximity rule:

tail value = upper limit = 75th percentile + 1.5 × IQR

where the interquartile range IQR = 75th percentile − 25th percentile.

Similarly,

For a normal distribution: lower limit = mean of variable − 3 × (std of variable)

For a skewed distribution: lower limit = 25th percentile − 1.5 × IQR

These are the lower tail values for the respective distributions.

But in end of tail imputation we are usually interested in the higher tail values, which is why we take the upper limit values.
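
A minimal pandas sketch of the two rules (the series values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 3.5, np.nan, 4.1, 5.0, np.nan, 3.2])

# Normal-distribution rule: upper tail = mean + 3 * std.
tail_normal = s.mean() + 3 * s.std()

# IQR proximity rule for skewed data: upper tail = Q3 + 1.5 * IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
tail_skewed = q3 + 1.5 * (q3 - q1)

# Impute with whichever rule matches the variable's distribution.
print(s.fillna(tail_normal))
print(s.fillna(tail_skewed))
```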

Assumption:

1. Data is NMAR (my intuition is MCAR and NMAR).

Advantages:

1. Easy to implement.

2. Fast way of obtaining a complete dataset.

3. Prevents loss of information.

Disadvantages:

1. It distorts the original variable distribution and changes its variance.

2. Due to the change in variance, it distorts the covariance with the remaining variables.

3. So the higher the percentage of missing values, the higher the distortion.

4. Interpolation:

There are many types of interpolation; some of them are:

1. Linear interpolation.

2. Polynomial interpolation.

Linear interpolation draws a line between the previous and next data points to find the missing value.
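
A minimal sketch with pandas' Series.interpolate, which supports both methods (the series values are made up; the polynomial method needs scipy installed):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 9.0])

# Linear interpolation: a straight line between the previous and next points.
print(s.interpolate(method="linear").tolist())  # [1.0, 2.0, 3.0, 6.0, 9.0]

# Polynomial interpolation of order 2 (requires scipy).
print(s.interpolate(method="polynomial", order=2).tolist())
```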

3. Model based methods:

1. Regression imputation:

As the name suggests, it uses a regression model to impute the missing values.

Steps (a code sketch follows the steps):

1. First we take the first variable with missing values as the dependent variable.

2. The remaining variables are the independent variables. If there are missing values in an independent variable, we temporarily fill them using the mean, median, or any other imputation method (but only temporarily).

3. Then remove the rows where the dependent variable is missing; the resulting dataframe without missing rows is considered the "train data".

4. The removed rows (those where the dependent variable is missing) form the "test data".

5. Build a regression model (any regression model such as LinearRegression, DecisionTreeRegressor, etc.) using the train data.

6. Use the trained model to predict/impute the missing values of the dependent variable by feeding it the test data.

7. Finally, reset the temporarily filled values in the independent variables back to missing.

8. Take the next variable with missing values and repeat the same process.
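
These steps are essentially what scikit-learn's IterativeImputer does (temporary fill via initial_strategy, then round-robin regression of each incomplete variable on the others). A minimal sketch on made-up data:

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# initial_strategy="mean" is the temporary fill from step 2; each variable
# with missing values is then regressed on the others, round-robin.
imputer = IterativeImputer(initial_strategy="mean", max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```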

Assumptions:

1. Data is MAR.

2. Plus the assumptions of the model you are using.

Advantages:

1. It distorts the original variable distribution less, resulting in less change in variance.

2. Due to the smaller change in variance, it also distorts the covariance with the remaining variables less.

3. So even with a higher percentage of missing values, the distortion stays lower than with simple imputation.

Disadvantages:

1. It takes more time.

2. For imputing the Categorical variable:

1. Use a classification algorithm to impute the categorical values.

We can use any classification algorithm to impute the missing values in a categorical variable.

The same procedure as in regression imputation is followed, with a classifier in place of the regressor.
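
A minimal sketch of these steps, using a hypothetical DataFrame whose categorical column "city" is imputed with a RandomForestClassifier (all names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "income": [40, 55, 32, 61, 48],
    "age":    [25, 38, 29, 45, 33],
    "city":   ["A", "B", np.nan, "B", np.nan],
})

missing = df["city"].isna()

# Train on the rows where "city" is observed, predict where it is missing.
clf = RandomForestClassifier(random_state=0)
clf.fit(df.loc[~missing, ["income", "age"]], df.loc[~missing, "city"])
df.loc[missing, "city"] = clf.predict(df.loc[missing, ["income", "age"]])
print(df)
```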

3. For imputing both categorical and numerical variables at the same time:

1. Existing value based methods:

1. Previous value imputation:

As the name suggests, the missing value is imputed using the previous value in the variable.

It is mainly used for time series problems.

2. Next value imputation:

As the name suggests, the missing value is imputed using the next value in the variable.

It is also mainly used for time series problems.
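
With pandas, both are one-liners; a minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan])

print(s.ffill().tolist())  # previous value imputation: [10, 10, 10, 14, 14]
print(s.bfill().tolist())  # next value imputation:     [10, 14, 14, 14, nan]
```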

3. Fixed value imputation (arbitrary value imputation):

As the name suggests, arbitrary value imputation consists of replacing the missing values within the variable with a given arbitrary value.

The arbitrary value can be any value, not necessarily within the existing values of the variable, although we can also choose a value from within the variable.

Eg:

We can choose any value like -11, -99, 999, 0, 56, 678, or an existing value of that variable.
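
A minimal sketch using scikit-learn's SimpleImputer with strategy="constant" (the fill values here are arbitrary examples):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0], [np.nan], [7.0]])
X_cat = np.array([["red"], [np.nan], ["blue"]], dtype=object)

num_imputer = SimpleImputer(strategy="constant", fill_value=-99)
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

print(num_imputer.fit_transform(X_num).ravel())  # [1., -99., 7.]
print(cat_imputer.fit_transform(X_cat).ravel())  # ['red' 'missing' 'blue']
```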

Assumption:

Data is NMAR or MCAR.

Advantages:

1. Easy to implement.

2. It is a fast way to obtain a complete dataset.

3. Prevents data loss.

4. Works well with a small dataset.

Disadvantages:

1. It adds a new category/value to the model, which can result in poor performance.

2. It distorts the original variable distribution and changes its variance.

3. Due to the change in variance, it distorts the covariance with the remaining variables.

4. So the higher the percentage of missing values, the higher the distortion.

5. It may create outliers.

2. Statistical value based methods:

1. Random sampling imputation:

As the name suggests, for each missing value in the variable we take a random observation from that variable and use it to impute the missing value.

Note that it does not use the same value for every missing entry in the variable, which results in less distortion; each time, it takes a random value from that particular variable to do the imputation.
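
A minimal pandas sketch (hypothetical values): for each missing entry we draw a random observed value, with replacement, from the same variable.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 7.0, 5.0, np.nan, 9.0])

missing = s.isna()

# Draw one random observed value per missing entry (with replacement).
samples = s.dropna().sample(n=int(missing.sum()), replace=True, random_state=0)
s.loc[missing] = samples.to_numpy()
print(s.tolist())
```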

Assumptions:

1. Data is MCAR.

2. Data is normally distributed (to avoid selecting outliers).

Advantages:

1. Easy to implement.

2. It is a fast way to obtain a complete dataset.

3. It distorts the original variable distribution less, resulting in less change in variance.

4. Due to the smaller change in variance, it also distorts the covariance with the remaining variables less.

5. So even with a higher percentage of missing values, the distortion stays lower than with single-value imputation.

Disadvantages:

1. When the percentage of missing values is very high, the distortion still becomes significant.

2. Because the values are chosen randomly, there is a chance of randomly picking an outlier if the data contains outliers, which results in poor performance.

2. Most frequent value imputation (mode imputation):

As the name suggests, it consists of replacing all missing values within a variable with the mode, i.e. the most frequent value (category/number).

Assumptions:

1. Data is MCAR.

2. No more than 5% of the values in the variable are missing.

Advantages:

1. Easy to implement.

2. It is a fast way to obtain a complete dataset.

Disadvantages:

1. May lead to over-representation of the most frequent label if there are a lot of missing observations.

Problem with the mode:

Most of the time the mode is not meaningful for numerical data, because a given numerical value rarely occurs more than once.

The mode is well defined for categorical data, so mode imputation is mostly used for categorical variables and rarely for numerical ones.
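
A minimal sketch with SimpleImputer(strategy="most_frequent"), which works for both categorical and numerical columns:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(X).ravel())  # nan -> 'red' (the mode)
```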

3. Model based methods:

1. KNN:

As the name suggests, we use the nearest neighbors to fill in the missing values.

Algorithm (a code sketch using scikit-learn follows the steps):

1. Choose any missing value and note the row and variable of the selected missing value; call them R(missing) and Var(missing).

2. Then get the other values in R(missing) and call them S (other values). That means if the i-th cell of the row is the selected missing value, the other cells of that row are considered as S.

3. Also choose the number of nearest neighbors (n).

4. Then calculate the distance between S (other values) and all other rows using the nan_euclidean distance metric.

Why can we not use the euclidean distance?

Problem with the euclidean distance:

When we have missing values, we cannot use the euclidean distance, because the euclidean distance does not support missing values in the calculation.

Solution:

To rectify the above problem we use the "nan_euclidean distance metric".

nan_euclidean distance:

The nan_euclidean metric is used to calculate the distance between rows in the presence of missing values:

nan_euclidean distance = sqrt(weight × sum of squared differences over the non-nan coordinates)

where

weight = total number of coordinates / number of present coordinates

What is the purpose of the weight?

When we calculate the distance between S (other values) and another row, we only use the non-nan coordinates.

So some distance calculations use all the coordinates of S and the other row, and some use only part of them, which biases the distance calculation.

To rectify this bias in the distance calculation, we use the weight.

5. Then choose the n data points closest to the missing value, based on the distances calculated with the nan_euclidean metric.

6. Then get the values of the n selected data points in Var(missing).

7. Then impute the missing value with the mean of the n selected values in Var(missing).

There are two ways of imputing the missing value with a mean:

1. uniform

2. distance

1. uniform:

This is the normal mean calculation: simply take the selected values in Var(missing) and compute their mean.

uniform_imputation = (x1 + x2 + … + xn) / n

2. distance:

As the name suggests, we take the distance of each selected data point into account in the mean calculation (an inverse-distance weighted mean):

distance_imputation = ((1/d1)·x1 + (1/d2)·x2 + … + (1/dn)·xn) / ((1/d1) + (1/d2) + … + (1/dn))

Through this we give more importance/weight to the nearest data points when filling the missing value.

8. Repeat the above steps until all the missing values are imputed.
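
scikit-learn's KNNImputer implements this algorithm (nan_euclidean distances plus a uniform or distance-weighted neighbor mean). A minimal sketch on the small dataset from the worked example below; note that sklearn counts all columns when computing the weight, so the absolute distances differ slightly from the hand calculation, but the value imputed for var(1) in row 2 is the same (5):

```python
import numpy as np
from sklearn.impute import KNNImputer

# The dataset from the worked example below (row 3 has var(4) = 5).
X = np.array([[np.nan, np.nan, 5.0, 8.0],
              [np.nan, 3.0, np.nan, 2.0],
              [5.0, 2.0, 7.0, 5.0]])

# n_neighbors=1 with distance weighting, matching the worked example.
imputer = KNNImputer(n_neighbors=1, weights="distance")
print(imputer.fit_transform(X))
```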

Assumption:

1. Data is MAR.

Example: let's take the dataset below.

      var(1)  var(2)  var(3)  var(4)
1.    nan     nan     5       8
2.    nan     3       nan     2
3.    5       2       7       5

Step 1: Select the missing value and note the row and column of the missing value.

In this case I select the var(1) missing value in row 2, so the row of the selected missing value is 2 and the column is var(1).

Step 2: Then get the other values of the row with that missing value.

In this case the other values of that row are:

      var(2)  var(3)  var(4)
2.    3       nan     2

Step 3: Also choose the number of nearest neighbors (n).

In this case I take n = 1.

Step 4: Then calculate the distance between the other values and all other rows using the nan_euclidean distance metric.

      var(2)  var(3)  var(4)
1.    nan     5       8
2.    3       nan     2
3.    2       7       5

1. nan_euclidean_distance(2nd row, 1st row):

      var(2)  var(3)  var(4)
1.    nan     5       8
2.    3       nan     2

Here,

total number of coordinates: 3 [(3, nan), (nan, 5), (2, 8)]

present coordinates (non-nan pairs): 1 [(2, 8)]

weight = 3 / 1 = 3

nan_euclidean_distance(2nd row, 1st row) = sqrt(3 × (2 − 8)²) = sqrt(108) = 10.39

2. nan_euclidean_distance(2nd row, 3rd row):

      var(2)  var(3)  var(4)
2.    3       nan     2
3.    2       7       5

Here,

total number of coordinates: 3 [(3, 2), (nan, 7), (2, 5)]

present coordinates (non-nan pairs): 2 [(3, 2), (2, 5)]

weight = 3 / 2 = 1.5

nan_euclidean_distance(2nd row, 3rd row) = sqrt(1.5 × ((3 − 2)² + (2 − 5)²)) = sqrt(15) = 3.87

Step 5: Choose the n nearest neighbors.

In this case n = 1, so I choose the 1 data point nearest to the missing value based on the distances.

In this case the 3rd row is nearest to the missing data, because the distance between the 2nd and 3rd rows is the smallest.

Step 6: Then get the value of the selected data point in Var(missing).

In this case Var(missing) is var(1), so from var(1) I select the value of the chosen nearest neighbor.

The value of the selected neighbor in var(1) is 5.

Step 7: Using the selected value, calculate the mean value and do the imputation.

In this case I use the "distance method".

But here we have only one neighbor, so the inverse-distance weights cancel out and we directly use that value for the imputation:

imputed value = ((1/3.87) × 5) / (1/3.87) = 5

After the imputation:

      var(1)  var(2)  var(3)  var(4)
1.    nan     nan     5       8
2.    5       3       nan     2
3.    5       2       7       5

Step 8: Perform the steps again, selecting the other nan values; repeat this process until all the nan values are imputed.

NOTE: In var(1) we took the row 2 missing value and found the imputed value only for it, not for the row 1 missing value of var(1); the imputation for that missing value has to be done separately.
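
As a cross-check, the hand-computed distances match sklearn's nan_euclidean_distances when applied, as in step 4, to the rows without var(1):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

rows = np.array([[np.nan, 5.0, 8.0],   # row 1 without var(1)
                 [3.0, np.nan, 2.0],   # row 2 without var(1), has the missing value
                 [2.0, 7.0, 5.0]])     # row 3 without var(1)

# Distances from row 2 to every row.
d = nan_euclidean_distances(rows[1:2], rows)
print(d.round(2))  # [[10.39  0.    3.87]]
```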

Advantages:

1. It can be more accurate than simple imputation in some cases.

Disadvantages:

1. Computationally expensive, because KNN works by storing the whole training dataset in memory and computing distances to every row.

2. KNN is quite sensitive to outliers in the data (unlike SVM).

4. Other methods:

1. Adding a "Missing" indicator:

The missing indicator is an additional binary variable that indicates whether the value was missing or not.

The missing indicator is used together with another imputation method.

Through this you can tell the model which values were missing, because sometimes there is a relationship between the reason for the missingness and the target variable (the one you are trying to predict).
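
A minimal sketch: scikit-learn's SimpleImputer(add_indicator=True) appends the binary indicator columns next to the imputed values (the data are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# Mean imputation plus a binary "was missing" column.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
print(imputer.fit_transform(X))
# [[1. 0.]
#  [2. 1.]
#  [3. 0.]]
```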

Assumption:

1. Data is NMAR.

Advantages:

1. Easy to implement.

2. It is a fast way to obtain a complete dataset.

Disadvantages:

1. It expands the feature space (one extra binary column per variable with missing values).

That's all folks! We will see multiple imputation in the next article. If anything is wrong, please correct me, and do give feedback. Thank you, and have a nice day.
