Ways to impute missing values in the data.

Abhigyan
Published in Analytics Vidhya
4 min read · May 10, 2020

Missing data presents a problem in many fields, including data science and machine learning. Values can be missing at random places throughout the dataset, in a specific column, in recurring patterns, or in large sections (more than 50% of a column).

Missing data occurs when no value is stored for a variable in an observation. It is a common problem and can have a significant effect on the conclusions that can be drawn from the data.
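Before choosing a strategy, it helps to see how much is missing and where. The snippet below is a minimal sketch with pandas; the DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan, 38],
    "city": ["Pune", "Delhi", None, "Delhi", "Mumbai", None],
    "income": [np.nan, np.nan, np.nan, 52000.0, np.nan, 61000.0],
})

# Count and share of missing values in each column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Columns where more than half of the values are missing.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_count)
print(mostly_missing)
```

Here only the hypothetical `income` column crosses the 50% mark, since four of its six values are missing.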

How does missing data affect the analysis?

  1. Incomplete datasets can lead to misleading conclusions.
  2. The absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false.
  3. Missing values bias the estimation of parameters.
  4. The samples become less representative.

Missing data comes in different types. Check out this link if you want to know about them; otherwise, feel free to skip ahead to the ways to impute them.

What is data imputation?

Imputation is the process of replacing missing data with approximate values. Instead of deleting any column or row that has a missing value, this approach preserves all cases by replacing the missing data with a value estimated from other available information.

Different types of missing data need to be handled differently, as shown in the picture below.

Now, the fourth category, structured missing data, cannot be treated like the others, because these values are not meant to contain any information. Simply impute them with 0 if the column is numeric, or with a separate category (such as "Not specified") if it is of object type.
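As a rough sketch of that rule with pandas, assume a made-up survey where people who do not own a car have no car details by design. The column names and the "Not specified" label are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: people without a car have no car details by design.
df = pd.DataFrame({
    "owns_car": [True, False, True, False],
    "car_age_years": [3.0, np.nan, 7.0, np.nan],
    "car_brand": ["Tata", None, "Honda", None],
})

filled = df.copy()
# Numeric column: 0 stands in for "not applicable".
filled["car_age_years"] = filled["car_age_years"].fillna(0)
# Object (string) column: use an explicit placeholder category.
filled["car_brand"] = filled["car_brand"].fillna("Not specified")

print(filled)
```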

Note: Imputation of missing data is done under the assumption that the data is Missing at Random (MAR).

Easy Ways to impute missing data!

1. Mean/Median Imputation: In mean or median substitution, the mean or median of a variable's observed values is used in place of the missing values of that same variable.
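A minimal sketch of both variants with pandas is shown here; the `income` values are invented, and one of them is deliberately an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries and one outlier.
income = pd.Series([30.0, 32.0, np.nan, 35.0, 200.0, np.nan], name="income")

# Statistics are computed from observed values only (NaN is skipped by default).
mean_value = income.mean()      # (30 + 32 + 35 + 200) / 4 = 74.25
median_value = income.median()  # midpoint of 32 and 35 = 33.5

mean_imputed = income.fillna(mean_value)
median_imputed = income.fillna(median_value)

print(mean_imputed.tolist())
print(median_imputed.tolist())
```

On this toy column the outlier pulls the mean far above most observed values, while the median stays close to them, which is one reason to prefer the median for skewed data.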

Pros :

  • This imputation is the simplest to understand and apply.
  • It can be a rational approach if the univariate average of your variables is the only metric you are interested in.
  • Easy and fast.

Cons :

  • Mean substitution leads to bias in multivariate estimates such as correlation or regression coefficients.
  • Not very accurate.
  • Doesn’t account for the uncertainty in the imputations.
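The first drawback can be seen on a small example. The sketch below uses made-up numbers in which `y` grows with `x` and every second `y` is missing; it compares the spread of `y` and its correlation with `x` before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y rises with x, and every second y value is missing.
df = pd.DataFrame({
    "x": np.arange(10, dtype=float),
    "y": [1.0, np.nan, 5.0, np.nan, 7.0, np.nan, 13.0, np.nan, 15.0, np.nan],
})

observed = df.dropna()                          # complete rows only
imputed = df.fillna({"y": df["y"].mean()})      # mean imputation of y

# Spread of y before and after imputation.
std_observed = observed["y"].std()
std_imputed = imputed["y"].std()

# Pearson correlation between x and y before and after imputation.
corr_observed = observed["x"].corr(observed["y"])
corr_imputed = imputed["x"].corr(imputed["y"])

print(std_observed, std_imputed)
print(corr_observed, corr_imputed)
```

Every imputed point sits exactly at the mean, so the standard deviation shrinks and, on this data, the correlation with `x` is weaker than among the complete rows.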

Note: Mean and median imputation work only with numerical data. Trying mean or median imputation on a categorical variable makes no sense.

However, when the missing values are not strictly random, and especially when the number of missing values differs greatly between variables, mean and median substitution may introduce inconsistent bias.

2. Mode Substitution: In mode substitution, the most frequently occurring value of a categorical variable is used in place of the missing values of that same variable.

Pros :

  • Can be used on categorical features.

Cons :

  • Bias is introduced in the data, because the most frequent category becomes over-represented.
  • It also ignores the correlations between features.
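Mode substitution can be sketched with pandas in a couple of lines. The `city` column below is invented for illustration.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
city = pd.Series(
    ["Delhi", "Pune", None, "Delhi", "Mumbai", None, "Delhi"], name="city"
)

# mode() ignores missing values and returns all tied values in sorted order,
# so [0] takes the first one (ties would be broken alphabetically here).
most_frequent = city.mode()[0]

mode_imputed = city.fillna(most_frequent)

print(most_frequent)
print(mode_imputed.value_counts())
```

After imputation "Delhi" accounts for five of the seven rows instead of three of the five observed ones, which illustrates the bias listed above.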

Other Imputation Methods:

  1. Regression Imputation: In regression imputation, a model fitted on the existing variables predicts the missing value, and the predicted value is imputed in its place. This approach has a number of advantages: the imputation retains a great deal of data and avoids significantly altering the standard deviation or the shape of the distribution. However, as with mean substitution, the predicted value adds no new information, even though the sample size is increased and the standard error is reduced.
  2. Maximum Likelihood: In statistics, a maximum likelihood estimator (MLE) estimates the parameters of an assumed distribution by choosing the values that maximize the likelihood function of the observed data.
  3. Stochastic Regression Imputation: This is quite similar to regression imputation. The missing values are predicted by regressing on other related variables in the same dataset, and a random residual value is then added to each prediction.
  4. Hot-Deck Imputation: Works by randomly choosing the value to impute from a set of similar observations in the same dataset.
  5. Cold-Deck Imputation: Uses a systematically chosen value from an individual who has similar values on other variables. This is similar to hot-deck imputation in most ways, but removes the random variation.
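Three of these methods (regression, stochastic regression, and hot-deck imputation) can be sketched with NumPy on a tiny made-up dataset. This is only one simple way to set each of them up: the data, the straight-line model, and the "two nearest donors" rule for hot-deck are all assumptions chosen for illustration. Maximum likelihood and cold-deck imputation are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends roughly linearly on x; two y values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8, np.nan, 14.1, 16.0])
missing = np.isnan(y)

# Regression imputation: fit y = slope * x + intercept on the complete rows,
# then predict y for the rows where it is missing.
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
predicted = slope * x[missing] + intercept
y_regression = y.copy()
y_regression[missing] = predicted

# Stochastic regression imputation: add random noise with the spread of the
# fitted residuals, so imputed points do not sit exactly on the line.
residuals = y[~missing] - (slope * x[~missing] + intercept)
noise = rng.normal(0.0, residuals.std(ddof=2), size=missing.sum())
y_stochastic = y.copy()
y_stochastic[missing] = predicted + noise

# Hot-deck imputation (one simple variant): for each incomplete row, copy y
# from a randomly drawn donor among the two complete rows closest in x.
y_hot_deck = y.copy()
donor_idx = np.flatnonzero(~missing)
for i in np.flatnonzero(missing):
    nearest = donor_idx[np.argsort(np.abs(x[donor_idx] - x[i]))[:2]]
    y_hot_deck[i] = y[rng.choice(nearest)]

print(y_regression)
print(y_stochastic)
print(y_hot_deck)
```

Replacing the random draw in the hot-deck step with a fixed rule would remove the random variation, which is closer in spirit to the cold-deck description above.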

There are some set rules to decide which strategy to use for particular types of missing values, but the best way is to experiment and check which model works best for your dataset.
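One way to run such an experiment is to hide some values you actually know, impute them with each candidate strategy, and measure how far off the imputations are. The sketch below does this for mean and median imputation on synthetic data; the distribution, the 20% masking rate, and the mean absolute error metric are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the comparison is repeatable

# Hypothetical complete column with a skewed distribution (a few large values).
true_values = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=200))

# Hide about 20% of the known values at random to simulate missing data.
hidden = rng.random(len(true_values)) < 0.2
with_gaps = true_values.mask(hidden)

# Candidate strategies, each computed from the remaining observed values.
candidates = {
    "mean": with_gaps.mean(),
    "median": with_gaps.median(),
}

# Mean absolute error of each strategy on the hidden entries only.
errors = {
    name: (true_values[hidden] - fill_value).abs().mean()
    for name, fill_value in candidates.items()
}
best = min(errors, key=errors.get)

print(errors)
print(best)
```

The same loop extends to other strategies by adding entries to `candidates`. Keep in mind that the error metric is itself a choice and can favour one strategy over another.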
