Different types of missing values & approaches to deal with them

Swarnima Pandey
Published in Analytics Vidhya · 3 min read · Jul 21, 2020

Before handling missing values, we should know which types of them exist in the data science world. Most sources on the web describe three types, but some research papers add one more. Let me introduce all four very briefly:

  1. Structurally Missing Data: Suppose we have the results of a university's students for a particular semester, and some of the result values are missing. This may happen because some students dropped out before the exams or were absent. The value is missing because it could not logically exist, so it is a structurally missing value. In this case, a simple solution is to insert 0 at those missing places, provided 0 is a sensible stand-in for "no result" in your analysis.
  2. MCAR (Missing Completely at Random): The missing values are randomly distributed over the entire dataset. MCAR occurs when the missingness is related neither to the values of the variable in question nor to the values of any other variable under analysis. For example, data are missing for respondents whose questionnaire was lost. Say you have 15 complete questionnaires and 10 incomplete ones. We can compare these two groups on the other observed variables with a test such as a t-test, and if we find no significant difference in means between the two groups, it is reasonable to assume the data are MCAR.
  3. MAR (Missing at Random): Data are not missing randomly across the entire dataset, but are missing randomly within subsamples of the data. The probability of missing data on a variable is related to some other measured variable in the model, but not to the value of the variable itself. For example, in an IQ dataset, only older people have missing values, so the probability of a missing IQ score is related to age. Assuming MAR is hard to justify, because there is no way of testing it from the observed data alone.
  4. NMAR (Not Missing at Random): The probability that a value is missing depends on the missing value itself, even after accounting for the other observed variables, so we can't treat it as missing at random. In this case we often can't draw conclusions about the missing values from the observed data alone.
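
To make the t-test idea from the MCAR item concrete, here is a minimal sketch. It assumes pandas, NumPy and SciPy are available. The helper name `missingness_ttest` and the toy `age`/`iq` data are hypothetical and only for illustration. The toy data mimics the IQ example above, so the test should flag a difference in age between the two groups.

```python
import numpy as np
import pandas as pd
from scipy import stats


def missingness_ttest(df, target, other):
    """Welch t-test on `other`, split by whether `target` is missing."""
    is_missing = df[target].isna()
    group_missing = df.loc[is_missing, other].dropna()
    group_observed = df.loc[~is_missing, other].dropna()
    t_stat, p_value = stats.ttest_ind(
        group_missing, group_observed, equal_var=False
    )
    return float(t_stat), float(p_value)


# Hypothetical toy data: IQ scores are missing only for people aged 60+.
ages = np.arange(20, 80)
iq = np.full(ages.shape, 100.0)
iq[ages >= 60] = np.nan
toy = pd.DataFrame({"age": ages, "iq": iq})

t_stat, p_value = missingness_ttest(toy, target="iq", other="age")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

A small p-value here says that people with a missing IQ score differ in age from those with an observed one, which argues against MCAR. A large p-value does not prove MCAR; it only means this particular check found no evidence against it.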

Some common approaches to deal with these types of missing data:

  1. Simple one: Drop the corresponding column/row:
df.dropna()  # df is a pandas DataFrame; drops rows with missing values, use df.dropna(axis=1) to drop columns

We use this approach when the dataset is large and the number of missing values in the affected columns/rows is comparatively low.
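
Before dropping anything, it helps to look at how much is actually missing. The snippet below is a small sketch on a made-up DataFrame (the column names are hypothetical) that checks the share of missing values per column and then drops rows or columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing marks.
df = pd.DataFrame({
    "math": [78.0, np.nan, 91.0, 64.0],
    "physics": [82.0, 75.0, np.nan, 70.0],
    "roll_no": [1, 2, 3, 4],
})

missing_share = df.isna().mean()   # fraction of missing values per column
rows_dropped = df.dropna()         # keep only rows without any missing value
cols_dropped = df.dropna(axis=1)   # keep only columns without any missing value

print(missing_share)
print(rows_dropped.shape, cols_dropped.shape)
```

On such a tiny table, dropping rows throws away half of the data, which is exactly why this approach is best kept for large datasets with few gaps.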

  2. Imputation: It fills each missing value with some number. The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column/row entirely. Some of the imputation techniques are listed below:

a) Mean/Median Imputation: As the name suggests, we replace the missing values in a column with the mean or median of that column's observed values. We use this approach when the number of missing observations is low.
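
As a quick illustration, mean and median imputation can be sketched with pandas `fillna`. The `score` series below is made up. Note how the single large value pulls the mean well above the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one large outlier.
scores = pd.Series([10.0, 20.0, np.nan, 30.0, 100.0], name="score")

# Both statistics are computed on the observed values only (NaN is skipped).
mean_filled = scores.fillna(scores.mean())
median_filled = scores.fillna(scores.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

The median is usually the safer choice when the column is skewed or has outliers.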

b) Multivariate Imputation by Chained Equations (MICE): It assumes that the data are Missing at Random (MAR). It imputes data variable by variable, specifying an imputation model for each variable, and uses all the other variables in the data as predictors.
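
One way to try this idea in Python is scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputed dataset rather than several. Treat the snippet below as a sketch under that assumption, not as a full MICE implementation. The toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental, so it must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: three related numeric columns with a few gaps.
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 51.0, 46.0, 38.0, 29.0, 60.0],
    "income": [30.0, 42.0, 50.0, np.nan, 61.0, 48.0, 35.0, 80.0],
    "spend":  [12.0, 17.0, 20.0, 27.0, np.nan, 19.0, 14.0, 33.0],
})

# Each column with gaps is modelled from the other columns, round after round.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed.round(1))
```

Observed values are left untouched; only the gaps are filled with model predictions.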

  3. Random Forest: Yes, it can also serve as a non-parametric imputation method. It is often reported to work well with data that are missing at random and, in some cases, with data that are not missing at random. It uses multiple decision trees to estimate the missing values and can output OOB (out-of-bag) imputation error estimates.
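
A rough way to experiment with random forest imputation in Python is to plug a `RandomForestRegressor` into scikit-learn's `IterativeImputer`. This is only an approximation of the idea: it is not a dedicated random forest imputation package, and it does not report OOB imputation error. The toy data and parameter values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer is still experimental, so it must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: three related numeric columns with a few gaps.
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 51.0, 46.0, 38.0, 29.0, 60.0],
    "income": [30.0, 42.0, 50.0, np.nan, 61.0, 48.0, 35.0, 80.0],
    "spend":  [12.0, 17.0, 20.0, 27.0, np.nan, 19.0, 14.0, 33.0],
})

# A random forest predicts each column's gaps from the other columns.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
rf_imputer = IterativeImputer(estimator=forest, max_iter=5, random_state=0)
rf_imputed = pd.DataFrame(rf_imputer.fit_transform(df), columns=df.columns)

print(rf_imputed.round(1))
```

Because trees predict averages of observed values, the filled-in numbers always stay within the range already seen in each column.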

There are various other efficient methods for handling missing values, depending on the scenario and the type of data. I have discussed the most common ones here. Hope it was helpful, thanks for reading! Good luck!! Be safe!!
