## Data Science Secrets

# Guide to Handling Missing Values in Data Science

Types of Imputation and When to Use Them

Missing values are the Achilles' heel of a data scientist. If they are not handled properly, the entire analysis can be futile and produce misleading results, which could harm the business stakeholders.

## Types of Missing Data:

D. B. Rubin (1976) classified missing data problems into three categories. In his theory, every data point has some likelihood of being missing. The process that governs these probabilities is called the ‘missing data mechanism’ or ‘response mechanism’. The model for the process is called the ‘missing data model’ or ‘response model’.

Rubin’s distinction sets the conditions under which a missing data handling method can provide valid statistical inferences.

**Missing Completely at Random (MCAR)**

If the probability of being missing is the same for all cases, then the data are said to be missing completely at random (MCAR). This effectively implies that causes of the missing data are unrelated to the data. It is safe to ignore many of the complexities that arise because of the missing data, apart from the obvious loss of information. Most simple fixes only work under the restrictive and often unrealistic MCAR assumption.

**Example:** Suppose you estimate the gross annual income of households within a certain population, using data obtained via questionnaires. In the case of MCAR, the missingness is completely random, as if some questionnaires were lost by mistake.

**Missing at Random (MAR)**

If the probability of being missing is the same only within groups defined by the *observed* data, then the data are missing at random (MAR). It is more general and more realistic than MCAR. Modern missing data methods generally start from the MAR assumption.

**Example:** Suppose some household income information is missing. In the case of MAR, the missingness is random within subgroups defined by other observed variables. For instance, suppose you also collected the profession of each subject in the questionnaire and found that managers, VIPs, etc. are less likely to share their income. Then, within each profession subgroup, the missingness is random.

**Missing Not at Random (MNAR)**

If neither MCAR nor MAR holds, then we speak of missing not at random (MNAR). In the literature one can also find the term NMAR (not missing at random) for the same concept. MNAR means that the probability of being missing varies for reasons that are unknown to us. It includes, for example, the possibility that a weighing scale produces more missing values for heavier objects, a situation that might be difficult to recognize and handle. An example of MNAR in public opinion research occurs if those with weaker opinions respond less often. MNAR is the most complex case. Strategies to handle MNAR are to find more data about the causes of the missingness, or to perform what-if analyses to see how sensitive the results are under various scenarios.

**Example:** In the case of MNAR, the reason for the missingness depends on the missing values themselves. For instance, suppose people with a low income do not want to share it because they are ashamed of it.
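To make the three mechanisms concrete, here is a minimal NumPy simulation of the income example. The income levels, the share of managers and the missingness probabilities are all made-up assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: profession is always observed, income may be missing.
is_manager = rng.random(n) < 0.2
income = np.where(is_manager, 90_000, 40_000) + rng.normal(0, 10_000, n)

# MCAR: every income has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends only on the observed profession.
mar_mask = rng.random(n) < np.where(is_manager, 0.5, 0.1)

# MNAR: the chance depends on the income value itself.
mnar_mask = rng.random(n) < np.where(income < 35_000, 0.5, 0.1)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: true mean {income.mean():.0f}, observed mean {observed_mean:.0f}")
```

In this setup the observed mean stays close to the true mean only under MCAR. Under MAR the bias could be corrected by using the profession; under MNAR nothing in the observed data reveals it.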

## Ways to Handle Missing Values

When it comes to handling missing values, you can take the easy way or you can take the professional way.

**The Easy Way:**

**Ignore tuples with missing values:** This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

It is an option only if the tuples containing missing values make up about 2% of the data or less. Works with MCAR.

**Drop missing values:** Only ideal if you can afford to lose a bit of data.

It is an option only if the missing values make up 2% of the whole dataset or less.

Do not use this as your first approach.
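As a rough illustration, the check and the drop could look like this in pandas. The frame below uses made-up toy values with mtcars-style column names; it is a sketch, not a recommendation of any threshold.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with a few missing cells.
df = pd.DataFrame({
    "mpg": [21.0, 22.8, np.nan, 18.7, 24.4],
    "hp": [110, 93, 110, np.nan, 62],
    "wt": [2.62, 2.32, 3.21, 3.44, 3.19],
})

# Share of rows that contain at least one missing value.
share_incomplete = df.isna().any(axis=1).mean()

# Listwise deletion: keep only fully observed rows.
complete_rows = df.dropna()

# Drop a row only when one specific column is missing.
rows_with_mpg = df.dropna(subset=["mpg"])

print(f"{share_incomplete:.0%} of rows are incomplete")
```

Checking `share_incomplete` first shows how much data a drop would cost before committing to it.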

**Leave it to the algorithm:** Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction (e.g. XGBoost). Others have an option to simply ignore them (e.g. LightGBM with *use_missing=false*). Still others throw an error when they encounter missing values (e.g. scikit-learn's LinearRegression).

It is an option only if the missing values make up about 5% of the data or less. Works with MCAR.

## The Professional Way:

The drawback of dropping missing values is that you lose an entire row just because of a few missing values. That is a lot of valuable data. So instead of dropping the missing values, or ignoring them in the case of tuples, try filling them in with a well-calculated estimate. Or, as the professionals call it, impute the missing values.

## Types of Imputation

The built-in dataset MTcars is used to demonstrate each method.

## Easy Imputations

## Mean/Median Imputation a.k.a Constant Values Imputation

Calculate the mean (or median) of the observed, non-missing values of a variable and use it to fill in every missing value of that variable. Mean imputation has the advantage of keeping the same mean and the same sample size.

**Advantages:**

- Quick and easy
- Ideal for small numerical datasets

**Disadvantages:**

- Doesn’t factor in the correlations between features. It only works on the column level.
- Will give poor results on encoded categorical features (do NOT use it on categorical features).
- Not very accurate.
- Doesn’t account for the uncertainty in the imputations.
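A minimal pandas sketch of mean and median imputation on a single made-up column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

hp = pd.Series([110, 93, np.nan, 175, 62, np.nan, 245], name="hp")

mean_filled = hp.fillna(hp.mean())      # mean() skips NaN by default
median_filled = hp.fillna(hp.median())  # median() skips NaN as well

# The mean is preserved, but the spread shrinks because every
# filled value sits exactly at the centre of the distribution.
print("mean:", hp.mean(), "->", mean_filled.mean())
print("std: ", hp.std(), "->", mean_filled.std())
```

The shrinking standard deviation is one way to see the last disadvantage above: the filled values look more certain than they are.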

**Most Frequent (Values) Imputation**

**Most frequent** is another statistical strategy to impute missing values. It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent value within each column.

**Advantages:**

- Works well with categorical features.

**Disadvantages:**

- It also doesn’t factor in the correlations between features.
- It can introduce bias in the data.
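A short pandas sketch of most-frequent imputation on a made-up categorical column:

```python
import pandas as pd

gear = pd.Series(["4", "4", None, "3", "5", None, "4"], name="gear")

# mode() ignores missing values and returns all tied values sorted;
# take the first one.
most_frequent = gear.mode().iloc[0]
gear_filled = gear.fillna(most_frequent)

print(gear_filled.value_counts())
```

The value counts after filling show where the bias comes from: the dominant category grows at the expense of all others.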

## Zeros Imputation

It replaces the missing values with either zero or any constant value you specify.

This works well when the missing value carries no information for your analysis but the algorithm requires a number in order to produce results.
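In pandas, a constant fill can be sketched as follows; the columns and the placeholder label "unknown" are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "carb": [4, np.nan, 1, np.nan],
    "model": ["Mazda RX4", None, "Datsun 710", None],
})

# A different constant per column: 0 for the numeric one,
# a placeholder label for the text one.
filled = df.fillna({"carb": 0, "model": "unknown"})
print(filled)
```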

## Intermediate Imputations

## Hot deck imputation

A randomly chosen value from an individual in the sample who has similar values on other variables. In other words, find all the sample subjects who are similar on other variables, then randomly choose one of their values on the missing variable.

For a good result, always change the value for ‘initialvalues’ to anything other than 0.

**Advantages:**

- Constrained to only possible values
- The random component adds in some variability. This is important for accurate standard errors.
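One simple way to sketch hot deck imputation is to treat rows that share a grouping variable as "similar" and draw a random donor from that group. The helper `hot_deck_impute` and its parameters are hypothetical names for this illustration, and the data is made up.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df, target, group_col, seed=None):
    """Fill missing `target` values with a random donor from the same group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, group in df.groupby(group_col):
        donors = group[target].dropna().to_numpy()
        missing_idx = group.index[group[target].isna()]
        if len(donors) == 0 or len(missing_idx) == 0:
            continue  # nothing to donate or nothing to fill
        out.loc[missing_idx, target] = rng.choice(donors, size=len(missing_idx))
    return out

cars = pd.DataFrame({
    "cyl": [4, 4, 4, 6, 6, 8, 8, 8],
    "mpg": [22.8, np.nan, 24.4, 21.0, np.nan, 18.7, 14.3, np.nan],
})
imputed = hot_deck_impute(cars, target="mpg", group_col="cyl", seed=42)
print(imputed)
```

Because every filled value is copied from a real row, the imputations can only take values that actually occur in the data.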

## Cold Deck Imputation

A systematically chosen value from an individual who has similar values on other variables. This is similar to hot deck in most ways, but removes the random variation.

By keeping the ‘initialvalues’ at 5, we avoid randomness.
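One possible reading of "systematically chosen" is to always take the donor that is closest on a matching variable, so that repeated runs give the same result. The helper `cold_deck_impute` below is a hypothetical sketch under that assumption, on made-up data.

```python
import numpy as np
import pandas as pd

def cold_deck_impute(df, target, match_col):
    """Fill each missing `target` with the value of the donor closest on `match_col`."""
    out = df.copy()
    donors = df.dropna(subset=[target])
    for idx in df.index[df[target].isna()]:
        distance = (donors[match_col] - df.loc[idx, match_col]).abs()
        # idxmin() returns the first closest donor, so ties resolve deterministically.
        out.loc[idx, target] = donors.loc[distance.idxmin(), target]
    return out

cars = pd.DataFrame({
    "wt": [2.3, 2.6, 3.2, 3.4, 3.6, 5.3],
    "mpg": [22.8, 21.0, np.nan, 18.7, np.nan, 10.4],
})
imputed = cold_deck_impute(cars, target="mpg", match_col="wt")
print(imputed)
```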

## Regression Imputation

The predicted value obtained by regressing the missing variable on other variables. So instead of just taking the mean, you’re taking the predicted value, based on other variables. This preserves relationships among variables involved in the imputation model, but not variability around predicted values.
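A minimal NumPy sketch of regression imputation with a single predictor. The data is synthetic, and the linear relationship between the hypothetical `wt` and `mpg` variables is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
wt = rng.uniform(1.5, 5.5, size=50)
mpg = 37.0 - 5.0 * wt + rng.normal(scale=2.0, size=50)
mpg[rng.choice(50, size=10, replace=False)] = np.nan  # 10 missing values

observed = ~np.isnan(mpg)

# Fit mpg ~ intercept + slope * wt on the complete cases only.
design = np.column_stack([np.ones(observed.sum()), wt[observed]])
coef, *_ = np.linalg.lstsq(design, mpg[observed], rcond=None)

# Replace each missing mpg with its fitted value.
mpg_imputed = mpg.copy()
mpg_imputed[~observed] = coef[0] + coef[1] * wt[~observed]
```

Every imputed point lies exactly on the fitted line, which is why the variability around the predictions is lost.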

## Stochastic Regression Imputation

The predicted value from a regression plus a random residual value. This has all the advantages of regression imputation but adds in the advantages of the random component.
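The stochastic variant can be sketched by adding noise to each fitted value. Drawing the noise from a normal distribution with the estimated residual standard deviation is one common choice and is assumed here; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
wt = rng.uniform(1.5, 5.5, size=200)
mpg = 37.0 - 5.0 * wt + rng.normal(scale=2.0, size=200)
mpg[rng.choice(200, size=60, replace=False)] = np.nan
observed = ~np.isnan(mpg)

# Fit the regression on the complete cases.
design = np.column_stack([np.ones(observed.sum()), wt[observed]])
coef, *_ = np.linalg.lstsq(design, mpg[observed], rcond=None)

# Estimate the residual spread (ddof=2 for the two fitted coefficients).
residuals = mpg[observed] - design @ coef
residual_sd = residuals.std(ddof=2)

fitted = coef[0] + coef[1] * wt[~observed]  # plain regression imputation
stochastic = fitted + rng.normal(scale=residual_sd, size=fitted.size)

mpg_imputed = mpg.copy()
mpg_imputed[~observed] = stochastic
```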

## Imputation Using k-NN:

The *k*-nearest neighbours (k-NN) algorithm is commonly used for simple classification. It uses ‘**feature similarity**’ to predict the values of new data points. This makes it useful for imputing missing values: find the *k* closest neighbours of the observation with missing data, then impute based on the non-missing values in that neighbourhood.

The process is as follows: basic mean impute -> KDTree -> compute nearest neighbours (NN) -> weighted average.

**Pros:**

- Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).

**Cons:**

- Computationally expensive. KNN works by storing the whole training dataset in memory.
- K-NN is quite sensitive to outliers in the data.
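The pipeline described above can be sketched in plain NumPy. This version swaps the KDTree for a brute-force distance computation to stay dependency-free, and it assumes every column has at least one observed value. The function name `knn_impute` and the inverse-distance weights are choices made for this illustration.

```python
import numpy as np

def knn_impute(X, k=3):
    """Impute NaNs with an inverse-distance weighted average of the k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: basic mean impute so that distances can be computed.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    out = filled.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Step 2: distance from row i to every row (brute force, no KDTree).
        dist = np.linalg.norm(filled - filled[i], axis=1)
        for j in np.flatnonzero(missing[i]):
            # Only rows where column j was actually observed can donate.
            candidates = np.flatnonzero(~missing[:, j])
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            # Step 3: closer neighbours get larger weights.
            weights = 1.0 / (dist[nearest] + 1e-9)
            out[i, j] = np.average(X[nearest, j], weights=weights)
    return out

X = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, np.nan],
    [0.9, 1.1, 12.0],
    [9.0, 9.0, 30.0],
    [9.1, 8.9, 32.0],
])
X_imputed = knn_impute(X, k=2)
print(X_imputed[1])
```

Note that the placeholder mean from step 1 takes part in the distance computation, so features on very different scales may need standardizing first.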

## Imputation Using Multivariate Imputation by Chained Equation (MICE)

This type of imputation works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a single imputation because they capture the uncertainty of the missing values in a better way. The chained equations approach is also very flexible and can handle variables of different data types (e.g. continuous or binary) as well as complexities such as bounds or survey skip patterns.
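A MICE-style workflow can be approximated in Python with scikit-learn's `IterativeImputer`, which is still marked experimental and needs an explicit enabling import. The sketch below draws five completed datasets from synthetic data and pools a simple estimate. The variables, the number of imputations and the pooled statistic are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 300
wt = rng.uniform(1.5, 5.5, size=n)
hp = 40.0 * wt + rng.normal(scale=20.0, size=n)
mpg = 37.0 - 5.0 * wt + rng.normal(scale=2.0, size=n)
X = np.column_stack([wt, hp, mpg])

X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan  # about 15% of cells missing

# Draw several completed datasets; sample_posterior=True makes each draw random.
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
]

# Analyse each dataset separately, then pool. Here the "analysis" is the mean of mpg.
estimates = np.array([data[:, 2].mean() for data in completed])
pooled_mean = estimates.mean()
between_imputation_var = estimates.var(ddof=1)

print(f"pooled mean of mpg: {pooled_mean:.2f}")
print(f"between-imputation variance: {between_imputation_var:.4f}")
```

The spread of the estimates across the completed datasets is what a single imputation cannot provide: it reflects how uncertain the filled-in values are.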