Handling Missing Values for Machine Learning

Devanshi Shah
10 min read · Sep 16, 2022


Real-world datasets generally contain missing values. Values can be missing for several reasons, e.g., information that was never collected, failure to record data, data corruption, or lost files. Handling missing values is a primary concern because most machine learning algorithms do not accept missing values.

There is no perfect way to handle missing values.

This post covers the different types of missing values and ways to handle them.

Usually, people jump straight to handling missing values without asking why they are missing. We should first try to understand the reason the values are missing, and only then select and apply a technique to handle them.

The reasons for missing data can be divided into three categories:

1) MCAR: If the missing values have no relationship with the column they belong to or with any of the other columns in the dataset, the data is called Missing Completely at Random.

For example: suppose we have a state election dataset with religion, age, gender, and votes columns, where the votes column records each person's vote for a particular political party. We find that the votes column has a few missing values that are unrelated to the age, gender, and religion columns and to the votes column itself.

2) MAR: If the missing values have no relationship with the column they belong to but are related to one or more other columns in the dataset, the data is called Missing At Random.

For example: consider the same election dataset. We identify that people from a particular religion did not come to vote; in other words, values in the votes column are missing whenever the religion column has the value ‘XYZ’. The missing values in the votes column are therefore related to the values in the religion column. This is called Missing At Random.

3) MNAR: If the missing values are related to the column they belong to but not to any other column in the dataset, the data is called Missing Not At Random.

For example: in the same election dataset, suppose votes for a specific party could not be collected for some unknown reason, which is why values are missing for that party in the votes column. Here the missing values in the votes column are related to the column itself, so the data is called Missing Not At Random.

Now, we will discuss techniques to handle missing values.

A. Remove Entire Row or Column: In this technique, we remove the entire row or column wherever missing values are found. It is also known as Complete Case Analysis (CCA) or “list-wise deletion”, which consists of discarding rows where values in any of the columns are missing.

CCA means analyzing only those rows for which there is information in all columns of the dataset.

When to use this technique?

• This is used when the missing data is less than about 5% of a column (remove rows) or greater than 80–90% of a column (remove the column)

• If the values are Missing Completely at Random (MCAR)

Advantages:

• Easy to use and implement as no complex logic is required.

• If a dataset is MCAR, then the distribution of columns after removing the rows will match the distribution of the original dataset.

• For example, plotting a column's distribution before and after removing the rows with missing data shows two nearly identical curves when the data is MCAR.

Disadvantages:

• It can exclude a large fraction of the original dataset

• Excluded observations could be informative for the analysis (if data is not missing at random)

• Because this technique does not learn any imputation logic, when the model is put into production it will not know how to handle missing values in new data
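As an illustration, here is a minimal pandas sketch of CCA; the DataFrame and its column names are made up for the example:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (column names are illustrative)
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40],
    "gender": ["F", "M", "M", np.nan],
    "votes": ["A", np.nan, "B", "A"],
})

# Complete Case Analysis: keep only rows with no missing values
df_cca = df.dropna(axis=0)

# Alternatively, drop an entire column (e.g. when most of it is missing)
df_without_votes = df.drop(columns=["votes"])
```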

B. Imputation Techniques

Now, we will discuss ways to impute missing values in the dataset.

1. Univariate Imputation: In this approach, only a single feature is taken into consideration. For instance, if a value is missing in column A, then only column A is used to impute the missing values.

This imputation technique can be applied to both Categorical and Numerical values.

For Numerical Values, we have 4 different ways to impute them:

1) Mean/Median

2) Arbitrary

3) End of distribution

4) Random

I. Mean/Median Imputation: Missing values can be filled out with the mean/median value of a specific column.

When to use this technique?

• Only when values are missing completely at random

• When the distribution of the data is normal then we should use mean

• When the distribution of the data is skewed then we should use median

• Only when missing data in the column is less than 5%

Advantages:

• Simple and easy to implement

• Easy to implement in production (on the server) as well

Disadvantages:

• It may create extra outliers in the data

• It changes the correlation/covariance in the dataset

• It also changes the distribution of the dataset after imputing the missing values

Two methods to impute data with the mean or median:

1) Using pandas fillna()

2) Using the scikit-learn method (SimpleImputer)
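Here is a minimal sketch of both approaches; the column name and values are made up for the example:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [22, 35, np.nan, 41, np.nan, 29]})

# 1) pandas: fill with the column mean (use .median() for skewed data)
df["age_mean"] = df["age"].fillna(df["age"].mean())

# 2) scikit-learn: SimpleImputer learns the statistic on fit()
imputer = SimpleImputer(strategy="median")   # or strategy="mean"
df["age_median"] = imputer.fit_transform(df[["age"]]).ravel()
```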

II. Arbitrary Value Imputation: In this method, missing values are filled in with a number that does not occur in that specific column. This helps the model distinguish between original values and imputed values. For instance, if we have an “age” column with a few missing values, we fill all of them with either -1 or 101.

When to use this technique?

• Only when data is Not Missing At Random

Advantage:

• Simple and easy to implement

Disadvantages:

• It also changes the distribution of the dataset after imputing the missing values

• It changes the correlation/covariance in the dataset

Methods to impute data with an arbitrary value:

1) Using pandas fillna()

2) Using the scikit-learn method with strategy = “constant”
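A minimal sketch of both approaches, assuming -1 is a value that cannot occur naturally in the column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [22, 35, np.nan, 41, np.nan, 29]})

# 1) pandas: fill with an arbitrary value that never occurs naturally
df["age_arbitrary"] = df["age"].fillna(-1)

# 2) scikit-learn: constant strategy with an explicit fill_value
imputer = SimpleImputer(strategy="constant", fill_value=-1)
df["age_constant"] = imputer.fit_transform(df[["age"]]).ravel()
```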

III. End of Distribution Imputation: In this method, instead of replacing missing values with an arbitrary value, we replace them with a value at the extreme end of the column’s distribution. The reason is that it is sometimes difficult to decide on a sensible arbitrary number, so we push the imputed value beyond the bulk of the data instead. Two cases need to be noted. First, if the column is (approximately) normally distributed, the value used to impute the missing information is mean + 3σ or mean - 3σ. Second, if the column is skewed, we use the IQR proximity rule and compute the value as Q3 + 1.5 × IQR (upper end) or Q1 - 1.5 × IQR (lower end). The goal of this imputation technique is to tell the machine learning model that these are extreme values (outliers), so they are not confused with the original values.

When to use this technique?

• Only when data is Not Missing At Random

Advantages:

• Simple and easy to implement

• Works for both normally distributed and skewed columns (via the 3σ rule or the IQR rule, respectively)

Disadvantages:

• It also changes the distribution of the dataset after imputing the missing values

• It changes the correlation/covariance in the dataset
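Below is a minimal pandas sketch of both rules; the column name and values are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 35, np.nan, 41, np.nan, 29, 55, 38]})

# Normally distributed column: impute at mean + 3 standard deviations
upper_gaussian = df["age"].mean() + 3 * df["age"].std()
df["age_eod_gaussian"] = df["age"].fillna(upper_gaussian)

# Skewed column: IQR proximity rule, Q3 + 1.5 * IQR
q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
upper_iqr = q3 + 1.5 * (q3 - q1)
df["age_eod_iqr"] = df["age"].fillna(upper_iqr)
```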

IV. Random Imputation: In this method, we fill missing values with values drawn at random from the observed (non-missing) values of that column. This method is applicable to both categorical and numerical columns. For instance, if we have an age column with some missing information, each missing entry can be filled by selecting a random value from the observed ages. Because values are sampled in proportion to how often they occur, the imputed column tends to preserve the original distribution.

When to use this technique?

• This technique tends to work better with linear and logistic regression models than with tree-based algorithms

Advantages:

• Well suited for linear models as it does not distort the distribution, regardless of the percentage of missing data

Disadvantages:

• The covariance between columns changes because we introduce randomness into the dataset

• Memory heavy for deployment, as we need to store the original training column to extract values from and replace NA in coming observations
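A minimal pandas sketch of random imputation; the column name is illustrative, and in practice the values should be sampled from the training data only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 35, np.nan, 41, np.nan, 29, 55, 38]})

# Sample as many observed values as there are missing entries
n_missing = df["age"].isna().sum()
sampled = df["age"].dropna().sample(n_missing, random_state=42)

# Align the sampled values with the positions of the missing entries
sampled.index = df[df["age"].isna()].index
df["age_random"] = df["age"].fillna(sampled)
```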

For Categorical missing values, we have 2 different ways to impute them

1) Mode (Most Frequent)

2) Missing/Unknown Indicator

I. Mode (Most Frequent): Since we cannot use the mean or median to impute categorical values, we use the mode, i.e., the value that occurs most frequently in the column. This method can also be applied to numerical values, but mean/median replacement usually works better for them, while mode imputation works better for categorical values.

When to use this technique?

• Only when data is Missing Completely At Random

• The value used to fill the missing entries should be a clear mode, i.e., it should occur far more often than the other values in the column

• Only when missing data is less or equal to 5% in the dataset

Advantages:

• Easy and Simple to implement

• Can be handled with scikit-learn’s SimpleImputer with strategy = “most_frequent”

Disadvantage:

• Changes the distribution of data
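A minimal sketch of mode imputation with an illustrative categorical column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"religion": ["A", "B", "A", np.nan, "A", np.nan]})

# 1) pandas: fill with the most frequent category
df["religion_mode"] = df["religion"].fillna(df["religion"].mode()[0])

# 2) scikit-learn: SimpleImputer with the most_frequent strategy
imputer = SimpleImputer(strategy="most_frequent")
df["religion_sklearn"] = imputer.fit_transform(df[["religion"]]).ravel()
```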

II. Missing/Unknown Indicator: When more than about 10% of a column is missing, we should not use the mode method to impute the missing values because it would change the distribution of the dataset too much. This method is similar to arbitrary value imputation, but instead of a number such as -1 or 99.99 we use the label “Missing” to fill the gaps, so that the machine learning model can distinguish between available values and missing values in the dataset.

When to use this technique?

• Only when the missing values are more than 10% in the data

• Only when data is Not Missing At Random

Advantages:

• Easy and simple to implement

• Can be handled with scikit-learn’s SimpleImputer with strategy = “constant” and fill_value = “Missing”

Disadvantages:

• Adds noise to the dataset, as we are not really imputing the missing values but creating a new category for them

• Does not give the best result
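A minimal sketch of the missing-indicator approach, again with an illustrative categorical column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"religion": ["A", "B", np.nan, "C", np.nan, "A"]})

# 1) pandas: treat "Missing" as its own category
df["religion_pandas"] = df["religion"].fillna("Missing")

# 2) scikit-learn: constant strategy with a string fill_value
imputer = SimpleImputer(strategy="constant", fill_value="Missing")
df["religion_sklearn"] = imputer.fit_transform(df[["religion"]]).ravel()
```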

2) Multivariate Imputation: In this approach, multiple features are taken into consideration. For instance, if a value is missing in column A, then columns B, C, D, and any other available columns help to impute the missing values.

We have 2 main techniques for multivariate imputation:

1) KNN Imputation

2) Iterative Imputation

I. K-Nearest Neighbors (KNN): This method uses the k-nearest-neighbors imputation technique to replace missing values by identifying the most similar rows in the dataset (the nearest neighbors). By default, it uses the Euclidean distance metric to find them.

Here, k is the number of neighbors to consider when replacing a missing value. For instance, if k = 3, the 3 most similar rows are identified and the missing value is imputed with the mean of the corresponding values in those 3 rows.

To learn more about KNN imputation, check out this video:

https://www.youtube.com/watch?v=T3c7JsLwgpE

When to use this technique?

• When more accuracy and high performance are needed

Advantages:

• More accurate method for smaller and medium size datasets

• More accurate than the mean and median methods

Disadvantages:

• More computation is needed, as we must calculate distances between each row with missing values and its nearest neighbors

• Memory heavy for deployment, as we need to store the original training set to extract values from and replace NA in coming observations
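A minimal scikit-learn sketch with made-up numeric columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "income": [40_000, 52_000, 61_000, np.nan, 58_000],
})

# Each missing value is replaced with the mean of that feature
# taken from the k nearest rows (Euclidean distance on the other features)
imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```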

II. Iterative Imputation: This is based on the MICE algorithm (Multivariate Imputation by Chained Equations), a well-known approach to filling missing values. In iterative imputation, missing values are filled one feature after another, allowing previously imputed values to be used as inputs of the model that predicts the next feature. As the name says, it is iterative because this process is repeated several times.

To learn more about Iterative Imputation, check out this video:

https://www.youtube.com/watch?v=m_qKhnaYZlc

When to use this technique?

• It is best used when data is Missing At Random (MAR); it can also be applied to Missing Not At Random (MNAR) and Missing Completely At Random (MCAR) data, but it works best on MAR

Advantage:

• It gives a more accurate result

Disadvantages:

• This method is slower than the simpler techniques

• Memory heavy for deployment, as we need to store the original training set to extract values from and replace NA in coming observations
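A minimal scikit-learn sketch using IterativeImputer, scikit-learn's MICE-inspired implementation (it is still flagged as experimental, so it needs the explicit enable import); the columns are made up for the example:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "income": [40_000, 52_000, 61_000, np.nan, 58_000],
})

# Each feature with missing values is modelled from the other features,
# and the round-robin predictions are repeated for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```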

Conclusion:

To sum up, several techniques are available to handle missing values, but we should first identify the reason the values are missing and then select an imputation technique accordingly, keeping its assumptions, advantages, and disadvantages in mind. I have discussed the most popular techniques here; however, other techniques are also available. Researchers keep publishing interesting new methods to impute missing values, and many Kagglers and engineers have used them to achieve higher accuracy. In real-world scenarios, though, other parameters also need to be considered, including speed, training time, and memory usage, when putting any imputation technique into production.

In this blog, I tried to explain the theory behind handling missing values. I hope you had fun reading it. If you have any questions, please let me know in the comment section. Meanwhile, keep reading and learning!
