How to Handle Missing Values Systematically in a Dataset

A Comprehensive Flowchart Guide

Richard Warepam
ILLUMINATION
7 min read · Mar 17, 2024

Data, like a story, unfolds its tale, when missing values, we systematically unveil.

Handling missing values might seem like a minor part of your data preparation process, but it’s actually a critical step that can significantly impact your model’s performance or your analysis.

As a beginner in data science, I didn’t pay much attention to this task. Like many of us, I was somewhat oblivious.

I used to apply certain functions that we’re typically taught in most courses, whether it’s in Python, SQL, or any other language, without fully understanding them.

Unfortunately, no one ever taught us a “systematic” approach to handling these missing values.

It might appear straightforward, but gaining a deeper understanding of this process can clarify your thinking and even boost your confidence in your work.

So, are you ready to dive in?

Are there missing values in our dataset?

Firstly, once we’ve loaded our dataset, it’s essential to check for missing values. This can be done using Python, SQL, or any other language you typically use for your data science projects.
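
As a minimal sketch of this check in Python, assuming the data sits in a pandas DataFrame (the toy columns below are hypothetical and only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; np.nan / None mark the missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [52000, 61000, np.nan, 88000, 45000],
    "city": ["Pune", "Delhi", None, "Imphal", "Delhi"],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Single yes/no answer: is anything missing anywhere in the dataset?
has_missing = bool(df.isna().any().any())

print(missing_per_column)
print("Any missing values?", has_missing)
```

The per-column counts tell you where the gaps are, and the single boolean answers the yes/no question in the flowchart.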

After performing this check, you’ll find one of two outcomes: either yes, there are missing values, or no, the dataset is free from missing values.

If the answer is: No.

If there are no missing values, we can move straight on to other pre-processing tasks such as data integration, data transformation, or data selection.

In this manner, the data is pre-processed and ready for any task, whether it’s building models or analyzing the data.

Now comes the main point of the article. If the answer is: Yes.

Pay close attention. The moment you discover missing values in your dataset, determine the nature of the missing data.

There are three types of missing data:

  1. NMAR — Not Missing at Random
  2. MCAR — Missing Completely at Random
  3. MAR — Missing at Random

If you’re familiar with these, you might also know how to handle them. Each type of missing data has its own method of handling it.

But if you’re like me when I started my data science journey, let me explain these to you:

What is NMAR?

  • This is when the missingness is related to the value that would have been recorded. It might sound confusing, but the simplest way to understand this is that missing data are NMAR if, even given all the observed information, the probability of missingness depends on the unobserved missing values themselves. For example, people with very high incomes may be less likely to report their income.

Now, what is MAR?

  • In this case, missing data are MAR if the probability of missingness is independent of the missing values given the observed data. In other words, under MAR, how likely a value is to be missing can be estimated based on the non-missing data.

Lastly, what is MCAR?

  • Here, missing data are MCAR if the probability of missingness is independent of the data. In other words, the data are MCAR if the reason for missing values in the outcome or predictors has nothing to do with the data values themselves, whether observed or missing. This type of missing data is rare in real-world datasets.
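
To make the three definitions above more concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the `age` and `income` columns, the missingness probabilities, and the thresholds are all made up. It removes income values under each mechanism and compares the mean of what is left with the true mean.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
age = rng.normal(40, 10, size=n)                    # always observed
income = 1000 * age + rng.normal(0, 5000, size=n)   # the column that goes missing

# MCAR: every income has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of missingness depends only on the observed column (age).
mar_mask = rng.random(n) < np.where(age > 40, 0.5, 0.1)

# NMAR: the chance of missingness depends on the income value itself.
nmar_mask = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_mask].mean(),
    "MAR": income[~mar_mask].mean(),
    "NMAR": income[~nmar_mask].mean(),
}

print("True mean income:", round(true_mean))
for name, value in observed_means.items():
    print(name, "mean of observed income:", round(value))
```

In this setup, the MCAR mean stays close to the true mean, while the MAR and NMAR means drift away from it. The difference between the last two is that under MAR the drift can be explained by the observed `age` column, whereas under NMAR nothing in the observed data reveals it.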

If you find these types of missing data difficult to understand, let me know in the comments. I will write an article on it according to the response.

Now that you know the types of missing data you may find in your dataset, are you ready to handle them?

Handling the missing data.

In my experience (not an expert, but seasoned), there are three strategies to tackle missing values. They are:

  1. Deletion: This involves eliminating data points or features with missing values.
  2. Imputation: This involves filling in missing values using statistical methods or machine learning algorithms.
  3. Algorithm Modification: In this approach, we use algorithms that inherently handle the missing values.

Now, let’s address each type of missing data.

1. NMAR (Not Missing at Random)

If the missing value is “NMAR,” it’s important to note that the missingness depends on the missing value itself. Therefore, we need to exercise extra caution when filling in these missing values.

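  • So, in this flowchart, the option for NMAR is to delete the missing data, and the type of deletion to use is “listwise deletion.” Because the missingness depends on the missing values themselves, the records that remain may not represent the full population, so interpret the results with that in mind.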

In listwise deletion, we eliminate any data record that contains a missing value.
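
A minimal pandas sketch of listwise deletion, using a hypothetical toy DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with gaps in two columns.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "income": [52000, 61000, np.nan, 88000],
})

# Listwise deletion: keep only the rows with no missing value in any column.
complete_cases = df.dropna()

print(f"Kept {len(complete_cases)} of {len(df)} rows")
```

It is worth printing how many rows survive, since listwise deletion can shrink a dataset quickly when the gaps are spread across many columns.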

2. MAR (Missing at Random)

If the missing value is “MAR,” it signifies that the missingness follows a pattern and depends on the observed non-missing data. Here, we can address these through deletion, imputation, and algorithms.

Deletion:
To eliminate these missing values, we need to employ “listwise deletion.”

Imputation:
If you wish to impute these missing values, you can use “regression imputation” or “mean/median imputation.”

  • “Regression imputation” is the method where we predict the missing values using regression analysis on the observed variables.
  • “Mean/median imputation” is the process of substituting the missing values with the mean or median of the observed data.
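
Both imputation methods can be sketched in a few lines. This is only an illustration under assumptions: the `years_experience` and `salary` columns are hypothetical, and the regression is a simple straight-line fit with `np.polyfit` rather than a full modelling workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: salary has two gaps, years_experience is complete.
df = pd.DataFrame({
    "years_experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "salary": [30.0, 35.0, np.nan, 45.0, np.nan, 55.0],
})

# Mean / median imputation: one constant fills every gap.
mean_imputed = df["salary"].fillna(df["salary"].mean())
median_imputed = df["salary"].fillna(df["salary"].median())

# Regression imputation: fit salary ~ years_experience on the observed rows...
observed = df["salary"].notna()
slope, intercept = np.polyfit(
    df.loc[observed, "years_experience"].to_numpy(),
    df.loc[observed, "salary"].to_numpy(),
    deg=1,
)
# ...then fill each gap with the prediction for that row.
predicted = intercept + slope * df["years_experience"]
regression_imputed = df["salary"].fillna(predicted)

print(pd.DataFrame({
    "mean": mean_imputed,
    "median": median_imputed,
    "regression": regression_imputed,
}))
```

Mean and median imputation give every missing row the same value, while regression imputation gives each row a value that follows its other columns.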

Algorithms:
Finally, if you wish to handle the missing values using algorithms, you can use advanced methods like “Expectation maximization” and “Maximum likelihood.”

If you’re unfamiliar with these terms:

  • Expectation maximization is an iterative method that alternates between estimating the missing values from the current parameter estimates and updating those parameters by maximizing the likelihood function, and
  • Maximum likelihood is the method of using likelihood-based statistical methods to handle the missing data without imputing it.
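
Here is a rough sketch of how expectation maximization can produce maximum likelihood estimates without keeping any imputed values. It rests on strong assumptions that are mine, not part of the flowchart: two variables that are jointly normal, `x` fully observed, `y` partly missing, and a hypothetical helper named `em_bivariate_normal`.

```python
import numpy as np

def em_bivariate_normal(x, y, n_iter=200):
    """EM for a bivariate normal where x is complete and y contains NaNs.

    Returns the estimated mean of y and the 2x2 covariance matrix.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)

    # x is fully observed, so its mean and variance never change.
    mu_x, s_xx = x.mean(), x.var()
    # Start the y-related parameters from the complete cases.
    mu_y, s_yy = y[~miss].mean(), y[~miss].var()
    s_xy = np.mean((x[~miss] - x[~miss].mean()) * (y[~miss] - mu_y))

    for _ in range(n_iter):
        # E-step: expected y and y^2 for the missing rows, given x
        # and the current parameter estimates.
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        ey = np.where(miss, mu_y + slope * (x - mu_x), y)
        ey2 = np.where(miss, ey ** 2 + resid_var, y ** 2)
        # M-step: re-estimate the parameters from those expected statistics.
        mu_y = ey.mean()
        s_xy = np.mean(x * ey) - mu_x * mu_y
        s_yy = ey2.mean() - mu_y ** 2

    return mu_y, np.array([[s_xx, s_xy], [s_xy, s_yy]])


# Simulated data (made up for this sketch): y is missing whenever x > 0.5,
# so the missingness depends only on the observed x, which is the MAR case.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 + 3 * x + rng.normal(size=500)
y_obs = np.where(x > 0.5, np.nan, y)

mu_y_em, cov_em = em_bivariate_normal(x, y_obs)
print("Complete-case mean of y:", np.nanmean(y_obs))
print("EM estimate of the mean of y:", mu_y_em)
```

In this simulation the complete-case mean of `y` is pulled down because the rows with large `x` are the ones that lost their `y`, while the EM estimate uses the relationship with `x` to correct for that. For real projects you would normally rely on a tested library rather than a hand-written loop like this one.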

3. MCAR (Missing Completely at Random)

For this final type of missing value, we can employ all three methods, similar to MAR: deletion, imputation, and algorithms.

Deletion Methods:
For MCAR-type missing values, we have two deletion methods at our disposal.

  • The first is listwise deletion, where we eliminate any data record that contains a missing value.
  • The second is “Pairwise deletion”, where we utilize all the cases in which the variables of interest are present, disregarding the cases where these variables are missing.
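
The difference between the two deletion methods shows up clearly when computing correlations. In pandas, `DataFrame.corr()` excludes missing values pair by pair, so it can serve as a small sketch of pairwise deletion; the toy columns `a`, `b`, and `c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: each column is missing a different row.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "b": [2.0, 4.1, np.nan, 8.2, 9.9, 12.0],
    "c": [5.0, np.nan, 3.5, 2.0, 1.0, 0.5],
})

# Listwise deletion: only rows complete in every column are used.
listwise_corr = df.dropna().corr()

# Pairwise deletion: each pair of columns uses all rows where both are present.
pairwise_corr = df.corr()

# How many rows each pair of columns actually shares.
present = df.notna().astype(int)
pair_counts = present.T @ present

print("Rows left after listwise deletion:", len(df.dropna()))
print(pair_counts)
```

Pairwise deletion keeps more data per calculation, but each entry of the correlation matrix may then be based on a different subset of rows.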

Imputation Methods:
For MCAR-type missing values, we have three imputation methods.

  • We can perform the most commonly used method, mean/median imputation. Here, we replace the missing values with the mean or median of the observed values (non-missing ones).
  • The second option is regression imputation; we can also use regression analysis to predict the values of the missing ones and fill them in.
  • Lastly, there is “Hot/Cold Deck Imputation.” In hot deck imputation, we fill in the missing values by sampling from the observed values of similar records in the same dataset; in cold deck imputation, the replacement values come from an external source, such as an earlier survey.
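
A minimal hot deck sketch, assuming that “similar records” simply means records in the same group. The `city` and `rent` columns and the `hot_deck` helper are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy dataset: rent is missing once in each city.
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Delhi", "Pune", "Pune", "Pune"],
    "rent": [20.0, np.nan, 22.0, 12.0, 13.0, np.nan],
})

def hot_deck(values: pd.Series) -> pd.Series:
    """Fill NaNs with values drawn at random from the group's observed entries."""
    donors = values.dropna().to_numpy()
    filled = values.copy()
    missing = filled.isna()
    # Only impute when the group has at least one donor and one gap.
    if len(donors) > 0 and missing.any():
        filled.loc[missing] = rng.choice(donors, size=int(missing.sum()))
    return filled

# Each missing rent borrows a value from another record in the same city.
df["rent_imputed"] = df.groupby("city")["rent"].transform(hot_deck)
print(df)
```

Grouping by city is only one way to define similarity; in practice the donor pool is often chosen by matching on several columns.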

Advanced Algorithmic Methods:
There are three ways we can handle missing values using algorithms.

  • The first option is Expectation maximization, where we use an iterative process that estimates the missing values based on maximizing the likelihood function.
  • The second is Maximum likelihood, where we use statistical methods to estimate parameters and handle the missing data.
  • Lastly, there is “Case Substitution.” This is used only when prior knowledge is available. Here, we substitute the missing cases with similar observed cases, chosen by specific criteria based on that prior knowledge.

All these methods of handling MCAR-type missing values can also be used for MAR-type missing values.

I didn’t list all of them in the MAR section because I wanted to highlight only the best options for MAR.

Wrapping things up:

In this article, I’ve walked you through a systematic process for handling missing values in a dataset, based on the type of missingness.
