Missing Values in Data Analysis: A Comprehensive Guide

Diogo Ribeiro
Data Science as a Better Idea
5 min read · Jul 10, 2023

Missing data, or missing values, are common occurrences in data science and business settings. These instances occur when no data or value is stored for a particular observation within a variable. Missing data can significantly affect the conclusions that can be drawn from the data, making it a critical factor to consider in data analysis. Incomplete data is an unavoidable issue when dealing with most data sources.

Why is Data Missing?

The causes of missing data can be diverse. Some instances of missing data may occur because the data was lost, forgotten, or not stored properly. In other cases, there might be no existing value for a certain observation or the value cannot be identified or known.

For instance, imagine that the data comes from a survey, and the data is entered manually into an online form. If a field in the form is not filled, it results in a missing value. Similarly, if a person chooses not to disclose certain information, like their income, it would result in a missing value. In other situations, a certain feature might not be calculated for a specific individual. If the individual has no income, the variable ‘total debt as a percentage of total income’ does not exist, resulting in a missing value.
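As a rough illustration, the three cases above (a blank field, an undisclosed income, and a ratio that does not exist) might look like this in pandas. The table is made up, and the column names `income`, `total_debt`, and `debt_to_income_pct` are assumptions chosen for the sketch, not fields from any real survey.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; column names and values are made up.
survey = pd.DataFrame({
    "respondent": ["a", "b", "c", "d"],
    "income": [52000.0, np.nan, 0.0, 31000.0],     # b declined to disclose income
    "total_debt": [13000.0, 4000.0, 0.0, np.nan],  # d left the field blank
})

# The ratio is undefined when income is zero, so treat zero income as
# missing for this calculation rather than dividing by zero.
income = survey["income"].where(survey["income"] != 0)
survey["debt_to_income_pct"] = 100 * survey["total_debt"] / income

# Count and share of missing values per column.
missing_count = survey.isna().sum()
missing_share = survey.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Counting missing values per column like this is usually a sensible first step before deciding how to treat them.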

Understanding the source of missing data is critical as it helps determine how to process the missing values and may provide insights into how to control the source of missing data in the future.

Missing Data Mechanisms

Three main mechanisms lead to missing data: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).

Missing Completely at Random (MCAR): When data is MCAR, there is no relationship between the missing data and any other values within the dataset. In other words, the missing data points are a random subset of the data, and there’s nothing systematic that makes some data more likely to be missing than others.

Missing at Random (MAR): MAR occurs when the probability that a value is missing depends on other observed variables, but not on the missing value itself. For instance, if men are more likely to disclose their weight than women, and gender is recorded, weight data is MAR: within each gender, whether a weight is missing is unrelated to the weight.

Missing Not at Random (MNAR): MNAR occurs when the probability that a value is missing depends on the missing value itself or on information not recorded in the dataset. For instance, if people with high levels of depression were less likely to fill in a depression survey, the missing data is MNAR.

Understanding the mechanism of missing data is key to deciding which methods to use to handle missing values.
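One way to build intuition for the three mechanisms is to simulate them. The sketch below uses synthetic weight data; the group means, missingness probabilities, and the logistic form used for MNAR are all assumptions picked for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: weight depends on gender; all numbers are made up.
is_female = rng.random(n) < 0.5
weight = np.where(is_female, 65.0, 80.0) + rng.normal(0, 10, n)
df = pd.DataFrame({"is_female": is_female, "weight": weight})

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends only on an observed column (gender).
mar_mask = rng.random(n) < np.where(is_female, 0.50, 0.10)

# MNAR: missingness depends on the value itself (heavier -> more often missing).
p_mnar = 1 / (1 + np.exp(-(weight - 72.5) / 5))
mnar_mask = rng.random(n) < p_mnar

# Compare the mean of the values that remain observed with the true mean.
true_mean = df["weight"].mean()
observed_means = {
    name: df.loc[~mask, "weight"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}
print(f"true mean: {true_mean:.1f}")
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.1f}")

# Under MAR, conditioning on the observed variable removes the distortion:
# the per-gender means of the observed weights match the full-data means.
mar_by_gender = df[~mar_mask].groupby("is_female")["weight"].mean()
full_by_gender = df.groupby("is_female")["weight"].mean()
print(pd.DataFrame({"observed": mar_by_gender, "full": full_by_gender}))
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means drift. The MAR drift disappears once the data is split by gender; the MNAR drift cannot be corrected from the observed columns alone.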

Real-life examples of Missing Values in Data

Predicting Survival on the Titanic

The Titanic dataset is a widely known example of a dataset with missing values, with gaps in variables such as ‘Cabin’. By analyzing the probability of survival based on attributes like gender, age, and social status, predictions can be made about which passengers would have survived. Some groups were more likely to survive, revealing society’s priorities and privileges at the time.

Peer-to-peer lending: Finance

The Lending Club dataset is another instance where missing values play a significant role. The dataset contains complete loan data for all loans issued from 2007 to 2015, including loan status and payment information. However, missing data can occur in areas like credit scores, finance inquiries, addresses, and collections.

To work with these datasets, you can download the Titanic dataset here and the Lending Club dataset here.

Understanding and handling missing values is crucial in data analysis. The three mechanisms of missing data (MCAR, MAR, and MNAR) provide a roadmap for dealing with missing values, enabling more accurate and comprehensive analyses. As the field of data science continues to evolve, so too will the techniques and strategies for handling missing values.

How to Handle Missing Values

Identifying the mechanism of missing data is a critical step towards deciding how to manage those missing values. Depending on the underlying mechanism, different strategies can be employed.

1. Deletion: If the data is Missing Completely at Random (MCAR), one way to handle missing values is to drop the cases that contain them. Even under MCAR, this shrinks the sample and reduces statistical power. If the data is not MCAR, deletion can also produce biased estimates and lower accuracy.

2. Imputation: This involves replacing missing data with substituted values. These could be the mean or median for numerical data and the mode (the most common category) for categorical data. Filling every gap with a single value understates the variability of the data and can introduce bias, so it is essential to understand the nature of the missing data and the assumptions about its distribution before proceeding with imputation.

3. Prediction Models: Regression, machine learning, or deep learning models can be used to predict and replace missing values based on other data. This is a more sophisticated way to handle missing data but might also lead to overfitting if not appropriately applied.

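4. Multiple Imputation: This involves creating multiple filled-in copies of the dataset, each with different plausible values for the missing entries, analyzing each one separately, and then combining the results. Because the spread across copies reflects the uncertainty about the imputed values, this method can provide more accurate estimates and confidence intervals than simple imputation.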
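Strategies 1, 2, and 4 can be sketched on synthetic data as follows. The `age` and `income` columns and all numbers are assumptions. The multiple imputation part is deliberately simplified: it draws imputations from a single fitted regression plus noise and pools the results with Rubin's rules, whereas a full implementation would also account for uncertainty in the regression coefficients.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic data; column names and values are made up for illustration.
n = 500
age = rng.normal(40, 12, n)
income = 1000 * age + rng.normal(0, 8000, n)
df = pd.DataFrame({"age": age, "income": income})
df.loc[rng.random(n) < 0.25, "income"] = np.nan  # MCAR: about 25% missing

# 1. Deletion: drop every row with a missing value (listwise deletion).
deleted = df.dropna()

# 2. Simple imputation: replace missing incomes with the observed median.
median_filled = df.assign(income=df["income"].fillna(df["income"].median()))

# 4. Multiple imputation (simplified): draw each missing income from a
#    regression on age plus random noise, repeat m times, then pool.
observed = df.dropna()
slope, intercept = np.polyfit(observed["age"], observed["income"], deg=1)
residuals = (observed["income"] - (slope * observed["age"] + intercept)).to_numpy()
resid_sd = residuals.std(ddof=2)
missing = df["income"].isna()
predicted = (slope * df.loc[missing, "age"] + intercept).to_numpy()

m = 20
means, variances = [], []
for _ in range(m):
    filled = df["income"].copy()
    filled.loc[missing] = predicted + rng.normal(0, resid_sd, len(predicted))
    means.append(filled.mean())
    variances.append(filled.var(ddof=1) / n)  # squared standard error of the mean

# Rubin's rules: pooled estimate, within- plus between-imputation variance.
pooled_mean = np.mean(means)
within = np.mean(variances)
between = np.var(means, ddof=1)
pooled_se = np.sqrt(within + (1 + 1 / m) * between)

print(f"rows after deletion: {len(deleted)} of {n}")
print(f"income std, deletion: {deleted['income'].std():.0f}")
print(f"income std, median-filled: {median_filled['income'].std():.0f}")
print(f"multiple imputation mean: {pooled_mean:.0f} (SE {pooled_se:.0f})")
```

The median-filled column has a visibly smaller standard deviation than the observed incomes, which is the variance understatement that single-value imputation causes. The pooled standard error from multiple imputation adds the between-imputation spread on top of the usual sampling error.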

Case Studies on Handling Missing Data

Titanic Dataset:

For instance, in the Titanic dataset, there are missing values for the variable ‘Cabin’. If we assume that the data is MAR and that missingness is related to the class of the ticket (Pclass), we could fill in these missing values based on the class. For example, fill in ‘C’ for passengers in first class, ‘E’ for those in second class, and ‘F’ for those in third class. This is an instance of the imputation method.
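A minimal sketch of this class-based fill, using a tiny made-up sample rather than the real file. `Pclass` and `Cabin` are the column names mentioned above; the rows and the extra `Cabin_was_missing` flag are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample in the shape of the Titanic data (Pclass, Cabin).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Cabin": ["C85", np.nan, np.nan, np.nan, "G6"],
})

# Fill value per ticket class, as described in the text above.
deck_by_class = {1: "C", 2: "E", 3: "F"}

# Keep a flag so later analysis can tell imputed values from recorded ones.
titanic["Cabin_was_missing"] = titanic["Cabin"].isna()

# Fill only the missing cabins; recorded cabins are left untouched.
titanic["Cabin"] = titanic["Cabin"].fillna(titanic["Pclass"].map(deck_by_class))
print(titanic)
```

Keeping the indicator column is a design choice: it preserves the information that a value was originally absent, which can itself be predictive.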

Lending Club Dataset:

In the Lending Club dataset, a variable like ‘income’ might have missing values. If we consider the data to be MAR, we could use regression or machine learning models to predict the income based on other variables like employment history, education, and location.
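The regression approach might be sketched as below. This uses synthetic data rather than the real Lending Club file; the column names `years_employed`, `has_degree`, and `income_imputed`, as well as the coefficients that generate the data, are hypothetical. The model is plain ordinary least squares via NumPy, standing in for whatever regression or machine learning model is actually chosen.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for loan data; column names and values are hypothetical.
n = 300
loans = pd.DataFrame({
    "years_employed": rng.integers(0, 30, n).astype(float),
    "has_degree": rng.integers(0, 2, n).astype(float),
})
loans["income"] = (
    30000 + 1500 * loans["years_employed"] + 12000 * loans["has_degree"]
    + rng.normal(0, 5000, n)
)
# MAR: applicants without a degree are more likely to have income missing.
p_missing = np.where(loans["has_degree"] == 1, 0.1, 0.4)
loans.loc[rng.random(n) < p_missing, "income"] = np.nan

features = ["years_employed", "has_degree"]
known = loans["income"].notna()

def design_matrix(frame):
    """Feature columns plus a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])

# Ordinary least squares fitted on the rows where income is observed.
coef, *_ = np.linalg.lstsq(
    design_matrix(loans[known]), loans.loc[known, "income"].to_numpy(), rcond=None
)

# Predict income for the rows where it is missing and fill them in.
loans["income_imputed"] = loans["income"]
loans.loc[~known, "income_imputed"] = design_matrix(loans[~known]) @ coef
print(loans.head())
```

Because each imputed value sits exactly on the fitted line, this approach also understates variability; adding noise to the predictions or using multiple imputation addresses that.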

Conclusion

Handling missing values is an essential part of the data cleaning process in data science. Understanding why data is missing and identifying the mechanism by which it is missing helps in deciding the most appropriate method to handle such missing values.

These methods, though not perfect, aim to make the best use of the available data. It’s crucial to keep in mind that every dataset and missing value scenario is different. Thus, the methods should be tailored according to the nature of the data, the extent of the missing data, and the analysis goals.

In an era where data-driven decisions are becoming increasingly important, dealing with missing data effectively is a critical skill for every data scientist.
