Missing Data, No Problem: Techniques for Handling Incomplete Datasets

Honzik J

Handling missing values

Missing values are a common problem in real-world datasets, and handling them is a crucial step in data pre-processing. They can occur due to various reasons, such as the structure and quality of the data, data entry errors, data loss during transmission, or incomplete data collection. These missing values can impact the accuracy and reliability of machine learning models, as they can introduce bias and skew your results. Some models don’t work at all with missing values. Therefore, handling missing values appropriately is essential before you build your model.

Techniques covered in this article

This article shows how to handle missing values at three difficulty levels and covers the following techniques:

  • Level 1: Deletion, Mean/Median/Mode imputation, Estimate using domain knowledge
  • Level 2: Regression imputation, K-Nearest Neighbours (KNN) imputation
  • Level 3: Multivariate imputation by chained equations (MICE), MICEforest

Let’s begin!

Check for missing values

First, we must check how many missing values we have in each feature. You may have already done this as part of your exploratory data analysis, but if not, it only takes a couple of lines of code.

Let's import pandas and read in our dataset. For the following examples, we will be using the wine quality dataset.

import pandas as pd

# Read in dataframe
df = pd.read_csv('Wine_Quality.csv')

We can then check for missing values with the following line of code.

# Check for missing values
df.isnull().sum()
Missing values for each feature

It’s that simple!
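If you prefer proportions over raw counts, one small extra check is to view the percentage of missing values per feature:

# Percentage of missing values per feature
(df.isnull().mean() * 100).round(2)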

We can view rows containing missing values in any feature using the following:

# Filter for any rows containing missing values in any column
df_filtered = df[df.isnull().any(axis=1)]

# Print the filtered dataframe
df_filtered.head()
Dataframe rows containing nulls in any feature

Now we can begin handling these missing values. There are several techniques available, so let's work through them level by level.

Level 1: Beginner

Deletion

The simplest approach is deletion. This involves removing the rows or columns (features) with missing values. This is usually done when the percentage of missing values is very small or when the missing values do not significantly impact the analysis or results.

To delete a row with missing values:

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Drop rows with null values
df_droprows = df.dropna()

# View nulls
df_droprows.isnull().sum()

We can also drop columns or features using the following:

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Drop all features with null values
df_dropcols = df.drop(columns=['type', 'fixed acidity', 'citric acid', 'volatile acidity', 'residual sugar', 'chlorides', 'pH', 'sulphates'])

# View nulls
df_dropcols.isnull().sum()
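If we'd rather not hard-code the column names, pandas can drop every column containing a null for us. Assuming the list above covers exactly the features with missing values, this one-liner gives the same result:

# Drop every column that contains at least one null value
df_dropcols = df.dropna(axis=1)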

Note that by dropping rows, we end up with a shorter dataset. When dropping features, we keep the full number of rows but lose entire columns.

# Compare shape of resulting dataframes when dropping rows vs dropping features
print("Shape when dropping rows: ", df_droprows.shape)
print("Shape when dropping features: ", df_dropcols.shape)
Comparison of shape when dropping rows or features

The method you choose depends on what features you need to use in your dataset and how many observations you’ll need. Neither of these methods is particularly sophisticated, and both lead to losing valuable data — so let’s explore some other options.

Mean/Median Imputation

The next possible method is replacing the missing values with the mean or median of the feature. In this case, we do not lose features or rows. This should only be used for numerical features (and if using the mean, we should make sure the feature is not heavily skewed and does not contain significant outliers).

To impute the missing value with mean:

# Fill missing values with the mean of each numeric column
# (numeric_only=True skips the categorical 'type' column)
df = df.fillna(df.mean(numeric_only=True))

Now let’s check one of the imputed values. Here is a row we know to have had a missing value for volatile acidity:

# Let's check a row with an imputed value for 'volatile acidity'
df[df.index==86]
Mean imputed value for ‘volatile acidity.’

To impute with median, we can use:

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Fill missing values with the median of each numeric column
df = df.fillna(df.median(numeric_only=True))

# Let's check a row with an imputed value for 'volatile acidity'
df[df.index==86]
Median imputed value for ‘volatile acidity.’

We can see that mean and median imputation produce different values for this row.

Mode Imputation

Like the above, this method replaces the missing values with the feature’s mode or most commonly occurring value. This should only be used for categorical features where one category is dominant.

Let’s apply this to our categorical feature — type. First, let’s check if one category is dominant. We can achieve this with the value_counts method:

# Read data
df = pd.read_csv('Wine_Quality.csv')

df['type'].value_counts()
Value counts for wine type

We can see that there is a dominant category (white). Therefore, we can impute with the mode as follows:

# Fill missing values in 'type' with its mode
# (mode() returns a Series, so we take the first element)
df['type'] = df['type'].fillna(df['type'].mode()[0])
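We can then verify that the 'type' feature no longer contains any nulls:

# Check for remaining missing values in 'type'
df['type'].isnull().sum()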

Using Scikit-Learn’s SimpleImputer class

Mean, median and mode imputation can also be performed using Scikit-learn’s SimpleImputer class. Set the strategy to either ‘mean’, ‘median’, or ‘most_frequent’.

from sklearn.impute import SimpleImputer

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Make sub dataframe with only numeric features
df_numeric = df.drop(columns='type')

# Create an instance of the SimpleImputer class
imputer_median = SimpleImputer(strategy='median')

# Fit the imputer to the data
imputer_median.fit(df_numeric)

# Transform the data
df_imputed_median = pd.DataFrame(imputer_median.transform(df_numeric), columns=df_numeric.columns)

# print the first 5 rows of the imputed data
df_imputed_median.head()

You can also set the strategy to ‘constant’ and specify a ‘fill_value’ to fill with a constant value:
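For example, here is a minimal sketch (reusing df_numeric from above) that fills every missing numeric value with 0; the fill value is arbitrary and chosen purely for illustration:

# Create an imputer that fills missing values with a constant
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)

# Transform the numeric features
df_imputed_constant = pd.DataFrame(imputer_constant.fit_transform(df_numeric), columns=df_numeric.columns)

df_imputed_constant.head()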

Advantages of mean/median/mode imputation:

  • Simple and fast to implement
  • It preserves the sample size, avoiding the data loss that deletion causes in downstream analyses, such as machine learning models.
  • It is less computationally expensive than more complex imputation methods.

Disadvantages of mean/median/mode imputation:

  • It does not account for the variability or distribution of the data, potentially resulting in imputed values that are not representative of the true values.
  • It can underestimate or overestimate missing values, particularly in datasets with extreme values or outliers.
  • It can lead to reduced variance and artificially inflated correlation coefficients in the imputed dataset.
  • It assumes that the missing values are Missing Completely at Random (MCAR), which may not always be the case (see level 2).

Estimate using domain knowledge

Another possible approach to dealing with missing data is to replace the missing values with estimates based on domain knowledge or business rules.

For example, let’s say you are analysing customer data for a retail company and notice that some of the customer records are missing their income information. You could replace these missing values with estimates based on the customer’s demographic information.

To do this, you could consult with subject matter experts, such as marketing analysts, who could provide insights into the typical income range for customers in each demographic category, allowing you to impute missing income values that are reasonable and plausible.

When replacing missing data with informed estimates, it is crucial to consult with subject matter experts who possess a deep understanding of the domain and the data to ensure it yields accurate and reliable imputations.
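As a rough sketch of how this might look in code, suppose the experts give us a typical income figure for each age group (the column names and numbers below are made up purely for illustration):

import pandas as pd

# Hypothetical customer data with missing income values
customers = pd.DataFrame({
    'age_group': ['18-30', '18-30', '31-50', '31-50', '51+'],
    'income': [28000, None, 52000, 55000, None]
})

# Typical income per age group, provided by subject matter experts (made-up figures)
expert_estimates = {'18-30': 30000, '31-50': 50000, '51+': 45000}

# Fill each missing income with the estimate for that customer's age group
customers['income'] = customers['income'].fillna(customers['age_group'].map(expert_estimates))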

Level 2: Intermediate

In this level, we will dive into some slightly more advanced techniques to fill in those missing values. Instead of relying on basic measures like the mean, median, or mode, we will take it up a notch and use a predictive model to tackle the issue. But before we can do that, we need to understand the nature of the missing values better.

Types of missingness

Before we proceed with more advanced techniques, let's consider the types of missingness we can encounter in a dataset. Understanding why values are missing helps determine the appropriate imputation method. Here are the common types:

  1. Missing Completely at Random (MCAR): In this type of missingness, the missing values are completely random, meaning that the probability of a value being missing does not depend on any variable, observed or unobserved. For example, if a survey respondent accidentally skips a question on a survey, this is MCAR.
  2. Missing at Random (MAR): In this type of missingness, the probability of a value being missing depends on observed variables but not on the missing values themselves. For example, if a survey respondent is less likely to answer a sensitive question, but the propensity to not answer the question is dependent on observable variables (such as age, gender, and education), then this is MAR.
  3. Missing Not at Random (MNAR): In this type of missingness, the probability of a value being missing depends on unobserved variables, including the missing values themselves. For example, if individuals with higher levels of depression are less likely to report their depression levels, and the propensity to not report is not observable in the data, then this is MNAR.

Regression Imputation

Here we will use a regression model to make an educated guess about those missing values. How? By analysing the other features in our dataset and using their correlation to fill in the gaps.

Regression imputation is most appropriate when the missingness is Missing at Random (MAR) or Missing Completely at Random (MCAR), and it works best when there is a strong correlation between features.

Let's give it a go. Since we will only be imputing numeric features, we'll create a version of the dataframe that excludes our categorical feature. Then we can fit a linear regression model for each column with a missing value. For this, we will need Scikit-learn's LinearRegression module.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Make sub dataframe with only numeric features
df = df.drop(columns='type')

# Separate the columns with missing values
missing_cols = df.columns[df.isna().any()].tolist()
non_missing_cols = list(set(df.columns) - set(missing_cols))

print(missing_cols)

# Loop over each column with missing values
for col in missing_cols:
    # Create a copy of the dataframe without missing values in the current column
    df_temp = df.dropna(subset=[col] + non_missing_cols)

    # Split the dataframe into features (X) and target variable (y)
    X = df_temp[non_missing_cols]
    y = df_temp[col]

    # Create and fit a linear regression model
    lr = LinearRegression()
    lr.fit(X, y)

    # Impute missing values in the current column using the fitted model
    df.loc[df[col].isna(), col] = lr.predict(df.loc[df[col].isna(), non_missing_cols])
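As before, let's check the row we know had a missing value for 'volatile acidity':

# Check the regression-imputed value for 'volatile acidity'
df[df.index==86]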
Regression imputation on ‘volatile acidity.’

Advantages of regression imputation:

  • It can handle a large number of missing values.
  • It can preserve the statistical properties of the dataset, such as the mean, variance, and correlation coefficients.
  • It can improve the accuracy of downstream analyses, such as machine learning models, by reducing bias and preserving the sample size.

Disadvantages of regression imputation:

  • It assumes a linear relationship between the missing variable and the observed variables.
  • It may introduce bias if the missing values are not Missing at Random (MAR) or Missing Completely at Random (MCAR).
  • It may not work well with categorical or ordinal variables.
  • It can be computationally expensive and time-consuming, especially with large datasets.

K-Nearest Neighbours (KNN) imputation

Another approach is to bring in a machine learning model like K-Nearest Neighbours (KNN) to help us estimate those missing values. This works similarly to regression imputation, simply with a different algorithm to predict the missing values.

Here we use Scikit-learn:

import pandas as pd
from sklearn.impute import KNNImputer

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Make sub dataframe with only numeric features
df = df.drop(columns='type')

# create a KNN imputer object
imputer = KNNImputer(n_neighbors=5)

# impute missing values using KNN
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
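Because KNN relies on distances between rows, features on larger scales can dominate the result. A common refinement is to scale the data before imputing and then map it back to the original scale; here is a minimal sketch of that idea:

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Read data and keep only the numeric features
df = pd.read_csv('Wine_Quality.csv')
df = df.drop(columns='type')

# Scale features so no single feature dominates the distance calculation
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Impute on the scaled data, then transform back to the original scale
imputer = KNNImputer(n_neighbors=5)
df_imputed_scaled = pd.DataFrame(imputer.fit_transform(df_scaled), columns=df.columns)
df = pd.DataFrame(scaler.inverse_transform(df_imputed_scaled), columns=df.columns)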

We can also do this using the fancyimpute package. First, install it:

pip install fancyimpute

Then run the imputation:

# Import the necessary libraries
import numpy as np
import pandas as pd
from fancyimpute import KNN

# Load the dataset
df = pd.read_csv('Wine_Quality.csv')

# Drop non-numeric features
df = df.drop(columns='type')

# Get list of columns with missing values
missing_cols = df.columns[df.isna().any()].tolist()

# Create an instance of the KNN imputer
imputer = KNN()

# Fit and transform the imputer to the dataset
imputed_array = imputer.fit_transform(df[missing_cols])

# Replace the missing values in the original dataset
df[missing_cols] = imputed_array

# View the imputed dataset
df

Advantages of KNN imputation:

  • It can capture complex non-linear relationships between variables.
  • It does not make assumptions about the distribution of the data or the correlation between variables.
  • It can be more accurate than simpler imputation methods, such as mean or median imputation, especially with small to medium-sized datasets.

Disadvantages:

  • It can be computationally expensive, especially with large datasets or high-dimensional data.
  • It can be sensitive to the choice of distance metric and the number of nearest neighbours selected, which can affect the accuracy.
  • It may perform poorly with highly skewed or sparse data.

Level 3: Advanced

Multivariate imputation by chained equations (MICE)

MICE is a commonly used method for imputing missing data. It works by replacing each missing value with a set of plausible values based on a model that takes into account the relationships between variables in the dataset.

The algorithm starts by creating a predictive model for each variable in the dataset, based on the other variables that are complete. The missing values for each variable are then imputed using the corresponding predictive model. This process is repeated multiple times, with each round of imputation using the previous round’s imputed values as if they were true, until convergence is reached.

The multiple imputed datasets are then combined to create a final dataset with imputed values for all missing data. MICE is a powerful and flexible method that can handle datasets with many missing values and complex relationships between variables. It has become a popular choice for imputing missing data in many fields, including social science, health research, and environmental science.

We can implement it using the fancyimpute package:

import numpy as np
import pandas as pd
from fancyimpute import IterativeImputer

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Drop the non-numeric 'type' column, since this implementation only handles numeric features
df = df.drop(columns='type')

# Get list of columns with missing values
missing_cols = df.columns[df.isna().any()].tolist()

# Create an instance of the MICE algorithm
imputer = IterativeImputer()

# Fit the imputer to the dataset
imputed_array = imputer.fit_transform(df[missing_cols])

# Replace the missing values in the original dataset
df[missing_cols] = imputed_array

# View the imputed dataset
df

Note that this implementation does not handle categorical variables; we dropped the 'type' column before imputing.

MICEforest imputation

Just as fun as it sounds… MICEforest is a variation of MICE (Multiple Imputation by Chained Equations) that uses LightGBM models, rather than regression, to impute missing values in a dataset.

We can implement it using the miceforest package:

pip install miceforest --no-cache-dir

or, with conda:

conda install -c conda-forge miceforest

import pandas as pd
import miceforest as mf

# Read data
df = pd.read_csv('Wine_Quality.csv')

# Convert type column to category (so that miceforest can handle as a categorical attribute rather than string)
df['type'] = df['type'].astype('category')

# Create an instance of the MICE algorithm
imputer = mf.ImputationKernel(data=df,
                              save_all_iterations=True,
                              random_state=42)

# Fit the imputer to the dataset. Set number of iterations to 3
imputer.mice(3, verbose=True)

# Generate the imputed dataset
imputed_df = imputer.complete_data()

# View the imputed dataset
imputed_df
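As a final sanity check, we can confirm that no missing values remain:

# Confirm that all missing values have been imputed
imputed_df.isnull().sum()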

Conclusion

We have explored some (but not all) of the common methods available to help us handle the problem of missing data. Ultimately, the choice of method will depend on the dataset, the amount of missing data, and the analysis goals. It is essential to carefully consider the potential impact of imputing missing data on the final results. Seeking input from subject matter experts and evaluating the quality of imputed data can help ensure the validity of subsequent analyses.

Dealing with missing data is a critical step in data analysis, and choosing a suitable imputation method can help us unlock the insights hidden within our data.
