Imputation Techniques for Numerical Features
Jatin Madan, Nik Bear Brown
ABSTRACT
One of the most important realizations of working with information is that data never comes neatly organized. Real-world data is invariably messy and requires significant clean-up before it can be used. This messiness is data noise.
Almost all datasets contain some noise, and the less noise there is, the better (“cleaner”) the dataset. Some kinds of noise are easier to correct than others; sometimes removing the noise completely is simply impossible and all you can do is resample or hope for the best, keeping in mind that when garbage goes in, garbage comes out.
This notebook explores a form of both noise and not-noise that’s often discussed, rarely quantified, and critical to understand: null data. We’ll dive into different types of imputation techniques for numerical features and discuss the advantages and disadvantages of techniques such as mean imputation, median imputation, hot-deck imputation, and k-nearest neighbors imputation. We will also discuss the impact of missing data on the accuracy and performance of machine learning models and evaluate the effectiveness of these imputation techniques on a sample dataset.
SO WHAT IS NULL DATA?
Let’s take an example. Suppose you had the following data, which contains a sample of columns from the NYPD Motor Vehicle Collisions dataset:
Each of the records corresponds to an accident reported and attended to by the NYPD at some location. Yet some values are missing from this data.
This is an example of missing data — data that we know exist, but which, due to sparse or incomplete data collection, we do not actually know the value of.
TYPES OF MISSING VALUES
The three types of missing data are:
Missing Completely At Random (MCAR): In this type of missing data, the probability of missing data is unrelated to both the observed and unobserved data. This means that the missingness is completely random and occurs purely by chance.
For example, participants in a study may fail to return a survey for reasons unrelated to the questions being asked, such as the survey being lost in the mail.
Missing At Random (MAR): In this type of missing data, the probability of missing data depends only on the observed data, and not the unobserved data. This means that the missingness is related to the observed data, but not to the unobserved data.
For example, in a medical study, patients who are older are more likely to drop out of the study, but their likelihood of dropping out is related only to their age, and not to any unobserved medical conditions.
Missing Not At Random (MNAR): In this type of missing data, the probability of missing data depends on the unobserved data itself. This means that the missingness is not random and is driven by the very values that are missing, or by some unknown factor that is not captured in the data.
For example, in a survey about income levels, participants who earn higher incomes are less likely to report their income, so the missingness depends on the very value that is missing.
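To make these three mechanisms concrete, here is a minimal, purely illustrative sketch (a synthetic age/income table built with NumPy and pandas; the columns and thresholds are assumptions for illustration) that injects each kind of missingness into the same variable:
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
demo = pd.DataFrame({'age': rng.integers(20, 80, size=1000),
                     'income': rng.normal(50_000, 15_000, size=1000)})
# MCAR: every value has the same 10% chance of being missing, by pure chance
mcar = demo['income'].mask(rng.random(1000) < 0.10)
# MAR: missingness depends only on an observed column (age), not on income itself
mar = demo['income'].mask((demo['age'] > 60) & (rng.random(1000) < 0.30))
# MNAR: missingness depends on the unobserved value itself (high earners hide income)
mnar = demo['income'].mask((demo['income'] > 70_000) & (rng.random(1000) < 0.50))
Under MCAR the remaining values are still a representative sample; under MAR the bias can be corrected using the observed age; under MNAR the observed incomes systematically understate the truth.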
WHY IS DATA MISSING FROM THE DATASET
The reason data is missing affects how missing values should be handled, so it is necessary to understand why the data could be missing.
Data can be missing from the dataset because of the following reasons:
- Past data might get corrupted due to improper maintenance.
- Observations are not recorded for certain fields, for example because of a failure to record values caused by human error.
- The user has not provided the values intentionally.
TYPES OF IMPUTATION METHODS FOR NUMERICAL FEATURES
1. USING MEAN VALUES
In this case, it is assumed that the data are missing completely at random (MCAR). All missing values are replaced with the mean of the non-missing values, which is a reasonable choice when the variable follows a roughly Gaussian distribution. If the data is skewed, i.e. a large number of data points act as outliers, a better approach is to replace the missing values with the median or mode.
Advantages:
- Easy and fast and works great with small numerical datasets
Disadvantages:
- Gives poor results for encoded categorical variables
- Low accuracy and doesn’t account for uncertainty
- Adversely affects covariance and correlation between features
2. USING MEDIAN VALUES
Median imputation replaces missing values with the median of the non-missing values in the feature. Median imputation is a robust method that is less sensitive to extreme values and outliers than mean imputation. However, median imputation can lead to biased imputations if the feature distribution is highly skewed.
Advantages:
- Robust to extreme values and outliers.
- Simple and easy to implement.
- Doesn’t require additional statistical assumptions.
Disadvantages:
- This can lead to biased imputations if the feature distribution is highly skewed.
3. USING MODE VALUES
In this case, missing values within each column are replaced with the most frequent value in that column (or, alternatively, with zero or some other constant).
Advantages:
- Works well with categorical features
Disadvantages:
- Can introduce bias into the dataset
- Doesn’t account for correlations between features
4. USING KNN
In this process, we use the k-nearest neighbors algorithm, which relies on ‘feature similarity’ to predict the value of a data point. K-NN imputation involves selecting a distance measure and the number of neighbors that contribute to each prediction. Missing values are imputed from the non-missing values of the k closest neighbors.
Advantage:
- Often more accurate than simple mean or median imputation
Disadvantages:
- Computationally expensive
- Sensitive to outliers
5. USING HOT DECK IMPUTATION
The missing value is replaced with an observed response from a similar unit.
Advantages:
- Suited for the MAR pattern
- Does not bias the marginal distribution
Disadvantages:
- Most commonly applied to categorical data
- Univariate imputation
6. USING DEEP LEARNING
This approach uses a deep-learning library, Datawig, which trains deep neural networks to impute missing values.
Advantages:
- Often more accurate than the simpler techniques above
- Can handle categorical as well as numerical features
- Supports CPUs and GPUs
Disadvantages:
- Single-column imputation
- Slow with large data
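As a rough sketch of how Datawig is typically used (the datawig package must be installed separately, and the predictor columns here are an illustrative assumption):
import datawig
import pandas as pd
df_dl = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
# train on rows where the target is observed, predict the rows where it is missing
df_train = df_dl[df_dl['fixed acidity'].notnull()]
df_miss = df_dl[df_dl['fixed acidity'].isnull()]
imputer = datawig.SimpleImputer(
    input_columns=['volatile acidity', 'citric acid', 'pH'],  # assumed predictor columns
    output_column='fixed acidity',                            # column to impute
    output_path='imputer_model'                               # where model artifacts are saved
)
imputer.fit(train_df=df_train)
predictions = imputer.predict(df_miss)  # returns a frame with a 'fixed acidity_imputed' column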
7. USING REGRESSION IMPUTATION
It uses a regression model to predict the missing values from the other variables. First, the model is fitted on the rows where the variable is observed; the fitted model then predicts the missing values, which are replaced by these predictions.
Advantage:
- Can preserve the distribution shape of the variable being imputed
Disadvantage:
- Imputed values fall exactly on the regression line, which understates the variable’s true variance
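A minimal sketch of regression imputation with scikit-learn’s LinearRegression, using the Wine Quality dataset introduced below (the predictor columns are an assumption; any well-observed, correlated features will do):
import pandas as pd
from sklearn.linear_model import LinearRegression
df_reg = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
predictors = ['volatile acidity', 'citric acid', 'pH']   # assumed predictor columns
# fit only on rows where the target and all predictors are observed
observed = df_reg['fixed acidity'].notnull() & df_reg[predictors].notnull().all(axis=1)
to_fill = df_reg['fixed acidity'].isnull() & df_reg[predictors].notnull().all(axis=1)
reg = LinearRegression().fit(df_reg.loc[observed, predictors],
                             df_reg.loc[observed, 'fixed acidity'])
df_reg.loc[to_fill, 'fixed acidity'] = reg.predict(df_reg.loc[to_fill, predictors])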
In this notebook, we’ll walk through some of these imputation methods on a real dataset and see how they can be used to impute numerical data.
ABOUT THE DATASET
The Wine Quality dataset contains information about various attributes of different types of wines, including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, and many more.
Missing data is a common problem in many datasets, and the Wine Quality dataset is no exception. We can explore the extent of missing data, apply various imputation techniques to fill in the missing values, and then evaluate and compare the performance of each method.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Imputation utilities from scikit-learn
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
df.info()           # column types and non-null counts
df.isnull().sum()   # number of missing values per column
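It can also help to look at missing values as a share of each column:
# percentage of missing values per column
(df.isnull().mean() * 100).round(2)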
VISUALIZING NUMERICAL FEATURES AND OUTLIERS
import seaborn as sns
# distributions of three numerical features
sns.histplot(df['pH'], kde=True)
plt.title('pH')
plt.show()
sns.histplot(df['fixed acidity'], kde=True)
plt.title('fixed acidity')
plt.show()
sns.histplot(df['volatile acidity'], kde=True)
plt.title('volatile acidity')
plt.show()
# boxplots to check for outliers
sns.boxplot(x=df['pH'])
plt.show()
sns.boxplot(x=df['fixed acidity'])
plt.show()
sns.boxplot(x=df['volatile acidity'])
plt.show()
1. MEAN IMPUTATION METHOD
Mean imputation replaces missing values in the “fixed acidity” feature with the mean of the non-missing values in that feature.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df1 = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
df1["fixed acidity"] = mean_imputer.fit_transform(df1["fixed acidity"].values.reshape(-1,1))[:,0]
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(data=df, x='fixed acidity', kde=True, ax=axs[0])
axs[0].set_title('Distribution of Fixed Acidity (Before Imputation)', fontsize=14)
axs[0].set_xlabel('Fixed Acidity', fontsize=12)
axs[0].set_ylabel('Frequency', fontsize=12)
sns.histplot(data=df1, x='fixed acidity', kde=True, ax=axs[1])
axs[1].set_title('Distribution of Fixed Acidity (After Mean Imputation)', fontsize=14)
axs[1].set_xlabel('Fixed Acidity', fontsize=12)
axs[1].set_ylabel('Frequency', fontsize=12)
plt.show()
Note: As we can see, “fixed acidity” has a roughly symmetric distribution, but the boxplots show outliers, so the mean imputation method is not recommended here. Since very few values are missing, mean imputation has little impact in this case; with more missing values and more outliers, it would distort the ML model. So for now, we’ll go with the median or mode imputation methods.
2. MEDIAN IMPUTATION METHOD
Median imputation replaces missing values in the “fixed acidity” feature with the median of the non-missing values in that feature.
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
df2 = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
df2["fixed acidity"] = median_imputer.fit_transform(df2["fixed acidity"].values.reshape(-1,1))[:,0]
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(data=df, x='fixed acidity', kde=True, ax=axs[0])
axs[0].set_title('Distribution of Fixed Acidity (Before Imputation)', fontsize=14)
axs[0].set_xlabel('Fixed Acidity', fontsize=12)
axs[0].set_ylabel('Frequency', fontsize=12)
sns.histplot(data=df2, x='fixed acidity', kde=True, ax=axs[1])
axs[1].set_title('Distribution of Fixed Acidity (After Median Imputation)', fontsize=14)
axs[1].set_xlabel('Fixed Acidity', fontsize=12)
axs[1].set_ylabel('Frequency', fontsize=12)
plt.show()
Note: This method is robust to extreme values or outliers and provides a reasonable estimate of the central tendency of the feature. However, median imputation may lead to biased imputations if the feature distribution is highly skewed.
3. MODE IMPUTATION METHOD
Mode imputation replaces missing values in the “fixed acidity” feature with the mode (i.e., most frequent value) of the non-missing values in that feature.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df3 = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
df3["fixed acidity"] = median_imputer.fit_transform(df3["fixed acidity"].values.reshape(-1,1))[:,0]
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(data=df, x='fixed acidity', kde=True, ax=axs[0])
axs[0].set_title('Distribution of Fixed Acidity (Before Imputation)', fontsize=14)
axs[0].set_xlabel('Fixed Acidity', fontsize=12)
axs[0].set_ylabel('Frequency', fontsize=12)
sns.histplot(data=df3, x='fixed acidity', kde=True, ax=axs[1])
axs[1].set_title('Distribution of Fixed Acidity (After Mode Imputation)', fontsize=14)
axs[1].set_xlabel('Fixed Acidity', fontsize=12)
axs[1].set_ylabel('Frequency', fontsize=12)
plt.show()
Note: Mode imputation is appropriate for categorical or nominal features, where the mean and median may not be meaningful. However, the “fixed acidity” feature is a continuous numerical feature, so mode imputation may not be appropriate.
4. K-NEAREST NEIGHBORS (KNN) IMPUTATION METHOD
KNN imputation replaces missing values in the “fixed acidity” feature with the average value of the K-nearest non-missing values in that feature. This method uses the values of the neighboring observations to impute the missing values, which can be helpful when there is some underlying structure or pattern in the data.
from sklearn.impute import KNNImputer
# create KNN imputer object with k=5
imputer = KNNImputer(n_neighbors=5)
df4 = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
# impute missing values in the fixed acidity feature
# note: with only one column, rows whose value is missing have no observed
# features, so KNNImputer falls back to the column mean for those rows
df4['fixed acidity'] = imputer.fit_transform(df4['fixed acidity'].values.reshape(-1,1))[:,0]
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(data=df, x='fixed acidity', kde=True, ax=axs[0])
axs[0].set_title('Distribution of Fixed Acidity (Before Imputation)', fontsize=14)
axs[0].set_xlabel('Fixed Acidity', fontsize=12)
axs[0].set_ylabel('Frequency', fontsize=12)
sns.histplot(data=df4, x='fixed acidity', kde=True, ax=axs[1])
axs[1].set_title('Distribution of Fixed Acidity (After KNN Imputation)', fontsize=14)
axs[1].set_xlabel('Fixed Acidity', fontsize=12)
axs[1].set_ylabel('Frequency', fontsize=12)
plt.show()
Note: Here, we created a KNN imputer object with n_neighbors=5, which means it will use the values of the 5 closest observations, as measured on the observed features, to impute a missing value. We then used the fit_transform() method to impute the missing values in the “fixed acidity” feature.
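Because KNN relies on other observed features to find similar rows, a more typical multi-feature variant looks like the sketch below (assuming the other numeric columns are largely complete; in practice features are often scaled first so no single column dominates the distances):
# impute all numeric columns jointly so neighbors are found by feature similarity
df4b = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
numeric_cols = df4b.select_dtypes(include=np.number).columns
df4b[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df4b[numeric_cols])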
5. HOT DECK IMPUTATION METHOD
Hot deck imputation replaces missing values in the “fixed acidity” feature with the value of a non-missing observation that is most similar to the missing observation. This method can be useful when there are clear patterns in the data, as it can help preserve the underlying structure of the feature.
df5 = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
# sort by a correlated, well-observed feature (pH here) so adjacent rows are similar;
# sorting by 'fixed acidity' itself would push all its NaNs to the end of the frame
df5 = df5.sort_values(by='pH')
# forward fill: each missing value borrows the value of the most similar preceding row
df5['fixed acidity'] = df5['fixed acidity'].ffill()
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(data=df, x='fixed acidity', kde=True, ax=axs[0])
axs[0].set_title('Distribution of Fixed Acidity (Before Imputation)', fontsize=14)
axs[0].set_xlabel('Fixed Acidity', fontsize=12)
axs[0].set_ylabel('Frequency', fontsize=12)
sns.histplot(data=df5, x='fixed acidity', kde=True, ax=axs[1])
axs[1].set_title('Distribution of Fixed Acidity (After Hot Deck Imputation)', fontsize=14)
axs[1].set_xlabel('Fixed Acidity', fontsize=12)
axs[1].set_ylabel('Frequency', fontsize=12)
plt.show()
Note: Hot deck imputation assumes that observations with missing values are similar to those with non-missing values, so it may not work well if there is a large amount of missing data or if the missing values are not missing at random.
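Another common variant is a random hot deck within imputation classes: each missing value is replaced by a value drawn at random from observed donors in a similar group. The sketch below groups on the 'quality' column, which is an illustrative choice of matching variable:
rng = np.random.default_rng(0)
df5b = pd.read_csv('https://raw.githubusercontent.com/madanjatin18/AED_Assignment-1/main/Wine_Quality.csv')
def random_hot_deck(group):
    donors = group.dropna().values                  # observed values in this class
    if len(donors) == 0 or not group.isnull().any():
        return group                                # nothing to fill, or no donors available
    draws = rng.choice(donors, size=group.isnull().sum())
    return group.fillna(pd.Series(draws, index=group.index[group.isnull()]))
df5b['fixed acidity'] = df5b.groupby('quality')['fixed acidity'].transform(random_hot_deck)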
CONCLUSION
In conclusion, data imputation for numerical features is an important step in the data preprocessing phase. It involves filling in missing values in a dataset using various methods such as mean, median, mode, K-nearest neighbors (KNN), and hot deck imputation as shown above.
The choice of imputation technique largely depends on the nature of the missing data and the underlying structure of the dataset.
It is important to carefully evaluate the performance of each imputation method and select the most appropriate method for the dataset at hand.
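One simple way to run such an evaluation is a masking experiment: hide a random 10% of the known “fixed acidity” values, impute them, and score each strategy by root-mean-square error. A minimal sketch:
from sklearn.metrics import mean_squared_error
eval_df = df.dropna(subset=['fixed acidity']).copy()
rng = np.random.default_rng(1)
mask = rng.random(len(eval_df)) < 0.10                 # hide 10% of the known values
true_vals = eval_df.loc[mask, 'fixed acidity'].to_numpy()
eval_df.loc[mask, 'fixed acidity'] = np.nan
for strategy in ['mean', 'median', 'most_frequent']:
    filled = SimpleImputer(strategy=strategy).fit_transform(eval_df[['fixed acidity']])[:, 0]
    rmse = mean_squared_error(true_vals, filled[mask]) ** 0.5
    print(f'{strategy}: RMSE = {rmse:.3f}')
The same loop extends naturally to the KNN and hot deck methods shown above; the method with the lowest error on the masked values is the best candidate for this dataset.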
LICENSE
Copyright 2023 Jatin Madan
All code in this notebook is available as open source through the MIT license.
All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/
These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.
Copyright 2023 AI Skunks https://github.com/aiskunks
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.