HANDLING MISSING DATA

Amshalakoushal · Published in AI Skunks · Mar 11, 2023 · 8 min read

Modern businesses and research rely heavily on data, yet a complete, error-free dataset is a rarity. Missing data is a common issue that can significantly affect the accuracy and reliability of results. Fortunately, there are various techniques for managing missing data and improving the quality of an analysis. This article explores some of these techniques and offers suggestions to data analysts on how to handle missing data efficiently.

What is Missing Data?

Missing data refers to any value that is absent from a dataset. There are several potential causes, including human error, incorrect data entry, and system malfunctions. The absence of an observation in a column indicates missing data, which often appears as placeholder values such as “0”, “NA”, “NaN”, “NULL”, “Not applicable”, or “None”.

For instance, if you’re analyzing a customer survey, some respondents may have skipped questions, or their responses might have been lost during data collection.

APPROACHES TO HANDLING MISSING DATA

In data analysis, handling missing data is crucial because it can affect the validity and accuracy of the findings. The first step is to examine the data carefully and identify all the missing values. In pandas, functions such as isnull() and notnull() report, element by element, whether a value is missing.
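As a minimal sketch of this first step (the small DataFrame and its column names below are invented for illustration):

import pandas as pd
import numpy as np

# A tiny toy frame with a couple of missing entries
toy = pd.DataFrame({"age": [25, np.nan, 31],
                    "city": ["Boston", "Austin", None]})

# isnull() is True where a value is missing; notnull() is its complement
print(toy.isnull())
print(toy.isnull().sum())   # missing values per column
print(toy.notnull().sum())  # observed values per column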

When dealing with missing data, it is crucial to examine the impact of the chosen technique on the analysis. It is also vital to keep track of the quantity and proportion of missing data for each variable, because this can influence which technique is selected. Keep in mind that missing data does not always need to be filled in: when there are only a few missing data points, it is sometimes sufficient to exclude the offending rows and columns from the analysis.

The choice of technique will depend on the nature of the data and the goals of the analysis. Some common techniques include:

DELETION: This method involves removing any observations or variables that contain missing values. There are different types of deletion methods:

List-wise Deletion: This method involves removing all cases that have missing data on any variable included in the analysis. This means that if a single missing value exists in any variable, the entire observation will be removed from the data set. List-wise deletion ensures that only complete cases are analyzed. However, it can result in a loss of valuable information and reduction in sample size, which can impact the statistical power of the analysis.
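Here is a minimal sketch of list-wise deletion using pandas’ dropna(); the toy DataFrame is hypothetical:

import pandas as pd
import numpy as np

df_toy = pd.DataFrame({"a": [1, np.nan, 3],
                       "b": [4, 5, np.nan],
                       "c": [7, 8, 9]})

# List-wise deletion: drop every row that has at least one missing value
complete_cases = df_toy.dropna()
print(complete_cases)  # only the fully observed first row remains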

Pair-wise Deletion: Pairwise deletion, sometimes referred to as available-case analysis, handles missing data by using, for each statistic, only the cases that have complete data on the variables that statistic requires. Because it does not demand complete data on every variable in the dataset, it is less restrictive than list-wise deletion.
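pandas applies this idea in several places by default. As a sketch with the same kind of toy data, corr() computes each pairwise correlation from only the rows that are complete for that particular pair, so no row is discarded from the dataset as a whole:

import pandas as pd
import numpy as np

df_toy = pd.DataFrame({"a": [1, np.nan, 3, 4],
                       "b": [4, 5, np.nan, 6],
                       "c": [7, 8, 9, 10]})

# Each pairwise correlation uses the rows complete for that pair only
print(df_toy.corr())
print(df_toy["a"].mean())  # the mean of "a" uses every observed value of "a"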

Dropping variables: This refers to the practice of removing entire columns or variables from a dataset that have a high percentage of missing values. This is a common data preprocessing technique that is used to clean up and prepare datasets for analysis.
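A minimal sketch, assuming a hypothetical cutoff of 50% missing (the right threshold depends on your data and goals):

import pandas as pd
import numpy as np

df_toy = pd.DataFrame({"keep": [1.0, 2.0, 3.0, 4.0],
                       "mostly_nan": [np.nan, np.nan, np.nan, 4.0]})

# Drop every column whose fraction of missing values exceeds the threshold
threshold = 0.5
missing_frac = df_toy.isnull().mean()
df_toy = df_toy.drop(columns=missing_frac[missing_frac > threshold].index)
print(df_toy.columns.tolist())  # ['keep']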

Here is the Melbourne Housing dataset:

import pandas as pd

# Load the dataset
df = pd.read_csv("/content/Melbourne_housing_FULL.csv")

# Check the number of null values in each column
print(df.isnull().sum())
Suburb               0
Address              0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
YearBuilt        19306
CouncilArea          3
Lattitude         7976
Longtitude        7976
Regionname           3
Propertycount        3
dtype: int64

As we can see from the output, several columns have missing values. Let’s analyze each column and decide how to handle the missing values.

Price: This column is the target variable, so we should not impute it. Instead, we remove the rows where the price is missing.

df = df.dropna(subset=['Price'])

Bedroom2, Bathroom, Car, Landsize, BuildingArea, YearBuilt, Lattitude, Longtitude: These columns have a large number of missing values. We cannot remove all these rows as it will significantly reduce the dataset size. Instead, we can impute missing values. For example, we can use the mean value to impute the missing values in the numerical columns and use the mode value to impute the missing values in categorical columns.

# Impute missing values with the mean for numerical columns
# (assignment is used instead of inplace=True to avoid chained-assignment issues)
for col in ['Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea',
            'YearBuilt', 'Lattitude', 'Longtitude', 'Distance', 'Postcode']:
    df[col] = df[col].fillna(df[col].mean())

# Impute missing values with the mode for categorical columns
df['CouncilArea'] = df['CouncilArea'].fillna(df['CouncilArea'].mode()[0])

print(df.isnull().sum())
Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64

IMPUTATION: Imputation entails replacing missing values with estimates derived from the available information. Mean imputation, median imputation, regression imputation, and multiple imputation are common examples.

The aim of imputation is to make accurate estimates of the missing values based on the patterns and correlations found in the data. Several imputation techniques can be applied; we will demonstrate a few of them on the California Housing dataset.

First, let’s load the dataset and create some missing values:

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the California Housing Prices dataset
california_housing = fetch_california_housing(as_frame=True)
df = california_housing.frame

# Create some missing values
df.iloc[2:10, 0] = np.nan
df.iloc[20:30, 3] = np.nan
df.iloc[100:120, 5] = np.nan

# Display the first few rows of the dataset
print(df.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2     NaN      52.0  8.288136   1.073446       496.0  2.802260     37.85
3     NaN      52.0  5.817352   1.073059       558.0  2.547945     37.85
4     NaN      52.0  6.281853   1.081081       565.0  2.181467     37.85

   Longitude  MedHouseVal
0    -122.23        4.526
1    -122.22        3.585
2    -122.24        3.521
3    -122.25        3.413
4    -122.25        3.422

As we can see, there are missing values in the dataset. Let’s try some different imputation techniques to fill in these missing values.

Mean imputation: This technique involves replacing a variable’s missing values with the mean of its available data.

Here is an example of mean imputation. We fill in missing values with the mean value of the corresponding column using the SimpleImputer class from scikit-learn:

from sklearn.impute import SimpleImputer

# Create a mean imputer
mean_imputer = SimpleImputer(strategy='mean')

# Impute missing values with mean imputation
imputed_df = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

# Display the first few rows of the imputed dataset
print(imputed_df.head())
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.325200      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.301400      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  3.870556      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  3.870556      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  3.870556      52.0  6.281853   1.081081       565.0  2.181467     37.85

   Longitude  MedHouseVal
0    -122.23        4.526
1    -122.22        3.585
2    -122.24        3.521
3    -122.25        3.413
4    -122.25        3.422

Median Imputation: This method involves replacing the missing values with the median of the available data for that variable.

Another simple imputation technique is to fill in missing values with the median of the corresponding column. We can again use the SimpleImputer class from scikit-learn, just as with mean imputation, but with the strategy parameter set to ‘median’:

# Create a median imputer
median_imputer = SimpleImputer(strategy='median')

# Impute missing values with median imputation
imputed_df = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

# Display the first few rows of the imputed dataset
print(imputed_df.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  3.5346      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  3.5346      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  3.5346      52.0  6.281853   1.081081       565.0  2.181467     37.85

   Longitude  MedHouseVal
0    -122.23        4.526
1    -122.22        3.585
2    -122.24        3.521
3    -122.25        3.413
4    -122.25        3.422

Regression imputation: This technique uses a regression model to predict a variable’s missing values from the other variables in the dataset.

Regression imputation is a more advanced technique that involves predicting missing values with a regression model. We can use scikit-learn’s IterativeImputer class (still experimental, hence the extra enabling import) to perform regression imputation:

# IterativeImputer is experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Create a regression imputer using a Random Forest model
regression_imputer = IterativeImputer(estimator=RandomForestRegressor(), max_iter=10, random_state=0)

# Impute missing values with regression imputation
imputed_df = pd.DataFrame(regression_imputer.fit_transform(df), columns=df.columns)

# Display the first few rows of the imputed dataset
print(imputed_df.head())
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.325200      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.301400      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  7.900784      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  4.424026      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  5.400203      52.0  6.281853   1.081081       565.0  2.181467     37.85

   Longitude  MedHouseVal
0    -122.23        4.526
1    -122.22        3.585
2    -122.24        3.521
3    -122.25        3.413
4    -122.25        3.422

Multiple imputation: This method involves creating multiple imputed datasets based on different imputation models and combining the results to obtain more accurate estimates of the missing values.
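scikit-learn does not implement the full multiple-imputation workflow (such as MICE with pooling via Rubin’s rules), but as a rough, simplified sketch you can run a stochastic IterativeImputer several times with different seeds and combine the completed datasets. Here we continue with the California df from above and simply average the imputations:

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Draw several stochastic imputations of the same dataset
# (sample_posterior=True makes each run draw from the predictive distribution)
imputations = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputations.append(imputer.fit_transform(df))

# Combine the five completed datasets; a full MICE analysis would instead
# analyze each one separately and pool the resulting estimates
combined = np.mean(imputations, axis=0)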

Imputation provides a number of benefits, including:

1. It can increase the accuracy of the analysis by providing complete data.

2. It can lessen the bias that deleting observations with missing data can introduce.

3. It can boost the statistical power of the study by making use of more of the data.

However, imputation also has some limitations, including:

A. Imputation assumes that the patterns in the missing data resemble those in the observed data, which may not always be the case.

B. Imputation can distort the variability and correlations of the data if the imputed values are not accurately estimated.
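Limitation B is easy to demonstrate: mean imputation, for instance, shrinks a variable’s variance because every filled-in value sits exactly at the mean. A quick sketch on synthetic data:

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 1, 1000))
s.iloc[:300] = np.nan  # hypothetically knock out 30% of the values

print(s.var())                   # variance of the observed values, near 1
print(s.fillna(s.mean()).var())  # noticeably smaller after mean imputation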

INTERPOLATION: Interpolation estimates missing values from nearby data points. This strategy is frequently used with time-series or geographical data.
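As a minimal sketch on a hypothetical daily series, pandas’ interpolate() fills each gap from its neighbors:

import pandas as pd
import numpy as np

# A small daily time series with a two-day gap
ts = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0],
               index=pd.date_range("2023-01-01", periods=5, freq="D"))

# Linear interpolation fills the gap with 2.0 and 3.0
print(ts.interpolate(method="linear"))

# Time-aware interpolation weights by the actual timestamps,
# which matters when observations are unevenly spaced
print(ts.interpolate(method="time"))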

TIPS FOR HANDLING MISSING DATA:

  • Recognize the type of missing data: Before applying any strategy, understand the cause and pattern of the missing data, for example whether it is missing completely at random, missing at random, or missing not at random. This knowledge helps you select an appropriate method, evaluate the impact of missing data on your results, and avoid bias.
  • Check for missingness: Always check your dataset for missingness before analysis. If you don’t know which variables have missing values, you can’t handle them effectively.
  • Use descriptive statistics and visualization techniques to identify the missing-data pattern and determine the appropriate handling method.
  • Consider the effects of imputation: imputing missing data can introduce uncertainty and bias into your research, so assess its influence on your results and report it in your analysis.
  • To increase the validity of your results, it’s good practice to apply multiple techniques for handling missing data and compare their results. This can help you identify the most suitable method for your data and increase the reliability of your analysis.

To sum up, managing missing data is a crucial stage in any data analysis effort. By using the right procedures and following these recommendations, data analysts can reduce the influence of missing data on their conclusions and help ensure the validity and reliability of their research.
