How to Handle Missing Data | Data Cleaning | Exploratory Data Analysis

Missing data is a value that was not recorded for a variable in a dataset. Handling it is a critical step in data cleaning, because the approach you choose can significantly affect the results of an analysis.

First, we have to identify the missing data:

Summary statistics: Use pandas functions such as isnull() or info() to get a summary of missing values.

df.info(): A DataFrame method that prints a concise summary of the DataFrame, including a non-null count for each column, which tells you how many values are missing.

pd.isna() / pd.isnull(): A pandas function that returns a same-sized Boolean array indicating whether each value is null. pd.isnull() is an alias of pd.isna().

pd.notnull(): A pandas function that returns a same-sized Boolean array indicating whether each value is NOT null.

import pandas as pd

# Sample DataFrame with missing values
data = {
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print(df)

# Display a concise summary of the DataFrame using info()
df.info()

# Identify missing values using pd.isna()
missing_values = pd.isna(df)

# Alternatively, using pd.isnull() (which is the same as pd.isna())
missing_values_alt = pd.isnull(df)

# Identify non-missing values using pd.notnull()
not_missing_values = pd.notnull(df)
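
The Boolean masks above are easier to read once they are aggregated. The following is a minimal sketch, reusing the same sample data, of how the mask could be turned into per-column counts, percentages, and a view of the affected rows. The variable names are illustrative.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Share of missing values in each column, as a percentage
missing_pct = df.isna().mean() * 100

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(missing_pct)
print(rows_with_missing)
```

Counts like these help decide between deletion and imputation: a column that is mostly empty is usually handled differently from one with a few gaps.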

Handling Missing Data

i. Deletion:

Listwise Deletion: Remove rows with any missing values.

df.dropna(): A DataFrame method that removes rows or columns that contain missing values, depending on the axis you specify.


# Drop rows with any missing values
df_dropna_rows = df.dropna()

# Drop columns with any missing values
df_dropna_columns = df.dropna(axis=1)
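
Dropping every row with any gap can discard a lot of data. As a hedged sketch of more selective deletion, dropna() also accepts the subset, thresh, and how parameters. The thresholds chosen here are arbitrary and only for illustration.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None],
})

# Drop a row only if 'Age' is missing; gaps in other columns are tolerated
df_age_known = df.dropna(subset=['Age'])

# Keep only rows that have at least 3 non-missing values
df_mostly_complete = df.dropna(thresh=3)

# Drop a row only when every value in it is missing
df_not_empty = df.dropna(how='all')

print(df_age_known)
print(df_mostly_complete)
print(df_not_empty)
```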

ii. Imputation

Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.

df['column'] = df['column'].fillna(df['column'].mean())

# Mean imputation for numerical data
# (the result is assigned back rather than using inplace=True on a column
# selection, which recent pandas versions warn about)
df_mean_imputed = df.copy()
df_mean_imputed['Age'] = df_mean_imputed['Age'].fillna(df['Age'].mean())

# Median imputation for numerical data
df_median_imputed = df.copy()
df_median_imputed['Age'] = df_median_imputed['Age'].fillna(df['Age'].median())

# Mode imputation for numerical and categorical data
# (mode() returns every tied value in sorted order, so [0] picks the first one)
df_mode_imputed = df.copy()
df_mode_imputed['Grade'] = df_mode_imputed['Grade'].fillna(df['Grade'].mode()[0])
df_mode_imputed['Age'] = df_mode_imputed['Age'].fillna(df['Age'].mode()[0])
df_mode_imputed['Name'] = df_mode_imputed['Name'].fillna(df['Name'].mode()[0])
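
The choice between mean and median matters most when a column contains outliers. The sketch below uses a small hypothetical set of ages (not the sample data above) with one extreme value, to show how far the two fill values can diverge.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (95) and one missing value
ages = pd.Series([25, 30, np.nan, 35, 95], name='Age')

mean_value = ages.mean()      # pulled upward by the outlier
median_value = ages.median()  # stays close to the typical ages

filled_with_mean = ages.fillna(mean_value)
filled_with_median = ages.fillna(median_value)

print(mean_value, median_value)
print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

Here the mean is 46.25 while the median is 32.5, so the median gives a fill value that looks more like the other observations.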


iii. Create a NaN category: Treat "missing" as its own category or sentinel value instead of estimating what the value would have been.

import pandas as pd

# Fill missing values with a specific category or value
df_fill_nan = df.fillna({
    'Grade': 'NaN',      # Fill missing 'Grade' with the string 'NaN'
    'Age': -1,           # Fill missing 'Age' with -1
    'Name': 'Unknown'    # Fill missing 'Name' with 'Unknown'
})

# Print the resulting DataFrame
print(df_fill_nan)
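
A sentinel such as -1 can be mistaken for a real value later on. One possible safeguard, sketched here with a hypothetical column name Age_missing, is to record where the gaps were before filling them.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None],
})

df_flagged = df.copy()

# Record which rows had no 'Age' before the gaps are filled
# ('Age_missing' is an illustrative column name)
df_flagged['Age_missing'] = df_flagged['Age'].isna()

# Fill the gaps with the sentinel value
df_flagged['Age'] = df_flagged['Age'].fillna(-1)

print(df_flagged)
```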

iv. Forward filling and backward filling: Fill each gap with the nearest observed value, either the last value before it (forward fill) or the next value after it (backward fill).

df.ffill() / df.bfill(): DataFrame methods that fill missing values by propagating the previous or the next valid value. The older form df.fillna(method='ffill') is deprecated in recent pandas versions.

# Forward fill missing values
df_ffill = df.ffill()

# Backward fill missing values
df_bfill = df.bfill()
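
Two details of forward and backward filling are easy to miss: a gap at the start (or end) of the data has nothing to copy from, and long gaps get filled with the same value repeatedly unless a limit is set. The sketch below uses a small hypothetical series to show both.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a leading gap, a two-value gap, and a trailing gap
readings = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: the leading gap stays missing because nothing precedes it
forward = readings.ffill()

# Forward fill at most one consecutive missing value per gap
forward_limited = readings.ffill(limit=1)

# Backward fill: the trailing gap stays missing because nothing follows it
backward = readings.bfill()

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```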

v. Interpolation: Estimate missing values from the values of the surrounding data points. This suits ordered data such as time series.

import pandas as pd
import numpy as np

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform linear interpolation
df['Temperature (°C)'] = df['Temperature (°C)'].interpolate(method='linear')

# Display the DataFrame
print(df)
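
To make the arithmetic behind linear interpolation concrete, the sketch below computes the two missing temperatures by hand and compares them with the pandas result. The helper linear_estimate is a hypothetical function written for this illustration. Note that method='linear' treats the values as equally spaced, which matches this data because the days are consecutive.

```python
import numpy as np
import pandas as pd

temps = pd.Series(
    [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0],
    index=[1, 2, 3, 4, 5, 6, 7],  # Day
)

def linear_estimate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Day 3 lies between day 2 (21.5) and day 4 (23.0)
day3 = linear_estimate(3, 2, 21.5, 4, 23.0)

# Day 6 lies between day 5 (24.0) and day 7 (25.0)
day6 = linear_estimate(6, 5, 24.0, 7, 25.0)

filled = temps.interpolate(method='linear')

print(day3, filled.loc[3])
print(day6, filled.loc[6])
```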

vi. Model-Based Imputation: Use for complex datasets where relationships between features can be leveraged. Model-based imputation trains a machine learning model on the rows where the value is known, then uses it to predict the missing values from the other features in the dataset.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Separate the known and unknown values
known = df.dropna()
unknown = df[df['Temperature (°C)'].isna()]

# Prepare the training data
X_train = known[['Day']]
y_train = known['Temperature (°C)']

# Prepare the test data (the rows with missing values)
X_test = unknown[['Day']]

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the missing values
predictions = model.predict(X_test)

# Fill in the missing values
df.loc[df['Temperature (°C)'].isna(), 'Temperature (°C)'] = predictions

# Display the DataFrame
print(df)
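
When several columns have gaps, fitting one model per column by hand becomes tedious. As one possible alternative (not the method used above), scikit-learn provides IterativeImputer, which models each feature with missing values as a function of the others. It is still marked experimental, so it has to be enabled explicitly. This is a minimal sketch on the same temperature data.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0],
})

# random_state is fixed only to make the sketch reproducible
imputer = IterativeImputer(random_state=0)

# fit_transform returns a NumPy array with the same shape as the input
imputed = imputer.fit_transform(df[['Day', 'Temperature (°C)']])

df_imputed = df.copy()
df_imputed['Temperature (°C)'] = imputed[:, 1]

print(df_imputed)
```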

vii. KNN Imputation: Use for smaller datasets where similar instances can inform missing values. Each missing value is filled with the average of that feature across the k most similar rows.

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0]
}

# Create a DataFrame
df = pd.DataFrame(data)

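# Initialize the KNN Imputer with the number of neighbors
imputer = KNNImputer(n_neighbors=2)

# Perform KNN imputation. 'Day' is included as a feature so that neighbors can
# be found; with 'Temperature (°C)' alone there is nothing to measure similarity on.
features = ['Day', 'Temperature (°C)']
imputed = imputer.fit_transform(df[features])
df['Temperature (°C)'] = imputed[:, 1]

# Display the DataFrame
print(df)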

By choosing the appropriate method based on your specific dataset and analysis requirements, you can handle missing data effectively and improve the quality of your analysis.

