How to Handle Missing Data | Data Cleaning | Exploratory Data Analysis

4 min read · Jun 28, 2024

Missing data is any value that was not recorded for a variable in a dataset. Handling it is a critical step in data cleaning, because the method you choose can significantly change the results of your analysis.

First, we have to identify the missing data:

Summary Statistics: Use pandas functions such as isna() or info() to get a summary of missing values.

df.info(): A DataFrame method that prints a concise summary of the DataFrame, including a 'Non-Null Count' for each column. Comparing that count with the total number of rows tells you how many values are missing.

pd.isna() / pd.isnull(): A pandas function that returns a same-sized Boolean array indicating whether each value is null. pd.isnull() is an alias of pd.isna().

pd.notnull(): A pandas function that returns a same-sized Boolean array indicating whether each value is NOT null.

import pandas as pd

# Sample DataFrame with missing values
data = {
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None]
}
df = pd.DataFrame(data)

# Display the original DataFrame
print(df)

# Display concise summary of the DataFrame using df.info()
df.info()

# Identify missing values using pd.isna()
missing_values = pd.isna(df)
print(missing_values)

# Alternatively, using pd.isnull() (an alias of pd.isna())
missing_values_alt = pd.isnull(df)
print(missing_values_alt)

# Identify non-missing values using pd.notnull()
m_values = pd.notnull(df)
print(m_values)
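The Boolean tables above are hard to read on larger datasets, so a per-column count is often more useful. The following is a minimal sketch that rebuilds the same sample DataFrame; the names missing_count, missing_percent, and summary are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None],
})

# Number of missing values in each column
missing_count = df.isna().sum()

# Share of missing values in each column, as a percentage
missing_percent = df.isna().mean() * 100

# Combine both into one small summary table
summary = pd.DataFrame({
    'missing_count': missing_count,
    'missing_percent': missing_percent,
})
print(summary)
```

Columns with a high share of missing values are usually the first ones to look at when deciding between deletion and imputation.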

Handling Missing Data

i. Deletion:

Listwise Deletion: Remove rows with any missing values.

df.dropna(): A DataFrame method that removes rows or columns that contain missing values, depending on the axis you specify.

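# In-place variant: modifies df itself, so the later examples
# would no longer see any missing values if you run it.
# df.dropna(inplace=True)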

# Drop rows with any missing values
df_dropna_rows = df.dropna()
print(df_dropna_rows)

# Drop columns with any missing values
df_dropna_columns = df.dropna(axis=1)
print(df_dropna_columns)
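Dropping every row with any gap can discard a lot of data. As a hedged sketch of two gentler options, dropna() also accepts subset (only check certain columns) and thresh (minimum number of non-missing values a row must have). The variable names below are illustrative.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Grade': ['A', None, 'B', None, 'C'],
    'Age': [25, 30, None, 35, None],
    'Name': ['John', None, 'Alice', 'Bob', None],
})

# Drop rows only when 'Age' is missing; gaps in other columns are kept
df_age_known = df.dropna(subset=['Age'])
print(df_age_known)

# Keep rows that have at least 3 non-missing values (out of 4 columns)
df_mostly_complete = df.dropna(thresh=3)
print(df_mostly_complete)
```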

ii. Imputation

Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.

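# Assign the result back to the column. Calling fillna(..., inplace=True)
# on a selected column triggers a warning in recent pandas versions and
# may not update the DataFrame at all.
df['column'] = df['column'].fillna(df['column'].mean())


# Mean Imputation for numerical data
df_mean_imputed = df.copy()
df_mean_imputed['Age'] = df_mean_imputed['Age'].fillna(df['Age'].mean())

print(df_mean_imputed)

# Median Imputation for numerical data
df_median_imputed = df.copy()
df_median_imputed['Age'] = df_median_imputed['Age'].fillna(df['Age'].median())
print(df_median_imputed)

# Mode Imputation for numerical and categorical data
# mode() can return several values when there is a tie; [0] takes the first one
df_mode_imputed = df.copy()
df_mode_imputed['Grade'] = df_mode_imputed['Grade'].fillna(df['Grade'].mode()[0])
df_mode_imputed['Age'] = df_mode_imputed['Age'].fillna(df['Age'].mode()[0])
df_mode_imputed['Name'] = df_mode_imputed['Name'].fillna(df['Name'].mode()[0])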

print(df_mode_imputed)
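One side effect of mean imputation is worth seeing in numbers: the column mean stays the same, but the spread shrinks because every filled value sits exactly at the center. A small sketch using the 'Age' values from the sample data (the variable names are illustrative):

```python
import pandas as pd

# 'Age' values from the sample data; None becomes NaN in a numeric Series
age = pd.Series([25, 30, None, 35, None], name='Age')

# Fill the gaps with the mean of the observed values
age_imputed = age.fillna(age.mean())

# The mean is preserved by construction
print('mean before:', age.mean())
print('mean after: ', age_imputed.mean())

# The standard deviation gets smaller after imputation
print('std before: ', age.std())
print('std after:  ', age_imputed.std())
```

This is one reason to prefer other methods when the variability of a column matters for the analysis.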

iii. Create a NaN category: Treat "missing" as its own category by filling the gaps with an explicit placeholder value.

import pandas as pd

# Fill missing values with a specific category or value
df_fill_nan = df.fillna({
    'Grade': 'NaN',   # Fill missing 'Grade' with 'NaN'
    'Age': -1,        # Fill missing 'Age' with -1
    'Name': 'Unknown' # Fill missing 'Name' with 'Unknown'
})

# Print the resulting DataFrame
print(df_fill_nan)
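To see why a placeholder category can be useful, compare how value_counts() treats the column before and after filling. This is a minimal sketch; the placeholder label 'Missing' is an arbitrary choice made here, not something pandas requires.

```python
import pandas as pd

# 'Grade' values from the sample data
grade = pd.Series(['A', None, 'B', None, 'C'], name='Grade')

# By default, value_counts() silently skips missing values
print(grade.value_counts())

# After filling, the gaps show up as their own category
grade_filled = grade.fillna('Missing')
counts = grade_filled.value_counts()
print(counts)
```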

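iv. Forward filling, backward filling: We can also fill a gap from its neighbors. Forward filling copies the last valid value forward, and backward filling copies the next valid value backward.

df.ffill() / df.bfill(): DataFrame methods that fill missing values by propagating the previous or the next valid value. Older code passes method='ffill' or method='bfill' to df.fillna(), which recent pandas versions have deprecated.

# Forward fill missing values
df_ffill = df.ffill()
print(df_ffill)

# Backward fill missing values
df_bfill = df.bfill()
print(df_bfill)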
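Two details of forward and backward filling are easy to miss: a gap at the very start cannot be forward filled, a gap at the very end cannot be backward filled, and long gaps get filled entirely unless you cap them. The sketch below uses a small made-up Series (not from the article's data) to show this, including the limit parameter.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a leading gap, a two-value gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: the first value stays missing (nothing before it)
forward = s.ffill()
print(forward)

# Backward fill: the last value stays missing (nothing after it)
backward = s.bfill()
print(backward)

# limit=1 fills at most one consecutive missing value per gap
limited = s.ffill(limit=1)
print(limited)
```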

v. Interpolation: Estimating missing values based on the values of surrounding data points.

import pandas as pd
import numpy as np

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform linear interpolation
df['Temperature (°C)'] = df['Temperature (°C)'].interpolate(method='linear')

# Display the DataFrame
print(df)
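Linear interpolation draws a straight line between the two known neighbors of a gap. For Day 3, that is 21.5 + (23.0 - 21.5) * (3 - 2) / (4 - 2) = 22.25. As a sketch, the same arithmetic can be checked with NumPy's np.interp and compared with the pandas result; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

days = np.array([1, 2, 3, 4, 5, 6, 7])
temps = np.array([22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0])

# Boolean mask of the days where the temperature is known
known = ~np.isnan(temps)

# Interpolate the missing days from the known (day, temperature) pairs
manual = np.interp(days[~known], days[known], temps[known])
print(manual)

# pandas gives the same values here because the days are evenly spaced
via_pandas = pd.Series(temps).interpolate(method='linear')
print(via_pandas[~known].to_numpy())
```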

vi. Model-Based Imputation: Uses a machine learning model to predict and fill in missing values from the other features in the dataset. Use it for complex datasets where relationships between features can be leveraged.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Separate the known and unknown values
known = df.dropna()
unknown = df[df['Temperature (°C)'].isna()]

# Prepare the training data
X_train = known[['Day']]
y_train = known['Temperature (°C)']

# Prepare the test data (the rows with missing values)
X_test = unknown[['Day']]

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the missing values
predictions = model.predict(X_test)

# Fill in the missing values
df.loc[df['Temperature (°C)'].isna(), 'Temperature (°C)'] = predictions

# Display the DataFrame
print(df)
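scikit-learn also ships an imputer that wraps this fit-and-predict pattern. The sketch below uses IterativeImputer, which is still marked experimental and therefore needs the extra enable import; treat it as one possible alternative to the manual approach above, not as a drop-in equivalent, since it uses a different default regressor.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0],
})

# Each feature with gaps is modeled as a function of the other features
imputer = IterativeImputer(random_state=0)
imputed = imputer.fit_transform(df[['Day', 'Temperature (°C)']])

# Column 1 of the result holds the completed temperatures
df['Temperature (°C)'] = imputed[:, 1]
print(df)
```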

vii. KNN Imputation: Fills each missing value with the average of that feature in the k most similar rows. Use it for smaller datasets where similar instances can inform missing values.

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7],
    'Temperature (°C)': [22.0, 21.5, np.nan, 23.0, 24.0, np.nan, 25.0]
}

# Create a DataFrame
df = pd.DataFrame(data)

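# Initialize the KNN Imputer with the number of neighbors
imputer = KNNImputer(n_neighbors=2)

# Perform KNN imputation. 'Day' is included so the imputer can measure which
# rows are closest; given only the 'Temperature (°C)' column it has nothing
# to compare and falls back to the column mean.
imputed = imputer.fit_transform(df[['Day', 'Temperature (°C)']])
df['Temperature (°C)'] = imputed[:, 1]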

# Display the DataFrame
print(df)

By choosing the appropriate method based on your specific dataset and analysis requirements, you can handle missing data effectively and improve the quality of your analysis.

Written by Rina Mondal