DataFrames: Handling Missing Values in Pandas

Punyakeerthi BL
Jun 24, 2024

This article continues from the previous one in this series, so please read that first:

Unveiling DataFrame: A Look at Essential Pandas Functions

This article explores how to handle missing values, also known as null values, in pandas DataFrames. Missing values can disrupt data analysis and machine learning tasks. Fortunately, pandas offers several methods to address them.

Why Handle Missing Values?

Missing values can lead to errors in calculations and skew analysis results. For instance, calculating the average with missing grades wouldn’t provide an accurate picture of student performance. Similarly, machine learning models often struggle with datasets containing missing data.
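To make the point about averages concrete, here is a minimal sketch using a hypothetical Series of five Physics grades with one missing entry. The values are illustrative assumptions, not data from a real class. It shows that the result depends on how the missing grade is treated.

```python
import math
import pandas as pd

# Hypothetical grades for five students; the second grade is missing
grades = pd.Series([90, None, 85, 70, 60], name="Physics")

# By default pandas skips NaN, so this averages only the four known grades
mean_skipping_nan = grades.mean()

# With skipna=False the missing value propagates and the result is NaN
mean_keeping_nan = grades.mean(skipna=False)

# Treating the missing grade as 0 pulls the average down
mean_as_zero = grades.fillna(0).mean()

print(mean_skipping_nan)  # 76.25
print(mean_keeping_nan)   # nan
print(mean_as_zero)       # 61.0
```

None of the three numbers is "the" class average; each reflects a different assumption about the missing grade, which is why the choice needs to be made deliberately.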

Approaches to Handling Missing Values in Pandas

Here are three common approaches to dealing with missing values in pandas:

Dropping Rows:

  • This is a straightforward method but can lead to data loss.
  • Use df.dropna() to drop rows with any missing values.
import pandas as pd

# Sample DataFrame with missing values
data = {'Roll No': [1, 2, 3, 4, 5],
        'Physics': [90, None, 85, 70, 60],
        'Chemistry': [80, 75, None, 90, 85],
        'Maths': [None, 65, 78, 82, 95],
        'Computers': [None, 80, 72, 88, None]}
df = pd.DataFrame(data)

# Drop rows with missing values
df2 = df.dropna()
print(df2)

This code creates a DataFrame df with missing values and then uses df.dropna() to create a new DataFrame df2 containing only rows with complete data.
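Before dropping anything, it can help to see which rows would be removed. The sketch below rebuilds the sample DataFrame from this article and uses a Boolean mask for that purpose; the variable names rows_with_missing and kept are chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the article
data = {'Roll No': [1, 2, 3, 4, 5],
        'Physics': [90, None, 85, 70, 60],
        'Chemistry': [80, 75, None, 90, 85],
        'Maths': [None, 65, 78, 82, 95],
        'Computers': [None, 80, 72, 88, None]}
df = pd.DataFrame(data)

# True for every row that has at least one missing value
rows_with_missing = df.isna().any(axis=1)

# Rows that dropna() would remove
print(df[rows_with_missing])

# Rows that survive; this matches df.dropna()
kept = df[~rows_with_missing]
print(kept)
print(kept.equals(df.dropna()))  # True
```

With this sample, only the row for Roll No 4 is complete, so dropping rows discards four of the five students.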

Dropping Columns:

  • This approach removes entire columns with missing values.
  • Use df.drop() with axis=1 to drop columns.
# Drop columns with missing values
df3 = df.drop(df.columns[df.isna().any()], axis=1)
print(df3)

This code uses df.isna().any() to identify the columns that contain missing values and then drops those columns with df.drop() and axis=1, creating the DataFrame df3.
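The same result can also be obtained with dropna() and axis=1. The following sketch, again using the article's sample data, compares the two spellings; via_drop and via_dropna are names chosen for this illustration.

```python
import pandas as pd

# Same sample data as in the article
data = {'Roll No': [1, 2, 3, 4, 5],
        'Physics': [90, None, 85, 70, 60],
        'Chemistry': [80, 75, None, 90, 85],
        'Maths': [None, 65, 78, 82, 95],
        'Computers': [None, 80, 72, 88, None]}
df = pd.DataFrame(data)

# Approach from the article: select the columns with missing values, then drop them
via_drop = df.drop(df.columns[df.isna().any()], axis=1)

# Shorter spelling: let dropna() work along the column axis
via_dropna = df.dropna(axis=1)

print(via_dropna.columns.tolist())   # ['Roll No']
print(via_drop.equals(via_dropna))   # True
```

In this sample every subject column has at least one missing grade, so only Roll No survives, which shows how aggressive column dropping can be.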

Filling Missing Values (to be covered in the next article):

  • This strategy replaces missing values with appropriate substitutes.
  • Different techniques exist, such as replacing with mean, median, or carrying forward/backward values.
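As a brief preview of the techniques named above, here is a minimal sketch using two columns of the sample data. It is not a recommendation of one method over another; the variable names are chosen for illustration.

```python
import pandas as pd

# Two columns of the sample data, each with one missing grade
df = pd.DataFrame({
    'Physics': [90, None, 85, 70, 60],
    'Chemistry': [80, 75, None, 90, 85],
})

# Replace each missing value with its column's mean or median
filled_mean = df.fillna(df.mean())
filled_median = df.fillna(df.median())

# Carry the previous value forward, or the next value backward
filled_forward = df.ffill()
filled_backward = df.bfill()

print(filled_mean)
print(filled_median)
print(filled_forward)
print(filled_backward)
```

Forward and backward filling only make sense when row order carries meaning (for example, time series), which is not really the case for a list of students.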

Choosing the Right Approach:

The best approach depends on the specific dataset and analysis goals. Dropping rows or columns might be suitable if the amount of missing data is minimal. However, if dropping would discard too much data, imputation techniques for filling missing values become necessary (covered in a future article).
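One way to judge whether the amount of missing data is "minimal" is to measure it before deciding. The sketch below, using the article's sample data, computes the share of missing values per column and the share of rows that would survive dropna(). What counts as an acceptable share is a judgment call that this sketch does not settle.

```python
import pandas as pd

# Same sample data as in the article
data = {'Roll No': [1, 2, 3, 4, 5],
        'Physics': [90, None, 85, 70, 60],
        'Chemistry': [80, 75, None, 90, 85],
        'Maths': [None, 65, 78, 82, 95],
        'Computers': [None, 80, 72, 88, None]}
df = pd.DataFrame(data)

# Fraction of missing values in each column (mean of True/False flags)
missing_share_per_column = df.isna().mean()
print(missing_share_per_column)

# Fraction of rows that remain after dropping rows with any missing value
rows_kept_share = len(df.dropna()) / len(df)
print(rows_kept_share)  # 0.2
```

Here no single column is missing more than 40% of its values, yet dropping rows keeps only 20% of the data, because the gaps are spread across different rows.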

This article provides a foundational understanding of handling missing values in pandas. Stay tuned for the next article, which will delve deeper into imputation methods for filling missing data effectively!

Handling Missing Values in Pandas DataFrames with dropna()

In data analysis using Python’s Pandas library, you’ll often encounter datasets containing missing values, typically represented by NaN (Not a Number). The dropna() method provides a direct way to clean your data by removing the rows or columns that contain them.

Example Program and Output:

import pandas as pd
# Assuming the sample.csv file is in the specified location
df = pd.read_csv("C:/Users/PANDAS_DATA/pandastutorial-main/pandastutorial-main/Datasets/sample.csv")

# Preview the first few rows
print(df.head())

# Check for missing values
print(df.isnull())

# Count missing values per column
print(df.isnull().sum())

# Total number of missing values (all columns combined)
total_missing = df.isnull().sum().sum()
print("Total missing values:", total_missing)

# Original DataFrame shape
print(df.shape)

# Drop rows with any missing values (default behavior)
df2 = df.dropna()
print(df2.shape) # Reduced number of rows

# Drop columns with any missing values
df3 = df.dropna(axis=1)
print(df3.shape) # Potentially reduced number of columns

# Drop rows only if all values are missing
df4 = df.dropna(how='all')
print(df4.shape) # May have the same number of rows if no all-NaN rows exist

# Modify the original DataFrame in place (returns None instead of a new DataFrame)
df.dropna(inplace=True)
print(df.shape) # Reduced number of rows (same as df2)

Explanation:

  • Import Pandas: import pandas as pd imports the Pandas library under the conventional alias pd.
  • Read CSV Data: df = pd.read_csv("C:/Users/PANDAS_DATA/pandastutorial-main/pandastutorial-main/Datasets/sample.csv") reads the CSV file into a DataFrame named df, assuming the file exists at that path.
  • Previewing Data: print(df.head()) displays the first five rows of df to get a glimpse of the data.
  • Identifying Missing Values: print(df.isnull()) prints a DataFrame of Boolean values in which True marks a missing value (NaN).
  • Counting Missing Values per Column: print(df.isnull().sum()) prints the number of missing values in each column.
  • Total Missing Values: total_missing = df.isnull().sum().sum() adds up the per-column counts to give the total number of missing values in df.
  • Original DataFrame Shape: print(df.shape) displays the original dimensions of df as (rows, columns).
  • Dropping Rows with Any Missing Values: df2 = df.dropna() creates a new DataFrame df2 without the rows that contain any missing value (the default behavior). print(df2.shape) shows the resulting, possibly smaller, number of rows.
  • Dropping Columns with Any Missing Values: df3 = df.dropna(axis=1) creates df3 by dropping every column with at least one missing value. print(df3.shape) shows the resulting, possibly smaller, number of columns.
  • Dropping Rows Only if All Values Are Missing: df4 = df.dropna(how='all') creates df4 by dropping only the rows in which every value is missing. print(df4.shape) shows the shape of df4, which is the same as that of df if no such rows exist.
  • Modifying Inplace: df.dropna(inplace=True) removes the rows with missing values from df itself and returns None, so no new variable is needed. Pandas may still copy data internally, so this is not guaranteed to save memory. print(df.shape) then shows the updated shape of df, which is the same as that of df2.
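
Because sample.csv is not included with this article, the same steps can be tried on a small in-memory DataFrame. The sketch below is an illustration only: the column names and values are hypothetical stand-ins, not the contents of sample.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sample.csv; the third row is entirely missing
df = pd.DataFrame({
    'Name': ['Asha', 'Ravi', None, 'Meena'],
    'Score': [88.0, np.nan, np.nan, 92.0],
    'City': ['Pune', 'Delhi', None, 'Chennai'],
})

per_column = df.isnull().sum()      # missing values in each column
total = int(per_column.sum())       # missing values overall
print(per_column)
print(total)                        # 4

print(df.shape)                     # (4, 3)
print(df.dropna().shape)            # (2, 3): only fully complete rows remain
print(df.dropna(axis=1).shape)      # (4, 0): every column has a gap
print(df.dropna(how='all').shape)   # (3, 3): only the all-missing row is dropped

# inplace=True changes the object itself and returns None
in_place = df.copy()
result = in_place.dropna(inplace=True)
print(result)                       # None
print(in_place.shape)               # (2, 3)
```

Note that a single all-missing row is enough to make dropna(axis=1) remove every column, so it is often worth dropping such rows first.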

Key Points:

  • dropna() offers flexibility in handling missing values through axis (drop rows or columns) and how (drop when any value, or only when all values, are missing).
  • Consider the implications of data loss when dropping rows or columns.
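
Beyond axis and how, dropna() also accepts subset and thresh, which are not covered in this article but can limit data loss. The sketch below applies them to the sample data from the first part of the article; the variable names are chosen for illustration.

```python
import pandas as pd

# Same sample data as in the first part of the article
data = {'Roll No': [1, 2, 3, 4, 5],
        'Physics': [90, None, 85, 70, 60],
        'Chemistry': [80, 75, None, 90, 85],
        'Maths': [None, 65, 78, 82, 95],
        'Computers': [None, 80, 72, 88, None]}
df = pd.DataFrame(data)

# subset: only look at the listed columns when deciding which rows to drop
physics_present = df.dropna(subset=['Physics'])
print(physics_present['Roll No'].tolist())  # [1, 3, 4, 5]

# thresh: keep rows that have at least this many non-missing values
at_least_four = df.dropna(thresh=4)
print(at_least_four['Roll No'].tolist())    # [2, 3, 4, 5]
```

Both calls keep four of the five rows, compared with a single row for a plain df.dropna().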

If you like this post, please follow me on LinkedIn: Punyakeerthi BL
