Handling Missing Data in Pandas Dataframes

Punyakeerthi BL
3 min read · Jun 24, 2024

This article continues from an earlier one, which is worth reading first:

DataFrames : Handling Missing Values in Pandas

Have you ever worked with a dataset that has missing information? These missing values, often represented in pandas as NaN (Not a Number), can skew statistics or break calculations if they are left unhandled. This article walks through df.fillna(), the main pandas method for replacing null values in a DataFrame, so that your data analysis goes more smoothly.

Understanding Null Values

Imagine a class where some students skipped a test. Their marks for that test would be missing in the dataset. These missing marks are null values. While there are various reasons for null values, it’s important to handle them before analyzing the data.
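Before filling anything, it helps to see where the gaps are. The snippet below is a minimal sketch of the class example above; the marks table, student names, and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical marks table: NaN marks a student who skipped that test
marks = pd.DataFrame({
    "student": ["Asha", "Ben", "Chen"],
    "test1": [78.0, np.nan, 91.0],
    "test2": [85.0, 64.0, np.nan],
})

# isna() gives a True/False table; summing it counts missing cells per column
missing_per_column = marks.isna().sum()
print(missing_per_column)

# Keep only the rows that contain at least one missing value
rows_with_gaps = marks[marks.isna().any(axis=1)]
print(rows_with_gaps)
```

Counting the gaps first makes it easier to judge whether filling them is reasonable or whether too much of a column is missing.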

df.fillna() in pandas:

Purpose:

  • Deals with missing values (represented as NaN or None) in a pandas DataFrame (df).
  • Replaces these missing values with other values you specify.

How it Works:

  1. Calling df.fillna(): You invoke this method on your DataFrame (df).
  2. Specifying Values: You can provide a single value (like 0, a string, or a computed value such as a column mean) to fill all missing values with the same replacement. Alternatively, you can pass a dictionary, Series, or DataFrame to assign different replacements to specific columns or index labels.
  3. Filling the Holes: df.fillna() iterates through the DataFrame, identifying missing values and replacing them according to your instructions.
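The steps above can be sketched with a dictionary that gives each column its own replacement. This is a small illustration only; the replacement values 0 and "missing" are arbitrary choices, not recommendations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0],
    "col2": ["a", None, "c"],
})

# Keys are column labels, values are the replacement used in that column
df_filled = df.fillna({"col1": 0, "col2": "missing"})
print(df_filled)

# The original df is untouched because fillna returns a new DataFrame by default
print(df)
```

Using a dictionary avoids putting a number into a text column, which is what happens when one scalar is applied to every column.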

Optional Parameters:

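  • method: Controls how to fill gaps with existing data ('ffill' for forward-fill, 'bfill' or 'backfill' for backward-fill). This parameter is deprecated since pandas 2.1; df.ffill() and df.bfill() are the recommended replacements.
  • axis: Specifies the direction of the fill: 0 or 'index' fills down each column, 1 or 'columns' fills across each row.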
  • inplace: If True, modifies the original DataFrame (df). If False (default), returns a new DataFrame with the missing values filled.

Explanation of all parameters in df.fillna() with code examples:

Parameters:

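1. value (scalar, dict, Series, or DataFrame, optional):

  • A single value to fill all missing values in the DataFrame, or a dict, Series, or DataFrame that supplies a separate replacement per column.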
import pandas as pd

data = {'col1': [1, None, 3], 'col2': ['a', None, 'c']}
df = pd.DataFrame(data)

# Fill all missing values with 0
df_filled = df.fillna(0)
print(df_filled)

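This will output:

   col1 col2
0   1.0    a
1   0.0    0
2   3.0    c

Note that the scalar 0 is applied to every column, so the missing entry in the text column col2 also becomes 0.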
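A common choice for numeric data is to fill each column with its own mean rather than a fixed number. The following is a minimal sketch under the assumption that every column is numeric; the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0],
    "col2": [10.0, 20.0, np.nan],
})

# df.mean() skips NaN by default and returns one mean per column (a Series).
# Passing that Series as `value` fills each column with its own mean.
column_means = df.mean()
df_filled = df.fillna(column_means)
print(df_filled)
```

Here col1 is filled with 2.0 and col2 with 15.0, the means of the values that are present.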

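2. method (str, optional, default None):

  • Specifies how to fill gaps with existing values in the DataFrame:
  • 'ffill': Propagates the last valid observation forward to fill the next missing value(s).
  • 'bfill' or 'backfill': Fills missing values by propagating the next valid observation backward.
  • There is no default fill method: if method is None, you must pass value instead. Since pandas 2.1 this parameter is deprecated in favor of df.ffill() and df.bfill().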

Example (using ffill):

# Forward-fill: copy the last valid value in each column into the gap below it.
# df.ffill() is the current spelling of the older df.fillna(method='ffill').
df_filled = df.ffill()
print(df_filled)

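This will output:

   col1 col2
0   1.0    a
1   1.0    a
2   3.0    c

Forward-fill copies the previous row's value in each column, so col2 is filled with 'a'. No mean is computed.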
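To make the difference between the two directions concrete, here is a small side-by-side sketch. It uses df.ffill() and df.bfill(), and the column name "reading" and its values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reading": [np.nan, 2.0, np.nan, np.nan, 5.0, np.nan]})

forward = df.ffill()   # carry the last valid value forward
backward = df.bfill()  # pull the next valid value backward

comparison = pd.DataFrame({
    "original": df["reading"],
    "ffill": forward["reading"],
    "bfill": backward["reading"],
})
print(comparison)
```

Forward-fill cannot fill a gap at the very start of a column and backward-fill cannot fill one at the very end, because there is no valid value to copy from in that direction.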

3. axis (int or str, default=None):

  • Along which axis to fill missing values:
  • 0 or 'index' (default): Fills down each column (vertically), so a gap takes its value from the row above or below it.
  • 1 or 'columns' (less common): Fills across each row (horizontally), so a gap takes its value from the neighboring column.

Example (using axis=1):

# Forward-fill across each row: a gap takes the value from the column to its left
df_filled = df.ffill(axis=1)
print(df_filled)

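With this df, no gap is actually filled. Rows 0 and 2 have no missing values, and row 1 is missing in both col1 and col2, so there is no valid value to its left to carry across. Forward-fill only copies existing values; it never computes a mean or invents a value such as 'b'.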
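The effect of axis is easier to see on an all-numeric table that has gaps in different places. This is a minimal sketch with invented column names (q1, q2, q3) and values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "q1": [10.0, np.nan, 30.0],
    "q2": [np.nan, 5.0, np.nan],
    "q3": [12.0, np.nan, np.nan],
})

down = df.ffill(axis=0)    # each gap takes the value from the row above
across = df.ffill(axis=1)  # each gap takes the value from the column to its left

print(down)
print(across)
```

With axis=0 the top cell of q2 stays NaN because nothing sits above it; with axis=1 the first cell of the middle row stays NaN because nothing sits to its left.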

4. inplace (bool, default=False):

  • Whether to modify the original DataFrame (df):
  • True: Modifies df in-place.
  • False (default): Returns a new DataFrame with the missing values filled.

Example (using inplace=True):

# Forward-fill the missing values in place, modifying df itself
df.ffill(inplace=True)
print(df)
  • This directly modifies df, so print(df) shows the filled DataFrame.
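The difference between the default behavior and modifying the original can be checked directly. The sketch below uses plain reassignment as an alternative to inplace=True; that is a stylistic choice for the example, and the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, np.nan, 3.0]})

# Default (inplace=False): a new DataFrame is returned and df keeps its NaN
filled = df.fillna(0)
missing_in_original = int(df["col1"].isna().sum())
missing_in_copy = int(filled["col1"].isna().sum())
print(missing_in_original, missing_in_copy)

# Reassigning the name has the same visible effect as inplace=True
df = df.fillna(0)
missing_after_reassign = int(df["col1"].isna().sum())
print(missing_after_reassign)
```

A frequent mistake is to call df.fillna(0) without inplace=True and without keeping the result, in which case df still contains its missing values.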

5. limit (int, optional):

  • When filling with ffill or bfill, this specifies the maximum number of consecutive NaNs to forward/backward fill.
  • If a gap has more consecutive NaNs than limit, it will only be partially filled.
  • When filling with a value instead, limit caps the total number of NaNs filled along the axis.

Example (limiting forward-fill to 1):

# Forward-fill, filling at most 1 consecutive NaN in each gap
df_filled = df.ffill(limit=1)
print(df_filled)

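This will output:

   col1 col2
0   1.0    a
1   1.0    a
2   3.0    c

Each column here has a single missing value, which is within the limit of 1, so both gaps are filled. The limit only makes a difference when a gap contains more consecutive NaNs than the limit allows.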
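To see limit actually hold something back, the gap has to be longer than the limit. The sketch below uses an invented column with three consecutive NaNs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, np.nan, np.nan, np.nan, 5.0]})

# Only the first NaN after a valid value is filled; the rest of the gap stays
limit_1 = df.ffill(limit=1)

# Two NaNs are filled, one is still left over
limit_2 = df.ffill(limit=2)

print(limit_1)
print(limit_2)
```

With limit=1 the column becomes 1.0, 1.0, NaN, NaN, 5.0, and with limit=2 it becomes 1.0, 1.0, 1.0, NaN, 5.0.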

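6. downcast (dict, optional):

  • A dictionary mapping column labels to the dtypes they should be downcast to where possible, or the string 'infer' to let pandas pick a suitable narrower type (for example float64 to int64). This parameter is deprecated since pandas 2.2.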
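Because downcast is deprecated in recent pandas releases, one alternative is to fill first and then convert the dtype explicitly with astype. This is a sketch of that alternative, not of the downcast parameter itself; the column name "visits" is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"visits": [1.0, np.nan, 3.0]})

# A column that contains NaN is stored as float64, and stays float64 after filling
filled = df.fillna(0)
dtype_after_fill = str(filled["visits"].dtype)

# Convert explicitly once no missing values remain
as_int = filled.astype({"visits": "int64"})
dtype_after_cast = str(as_int["visits"].dtype)

print(dtype_after_fill, dtype_after_cast)
```

The explicit conversion only works after the gaps are filled, since a plain int64 column cannot hold NaN.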

Remember to choose the appropriate parameters based on your data and analysis goals.

Key Points:

  • df.fillna() is a versatile tool for handling missing data in pandas DataFrames.
  • Choose the appropriate replacement strategy based on your data and analysis needs.
  • Consider the potential impact of filling missing values on your results, especially if the missingness is not random.
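The last point can be illustrated with a quick before-and-after comparison. This is a minimal sketch using the marks example with invented numbers; it shows how a fill value changes a summary statistic.

```python
import numpy as np
import pandas as pd

marks = pd.Series([80.0, np.nan, 90.0, np.nan, 70.0], name="marks")

# mean() skips NaN by default: (80 + 90 + 70) / 3
mean_ignoring_gaps = marks.mean()

# After filling with 0 the absent students count as zeros: 240 / 5
mean_after_zero_fill = marks.fillna(0).mean()

# Filling with the mean keeps the average where it was
mean_after_mean_fill = marks.fillna(marks.mean()).mean()

print(mean_ignoring_gaps, mean_after_zero_fill, mean_after_mean_fill)
```

Filling with 0 drops the class average from 80.0 to 48.0, which is why the replacement strategy should match what a missing value really means in your data.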
