Day 2 — Mastering Pandas: Filling Missing Data and Filtering with Confidence

Sesha Sai
3 min read · Dec 11, 2024


Hello, readers! 👋 On my second day of delving deeper into the world of Python and Pandas, I encountered some key real-world challenges in data manipulation. One of the most important skills in the journey of data analysis is understanding how to handle missing data and filter datasets efficiently. Today, I explored some practical methods and nuances that I’d love to share with you!

Day 2 with pandas

Tackling Missing Data Like a Pro

Data is rarely perfect. Missing values are common and can disrupt analyses if not handled correctly. Here’s what I learned today about detecting, filling, and dropping missing values:

  1. Detecting Missing Values

Use .isna() or .isnull() to identify NaN values.

.sum() helps count the number of missing values in each column.

print(df.isna().sum())
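For context, here's a small made-up DataFrame (the Employee, Department, and Salary columns mirror the examples in this post, but the values are invented) and what the missing-value count looks like:

```python
import numpy as np
import pandas as pd

# A small, made-up dataset with deliberate gaps
df = pd.DataFrame({
    'Employee': ['Asha', 'Ben', 'Chloe', 'Dev'],
    'Department': ['IT', None, 'HR', 'IT'],
    'Salary': [72000.0, 55000.0, np.nan, 48000.0],
})

# isna() marks each cell True/False; sum() counts the Trues per column
missing_counts = df.isna().sum()
print(missing_counts)
# Employee      0
# Department    1
# Salary        1
```

Note that .isnull() is simply an alias for .isna(); both return the same boolean frame.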

2. Filling Missing Values

Mean Imputation: Replace missing values with the column mean.

df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

Mode Imputation: Replace missing categorical values with the most frequent value.

df['Department'] = df['Department'].fillna(df['Department'].mode()[0])

Custom Default: Fill specific columns with a default value.

df['Department'] = df['Department'].fillna("Unknown")
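Putting the imputation steps together, here's a minimal sketch on made-up data — mean for the numeric column, mode for the categorical one (note that .mode() returns a Series, which is why the [0] is needed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', None, 'IT', 'HR'],
    'Salary': [60000.0, 50000.0, np.nan, 70000.0],
})

# Numeric column: fill gaps with the column mean (here, 60000.0)
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Categorical column: fill gaps with the most frequent value ('IT')
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])

# After both steps, no NaN values remain
print(df.isna().sum().sum())  # 0
```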

3. Dropping Missing Data

  • Remove rows with missing values:

df.dropna()

  • Remove columns with missing values:

df.dropna(axis=1)
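Both calls return a new DataFrame rather than modifying the original in place. A quick sketch on made-up data, also showing the thresh parameter, which keeps only rows with at least that many non-missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, np.nan, 6.0],
})

rows_kept = df.dropna()           # drop any row containing a NaN
cols_kept = df.dropna(axis=1)     # drop any column containing a NaN
partial   = df.dropna(thresh=1)   # keep rows with at least 1 non-NaN value

print(rows_kept.shape)  # (1, 2) — only the last row is complete
print(cols_kept.shape)  # (3, 0) — both columns contain a NaN
print(partial.shape)    # (2, 2) — only the all-NaN row is dropped
```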

Filtering Data with Precision

Once the missing data was handled, I explored ways to filter datasets for specific conditions. These are some of the filtering techniques I practiced:

  1. Filter Rows by Condition
  • Retrieve rows where Salary is greater than 60,000:
df_salary = df[df['Salary'] > 60000]

2. Select Specific Columns

Extract only the Employee and Salary columns:

df[['Employee', 'Salary']]

3. Chaining Conditions
Filter rows for employees in the IT department with a Salary above 50,000:

df_filtered = df[(df['Department'] == 'IT') & (df['Salary'] > 50000)]
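The same chained condition can also be written with .query(), which some people find more readable. A sketch on made-up data (column names match the examples above):

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Asha', 'Ben', 'Chloe'],
    'Department': ['IT', 'IT', 'HR'],
    'Salary': [52000, 48000, 90000],
})

# Boolean-mask version: each condition in parentheses, combined with &
mask_version = df[(df['Department'] == 'IT') & (df['Salary'] > 50000)]

# Equivalent .query() version
query_version = df.query("Department == 'IT' and Salary > 50000")

print(list(mask_version['Employee']))  # ['Asha']
```

Remember to use & (not the Python keyword and) when combining boolean masks, and to wrap each condition in parentheses.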

Debugging and Common Pitfalls

Today, I encountered some errors, and I learned how to debug them:

  1. KeyError for Missing Columns
  • Ensure column names are exact and free of trailing spaces. Use .columns to inspect them:
print(df.columns)
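If the KeyError turns out to come from stray whitespace (common when reading CSVs), stripping the column names usually fixes it — a minimal sketch:

```python
import pandas as pd

# Made-up frame whose column name has a trailing space
df = pd.DataFrame({'Salary ': [50000, 60000]})

# df['Salary'] would raise a KeyError here; strip the names first
df.columns = df.columns.str.strip()

print(df['Salary'].tolist())  # [50000, 60000]
```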

2. Single vs. Double Brackets

  • This confusion happens because Pandas treats single square brackets ([ ]) differently depending on the context:
  • For selecting a single column, single brackets are fine (and return a Series).
  • For multiple columns, you need a list inside the brackets (which returns a DataFrame).
df[['Employee', 'Salary']]  ## right way to select two or more columns
df['Employee', 'Salary']    ## wrong — Pandas treats this as one tuple-shaped key and raises a KeyError
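A quick demo of the difference on made-up data — one name in single brackets returns a Series, a list inside brackets returns a DataFrame, and the tuple form raises a KeyError:

```python
import pandas as pd

df = pd.DataFrame({'Employee': ['Asha'], 'Salary': [50000]})

series_result = df['Employee']             # one column -> Series
frame_result = df[['Employee', 'Salary']]  # list of columns -> DataFrame

try:
    df['Employee', 'Salary']  # Pandas looks for a single column named ('Employee', 'Salary')
except KeyError:
    print("KeyError: the tuple is treated as one column label")
```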

Takeaways from Day 2

Today, I realized the critical role of clean data in analysis. Without properly handling missing values or applying accurate filters, insights can easily become skewed or misleading. Thankfully, Pandas provides intuitive tools that simplify these tasks, making data preparation a seamless process once you understand the basics. From detecting and filling missing values to filtering datasets with precision, I’ve gained confidence in dealing with messy data scenarios.

Moving forward, I’ll be exploring grouping and aggregation to uncover deeper insights from datasets and experimenting with advanced filtering techniques for more complex queries. If you’re on a similar journey or have questions about Pandas, feel free to connect in the comments. Let’s continue to learn and grow together! 🚀
