Day 2 — Mastering Pandas: Filling Missing Data and Filtering with Confidence
Hello, readers! 👋 On my second day of delving deeper into the world of Python and Pandas, I encountered some key real-world challenges in data manipulation. One of the most important skills in the journey of data analysis is understanding how to handle missing data and filter datasets efficiently. Today, I explored some practical methods and nuances that I’d love to share with you!
Tackling Missing Data Like a Pro
Data is rarely perfect. Missing values are common and can disrupt analyses if not handled correctly. Here’s what I learned today about detecting, filling, and dropping missing values:
1. Detecting Missing Values
Use .isna() or .isnull() to identify NaN values. .sum() helps count the number of missing values in each column.
print(df.isna().sum())
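To make the detection step concrete, here is a minimal sketch on a tiny made-up DataFrame. The employee names and values are purely illustrative assumptions, not data from my actual practice file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; None and np.nan mark the missing entries.
df = pd.DataFrame({
    "Employee": ["Asha", "Ben", "Chen", "Dina"],
    "Department": ["IT", None, "HR", "IT"],
    "Salary": [72000.0, 58000.0, np.nan, np.nan],
})

# Missing values per column (True counts as 1 when summed).
missing_per_column = df.isna().sum()

# Missing values across the whole DataFrame.
total_missing = int(missing_per_column.sum())

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(total_missing)
print(rows_with_missing)
```

Looking at the affected rows, not just the counts, helped me decide whether filling or dropping made more sense.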
2. Filling Missing Values
Mean Imputation: Replace missing values with the column mean.
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
Mode Imputation: Replace missing categorical values with the most frequent value.
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])
Custom Default: Fill specific columns with a default value.
df['Department'] = df['Department'].fillna("Unknown")
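The three filling strategies above can be tried side by side in a small sketch. The sample data is invented for illustration, and I work on copies so the original DataFrame stays untouched for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing Salary and one missing Department.
df = pd.DataFrame({
    "Employee": ["Asha", "Ben", "Chen", "Dina"],
    "Department": ["IT", None, "HR", "IT"],
    "Salary": [70000.0, 50000.0, np.nan, 60000.0],
})

# Mean imputation for the numeric column, mode imputation for the categorical one.
filled = df.copy()
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].mean())
filled["Department"] = filled["Department"].fillna(filled["Department"].mode()[0])

# Alternative: mark the gap explicitly instead of guessing a department.
labelled = df.copy()
labelled["Department"] = labelled["Department"].fillna("Unknown")

print(filled)
print(labelled)
```

Both .mean() and .mode() skip missing values by default, so the fill value is computed only from the entries that are actually present.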
3. Dropping Missing Data
Remove rows with missing values (dropna() returns a new DataFrame, so assign the result to keep it):
df_rows = df.dropna()
Remove columns with missing values:
df_cols = df.dropna(axis=1)
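Here is a small sketch of how the dropping options behave on made-up data. The subset argument is not something I covered above; I include it as one extra option for when only certain columns have to be complete.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing Department and one missing Salary.
df = pd.DataFrame({
    "Employee": ["Asha", "Ben", "Chen", "Dina"],
    "Department": ["IT", None, "HR", "IT"],
    "Salary": [70000.0, 50000.0, np.nan, 60000.0],
})

rows_dropped = df.dropna()                      # keep only fully complete rows
cols_dropped = df.dropna(axis=1)                # keep only fully complete columns
salary_required = df.dropna(subset=["Salary"])  # only a missing Salary removes a row

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
print(salary_required["Employee"].tolist())
print(df.shape)  # the original DataFrame is unchanged
```

Dropping columns can be drastic: a single missing value is enough to remove an entire column, so it is worth checking the counts first.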
Filtering Data with Precision
Once the missing data was handled, I explored ways to filter datasets for specific conditions. These are some of the filtering techniques I practiced:
1. Filter Rows by Condition
Retrieve rows where Salary is greater than 60,000:
df_salary = df[df['Salary'] > 60000]
2. Select Specific Columns
Extract only the Employee and Salary columns:
df[['Employee', 'Salary']]
3. Chaining Conditions
Filter rows for employees in the IT department with a Salary above 50,000:
df_filtered = df[(df['Department'] == 'IT') & (df['Salary'] > 50000)]
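The filtering techniques above can be combined in one runnable sketch. The data and the variable names (is_it, well_paid, and so on) are my own illustrative choices. One thing worth knowing: each condition needs its own parentheses because & binds more tightly than comparison operators such as == and >, and Python's plain and / or do not work on a whole column.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Employee": ["Asha", "Ben", "Chen", "Dina"],
    "Department": ["IT", "HR", "IT", "Finance"],
    "Salary": [72000, 58000, 48000, 65000],
})

# A single condition produces a boolean mask that selects matching rows.
high_earners = df[df["Salary"] > 60000]

# Naming the masks keeps chained conditions readable.
is_it = df["Department"] == "IT"
well_paid = df["Salary"] > 50000

it_well_paid = df[is_it & well_paid]             # both conditions must hold
it_or_high = df[is_it | (df["Salary"] > 60000)]  # either condition may hold

# .loc filters rows and selects columns in one step.
names_and_pay = df.loc[is_it & well_paid, ["Employee", "Salary"]]

print(high_earners)
print(it_well_paid)
print(it_or_high)
print(names_and_pay)
```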
Debugging and Common Pitfalls
Today, I encountered some errors, and I learned how to debug them:
1. KeyError for Missing Columns
Ensure column names are exact and free of leading or trailing spaces. Use .columns to inspect column names:
print(df.columns)
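As a rough sketch of this fix, the snippet below builds a DataFrame whose headers carry stray spaces (a made-up example of what can happen after loading a file), shows the lookup failing, and then cleans the names with .str.strip().

```python
import pandas as pd

# Hypothetical DataFrame whose column names carry stray spaces.
df = pd.DataFrame({"Employee ": ["Asha", "Ben"], " Salary": [72000, 58000]})

# Printing the names as a list shows the quotes, which makes stray spaces visible.
print(df.columns.tolist())

# The exact name "Salary" does not exist yet, so this lookup raises KeyError.
try:
    df["Salary"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Strip leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()
salaries = df["Salary"].tolist()

print(lookup_failed)
print(df.columns.tolist())
print(salaries)
```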
2. Single vs. Double Brackets
Pandas treats single square brackets ([ ]) differently depending on what is inside them:
- For single column selection, single brackets are fine.
- For multiple columns, you need a list inside the brackets.
df[['Employee', 'Salary']] ## Right way to select 2 or more columns
df['Employee', 'Salary'] ## Wrong way: raises KeyError, because Pandas looks for one column named ('Employee', 'Salary')
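To see the difference in practice, here is a minimal sketch on invented data. It checks what type each form of selection returns and confirms that the comma-separated form fails.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Employee": ["Asha", "Ben"],
    "Department": ["IT", "HR"],
    "Salary": [72000, 58000],
})

one_column = df["Employee"]               # single brackets: a Series
one_column_frame = df[["Employee"]]       # list with one name: a one-column DataFrame
two_columns = df[["Employee", "Salary"]]  # list with two names: a DataFrame

# Without the inner list, Pandas receives the tuple ("Employee", "Salary")
# and searches for a single column with that label.
try:
    df["Employee", "Salary"]
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
print(tuple_lookup_failed)
```

The outer brackets do the selecting, and the inner brackets are just an ordinary Python list of column names.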
Takeaways from Day 2
Today, I realized the critical role of clean data in analysis. Without properly handling missing values or applying accurate filters, insights can easily become skewed or misleading. Thankfully, Pandas provides intuitive tools that simplify these tasks, making data preparation a seamless process once you understand the basics. From detecting and filling missing values to filtering datasets with precision, I’ve gained confidence in dealing with messy data scenarios.
Moving forward, I’ll be exploring grouping and aggregation to uncover deeper insights from datasets and experimenting with advanced filtering techniques for more complex queries. If you’re on a similar journey or have questions about Pandas, feel free to connect in the comments. Let’s continue to learn and grow together! 🚀