Data cleaning in Python: dealing with duplicates and missing values with 3 aces up your sleeve

Luchiana Dumitrescu
Women in Technology
5 min read · Nov 12, 2023

New project = new challenges when it comes to dealing with duplicates and missing values. But let’s be honest, who doesn’t like a bit of spice in their life?

Starting a new project feels like exploring uncharted territory full of surprises. However, along the way, we might encounter some uninvited guests, named Duplicates and Missing Values, who seem to make themselves right at home. Join me on this journey and we’ll learn the art of handling these situations and showing our two unwanted guests the door.

In this previous article, I shared with you the primary strategies that guided us through the initial phase of the data-cleaning process, where understanding the dataset took center stage. Now, our journey continues as we delve into the art of handling duplicates and addressing missing values that have surfaced in our dataset.

Ready? Let’s do it 🫧

Dealing with missing data

There are 2 main ways to handle missing values:

1. Replace them with a generic value, such as ‘Unknown’ (or any other value you find suitable)

This can be done using the fillna() function; as its name suggests, it helps us fill null values with something more appropriate.

The isna() function helped me count the NULL values in my dataset:
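A minimal sketch of that count, assuming the data lives in a pandas DataFrame named df loaded from a CSV file (both the variable name and the file name are my assumptions):

```python
import pandas as pd

# load the dataset (hypothetical file name)
df = pd.read_csv("netflix_titles.csv")

# isna() flags every missing cell; sum() totals the flags per column
print(df.isna().sum())
```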

As we can see, columns like director, cast, and country have more than 800 missing values each, so the next step is to replace them by applying the fillna() function, as demonstrated below:
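One way that replacement could look, using the column names mentioned above and ‘Unknown’ as the placeholder value:

```python
# fill the missing values in the affected columns with a generic placeholder
for col in ['director', 'cast', 'country']:
    df[col] = df[col].fillna('Unknown')
```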

After doing so, it’s essential to verify whether any missing values persist in these columns:
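A quick re-check along the same lines:

```python
# re-count the missing values in the columns we just filled; all should be 0
print(df[['director', 'cast', 'country']].isna().sum())
```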

Ta-da! The magic is complete and our diligent use of the fillna() function has successfully banished the NULL values from our dataset 🥳

2. Delete the rows that contain the missing values — usually NOT RECOMMENDED, talk with your client before taking this approach.

We already saw that we have more than 800 NULL values in the ‘cast’ column, so let’s say we want to delete those rows. To do so, we need the dropna() function:
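A sketch of that call (this assumes we are working on the original, unfilled data, since option 1 above would already have replaced those NULLs):

```python
# drop only the rows where the 'cast' value is missing
df = df.dropna(subset=['cast'])
```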

After executing this line, let’s check if we still have missing values in the ‘cast’ column:
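```python
# count the remaining missing values in 'cast'
print(df['cast'].isna().sum())
```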

and see that now we have 0 missing values in our column.

To drop rows with missing values across the entire dataset, use the same dropna() function, but without specifying the subset parameter:
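```python
# drop every row that contains at least one missing value, in any column
df = df.dropna()
```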

We sent Missing Values home. Pretty cool, right?! 😉

Dealing with duplicated data

We’re all human and it’s normal to make mistakes, so having duplicate data isn’t something we can completely avoid. The good news is that we can deal with it (but let’s aim for fewer mistakes next time 🙂). So, to have a clean and ready dataset for further tasks, we need to remove duplicate values using the drop_duplicates() function.

For simplicity, we have this dataset:
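Since the original table was shown as an image, here is a small reconstruction of that kind of dataset (the names and values are illustrative):

```python
import pandas as pd

# toy dataset with an obvious duplicate row
people = pd.DataFrame({
    'name': ['John', 'Maria', 'John', 'Alex'],
    'age':  [28, 34, 28, 41],
    'city': ['London', 'Paris', 'London', 'Berlin'],
})
print(people)
```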

I’m sure your experienced eyes have already noticed that John appears twice, and we need to eliminate one of the two rows:
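Using the toy DataFrame above, removing the duplicate is a one-liner:

```python
# keep only the first occurrence of each fully identical row
people = people.drop_duplicates()
```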

So, the cleaned dataset looks like this:
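With the toy data above, the result would be:

```python
print(people)
#     name  age    city
# 0   John   28  London
# 1  Maria   34   Paris
# 3   Alex   41  Berlin
```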

You can also treat rows as duplicates based on certain columns only by adding the subset parameter, like below:
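For example, to de-duplicate based on the name column alone:

```python
# rows count as duplicates if they share the same 'name', regardless of other columns
people = people.drop_duplicates(subset=['name'])
```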

Easy, right? 😁

Conclusion

The functions discussed here — dropna(), fillna(), and drop_duplicates() — are indispensable allies in this quest for cleaner, more reliable data.

In some of my projects, I also used loc to address specific issues I found in my dataset. For example, when a few rows contained NULL values that I could actually look up, I opted to fill them with the real data using the following approach:
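Something along these lines, where the filter condition and the filled-in value are purely illustrative:

```python
# when the correct value is known, fill it in directly instead of using a placeholder
df.loc[df['title'] == 'Some Specific Show', 'director'] = 'Jane Doe'
```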

So, the next time you find yourself face-to-face with Mrs. Missing Values or Mr. Duplicate Data, take Pandas as your ally and be confident in your ability to usher them out the door, leaving behind a dataset that’s ready for analysis and exploration. Happy coding!

Thank you so much for your support, it means a lot to me.

If you found this article interesting and helpful, you have the option to support my work here ☕😊

P.S.: Visit my Medium and embark on an exciting journey of discovery today. Happy reading!

