Data Cleaning: Inconsistent Data Entry

nishant sethi
Sep 8, 2018 · 5 min read

In my previous post on Data Cleaning, we discussed how to work with different character encodings. In this post, we’re going to learn how to clean up inconsistent text entries. This is going to be the last post on Data Cleaning. To get started, download the dataset I’m going to use in this post; you can download it from my GitHub link.

Here’s what we’re going to do today:

- Get our environment set up
- Do some preliminary text pre-processing
- Use fuzzy matching to correct inconsistent data entry

Let’s get started!

Get our environment set up

The first thing we’ll need to do is load in the libraries we’ll be using. Not our datasets, though: we’ll get to those later!

When I tried to read in the PakistanSuicideAttacks Ver 11 (30-November-2017).csv file the first time, I got a character encoding error, so I'm going to quickly check out what the encoding should be...
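The check looks like this with chardet. The sample file below is a stand-in I’m writing out so the snippet runs on its own; in practice you’d point `path` at the downloaded CSV:

```python
import chardet

# stand-in file saved in Windows-1252 so the sketch is self-contained;
# in practice, point `path` at the real downloaded CSV instead
path = "sample.csv"
with open(path, "wb") as f:
    f.write("City,Notes\nLahore ,café, entrée, naïve, attaché\n".encode("windows-1252"))

# look at the first ten thousand bytes and let chardet guess the encoding
with open(path, "rb") as rawdata:
    result = chardet.detect(rawdata.read(10000))

print(result)
```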

And then read it in with the correct encoding. (If this looks unfamiliar to you, check out my previous post on character encodings.)
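Reading it back in looks like this. Again, the tiny stand-in file is my own so the sketch runs on its own; swap in the real filename and whatever encoding chardet reported:

```python
import pandas as pd

# a tiny stand-in CSV saved in Windows-1252; in practice this would be
# "PakistanSuicideAttacks Ver 11 (30-November-2017).csv"
path = "suicide_attacks_sample.csv"
with open(path, "wb") as f:
    f.write("City,Year\nLahore ,1995\nD.I Khan,2003\n".encode("windows-1252"))

# read the file in with the encoding chardet detected
suicide_attacks = pd.read_csv(path, encoding="Windows-1252")
print(suicide_attacks.head())
```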

Now we’re ready to get started! You can, as always, take a moment here to look at the data and get familiar with it. :)

Do some preliminary text pre-processing

For this exercise, I’m interested in cleaning up the “City” column to make sure there are no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There’s a more efficient way to do this though!
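A quick way to see every distinct spelling at once is to pull the unique values and sort them. The little frame below is a stand-in that reproduces the same inconsistencies the real dataset has:

```python
import pandas as pd

# stand-in for the real dataset, reproducing the inconsistencies
# discussed below ('Lahore' vs 'Lahore ', 'Lakki Marwat' vs 'Lakki marwat')
suicide_attacks = pd.DataFrame({
    "City": ["Lahore", "Lahore ", "Lakki Marwat", "Lakki marwat",
             "D. I Khan", "D.I Khan", "D.G Khan"],
})

# get all the unique values in the 'City' column, sorted alphabetically
cities = suicide_attacks["City"].unique()
cities.sort()
print(cities)
```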

Just looking at this, I can see some problems due to inconsistent data entry: ‘Lahore’ and ‘Lahore ‘, for example, or ‘Lakki Marwat’ and ‘Lakki marwat’.

The first thing I’m going to do is make everything lower case (I can change it back at the end if I like) and remove any white spaces at the beginning and end of cells. Inconsistencies in capitalizations and trailing white spaces are very common in text data and you can fix a good 80% of your text data entry inconsistencies by doing this.
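Both fixes are one-liners with pandas string methods (shown here on the stand-in frame from above):

```python
import pandas as pd

# stand-in frame with capitalization and trailing-space inconsistencies
suicide_attacks = pd.DataFrame({
    "City": ["Lahore", "Lahore ", "Lakki Marwat", "Lakki marwat"],
})

# convert to lower case and strip leading/trailing white space
suicide_attacks["City"] = suicide_attacks["City"].str.lower().str.strip()
print(suicide_attacks["City"].unique())
```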

Use fuzzy matching to correct inconsistent data entry

Alright, let’s take another look at the city column and see if there’s any more data cleaning we need to do.

It does look like there are some remaining inconsistencies: ‘d. i khan’ and ‘d.i khan’ should probably be the same. (I looked it up and ‘d.g khan’ is a separate city, so I shouldn’t combine those.)

I’m going to use the fuzzywuzzy package to help identify which strings are closest to each other. This dataset is small enough that we could probably correct errors by hand, but that approach doesn’t scale well. (Would you want to correct a thousand errors by hand? What about ten thousand? Automating things as early as possible is generally a good idea. Plus, it’s fun! :)

Fuzzy matching: The process of automatically finding text strings that are very similar to the target string. In general, a string is considered “closer” to another one the fewer characters you’d need to change if you were transforming one string into the other. So “apple” and “snapple” are two changes away from each other (add “s” and “n”) while “in” and “on” are one change away (replace “i” with “o”). You won’t always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.

Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we’re going to get the ten strings from our list of cities that have the closest distance to “d.i khan”.

We can see that two of the items in the cities are very close to “d.i khan”: “d. i khan” and “d.i khan”. We can also see that “d.g khan”, which is a separate city, has a ratio of 88. Since we don’t want to replace “d.g khan” with “d.i khan”, let’s replace all rows in our City column that have a ratio of > 90 with “d.i khan”.

To do this, I’m going to write a function. (It’s a good idea to write a general purpose function you can reuse if you think you might have to do a specific task more than once or twice. This keeps you from having to copy and paste code too often, which saves time and can help prevent mistakes.)

And now let’s check the unique values in our City column again and make sure we’ve tidied up “d.i khan” correctly.

Excellent! Now we only have “d.i khan” in our data frame and we didn’t have to change anything by hand.

And that’s it. This is the last post about Data Cleaning. Next week, I’m going to publish my second post, A Machine Learning Classification Model in Python — Part II.

I’ll keep posting these types of articles on Machine Learning, so you can follow me to get all the updates, and you can also appreciate my work by clicking the clap button.

I welcome feedback and constructive criticism and can be reached on Facebook.

Happy coding!!

Data Driven Investor

from confusion to clarity, not insanity

Written by nishant sethi, Senior Software Developer at Infosys.
