A Practical Guide To Data Cleaning For Beginners

Kelechi
Coinmonks
Apr 7, 2018 · 9 min read


As beginner data scientists, we are all eager to start building models without properly dealing with the data first. This is a common pitfall that leads to poor models. To improve the quality of your models, you need to clean your data.

I am interested, but can you provide a definition for data cleaning?

Sure, I got you.

According to Wikipedia, Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Scary, right?

Now, forget that very long sentence. Let me break it down for you in a simple sentence.

Data cleaning is basically the act of identifying and correcting false, incomplete or irrelevant parts of data.

Sounds better, yeah?

Okay, but what is the importance of this data cleaning?

According to an article on Forbes, by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet. This means that different types and sources of data will be at play, and that could prove to be messy.

Yes, there is usually a lot of buzz around data science algorithms and very little around data cleaning. This is simply because data cleaning is the least fun part of data science.

If data science is the sexiest job of the 21st century, then data cleaning is the most unsexy part of data science.

The importance of data cleaning in training a model cannot be overstated. No matter the algorithm you use — if your data is bad, you will get bad results. Professional data scientists know this and have revealed that data cleaning takes up to 70% of the time spent on a data science project.

Better, cleaner data beats the best algorithms. If you use a very simple algorithm on very clean data, you will get impressive results. However, if you use the best algorithm on messy data, you will most likely not get the results you want.

Essentially, it is garbage in, garbage out.

Enough of all this! How do I perform this data cleaning thing?

Jeez. Chill out! We were just getting to that. Before we get started, it is advised that you open the code examples in the GitHub repo linked at the end of this article, so you can follow along.

1. Dealing with Missing Values: This is perhaps the most common trait of unclean data. These values usually take the form of NaN or None. Missing values can also lead to our models predicting NaN values, which we do not want. So, how do we tackle these missing values?

The data used in the code example contains information about building permits issued in San Francisco from Jan 1st, 2013 to Feb 25th, 2018.

After reading in your data, you need to first establish that there are missing values in the dataset.

It is important to also check the percentage of data missing. This will give you an idea of how to handle the missing values. We can also visualize how much data is missing per column.
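As a rough sketch of what those checks can look like with pandas (the file name Building_Permits.csv is an assumption about the Kaggle copy of this dataset, so adjust it to wherever your file lives):

```python
import pandas as pd
import matplotlib.pyplot as plt

# read in the San Francisco building permits data
permits = pd.read_csv("Building_Permits.csv")

# number of missing values per column
missing_per_column = permits.isnull().sum()
print(missing_per_column.sort_values(ascending=False).head(10))

# overall percentage of cells that are missing
total_cells = permits.shape[0] * permits.shape[1]
print(f"{missing_per_column.sum() / total_cells:.1%} of all cells are missing")

# visualize the fraction of missing values per column
(missing_per_column / len(permits)).sort_values().plot(kind="barh", figsize=(8, 12))
plt.xlabel("Fraction of values missing")
plt.tight_layout()
plt.show()
```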

There are so many missing values!

Now that we have established that there are missing values in the dataset, how do we handle them?

In order to handle missing data, it is important to identify the cause of the missing data. This will guide us in deciding on the best way to handle them.

There are several causes of missing data values. Some values could be missing because they do not exist, others could be missing because of improper collection of data or poor data entry. For example, if someone is single and a question applies to married people, then the question will contain a missing value. In cases like this, it would be wrong to fill in a value for that question.

In our data, for instance, Street Number Suffix does not always exist, so it would not be ideal to fill it up. However, the missing locations, zipcodes and permit expiration date can be filled up.

There are several ways to fill up missing values, and we will explore some of them here.

We can decide to drop the rows or columns with missing values. This is only advised when the percentage of missing values in a row or column is so high that it does not make sense to keep the column or row.

For instance, the TIDF Compliance and the Voluntary Soft-Story Retrofit columns have close to 100% missing values, so it makes sense to drop them. We can also decide to drop any column where missing values make up at least 60% of the data.

The same can also be done for the rows of any dataset by specifying axis=0.
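A minimal sketch of both options, continuing with the permits DataFrame from above (the exact column spellings, such as TIDF Compliance, are assumptions about the Kaggle copy of the data):

```python
# drop the two columns that are almost entirely empty
permits = permits.drop(columns=["TIDF Compliance", "Voluntary Soft-Story Retrofit"],
                       errors="ignore")   # ignore them if they are already gone

# drop any column where at least 60% of the values are missing
frac_missing = permits.isnull().mean()               # fraction of NaNs per column
cols_to_drop = frac_missing[frac_missing >= 0.6].index
permits = permits.drop(columns=cols_to_drop)

# the same idea for rows: dropna with axis=0 removes every row that still has a NaN
rows_dropped = permits.dropna(axis=0)
print(permits.shape, rows_dropped.shape)
```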

If we are dealing with numerical values, we can decide to fill all null values with 0. We can also fill all NA values with the mean of the column, or the median. All of these decisions depend on the type of data, what you want to accomplish with the data, and the cause of the missing values.
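Here is a short sketch of those options on the numeric columns of the permits data:

```python
# work on the numeric columns so that the mean/median options make sense
numeric_cols = permits.select_dtypes(include="number").columns

# option 1: replace every NaN with 0
filled_with_zero = permits[numeric_cols].fillna(0)

# options 2 and 3: replace every NaN with the column's mean (or median)
filled_with_mean = permits[numeric_cols].fillna(permits[numeric_cols].mean())
filled_with_median = permits[numeric_cols].fillna(permits[numeric_cols].median())

print(filled_with_zero.isnull().sum().sum())   # 0 -> no missing values left
```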

As you can see, the missing values have been replaced with 0

Finally, we can also decide to fill each missing value with whatever value comes directly below it in the same column. We will also specify that any missing value which has no value below it will be filled with 0.
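In pandas, that is a backward fill followed by a plain zero fill, roughly like this:

```python
# fill each NaN with the next valid value below it in the same column,
# then fill anything that is still missing (nothing below it) with 0
permits_bfilled = permits.bfill().fillna(0)
print(permits_bfilled.isnull().sum().sum())    # 0 -> no missing values left
```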

2. Inconsistent Data Entry: This occurs when there are different unique values in a column which are meant to be the same. There are other types of inconsistent data entries as well, but this will be our focus here. The dataset to be used here contains information about suicide attacks in Pakistan from 1995 to 2016. As usual, we will first import all relevant libraries and read in our data.

We will handle the inconsistent entries in the Province column. To check for all unique values, we do this:
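A sketch of that step (the file name is an assumption about the Kaggle copy of the dataset, and the encoding argument may be needed if your copy of the file is not UTF-8 encoded):

```python
import pandas as pd

# read in the suicide attacks data; swap in the path and encoding of your copy
attacks = pd.read_csv("PakistanSuicideAttacks.csv", encoding="latin-1")

# all unique values in the Province column
print(sorted(attacks["Province"].dropna().unique()))
```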

We can guess that ‘Balochistan’ and ‘Baluchistan’ are probably meant to be the same value. A simple Google search suggests that we are right. Also, ‘FATA’ and ‘Fata’ are most likely meant to be the same.

One of the ways to remove these inconsistencies is to strip whitespace before and after entry names and to convert everything to lower case. For whitespace, if we had entries like ‘ fata’ and ‘fata ’, they would be counted as two different unique values.
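Both fixes are one line each with pandas string methods:

```python
# lower-case everything and strip leading/trailing whitespace
attacks["Province"] = attacks["Province"].str.lower().str.strip()
print(sorted(attacks["Province"].dropna().unique()))
```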

Nice, we can see that ‘fata’ and ‘FATA’ have been unified.

We still have a problem though. We need to unify ‘balochistan’ and ‘baluchistan’ entries.
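A simple replace takes care of that:

```python
# treat the two spellings as one value
attacks["Province"] = attacks["Province"].replace("balochistan", "baluchistan")
print(sorted(attacks["Province"].dropna().unique()))
```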

Niceee. Seems like we have solved it all.

NB: If we have a large number of inconsistent unique entries, we cannot scan for the closest matches by eye like we just did. Instead, we can use fuzzy matching, which helps identify which strings are most likely to refer to the same thing. A package called FuzzyWuzzy does this well: it takes in two strings and returns a ratio, and the closer the ratio is to 100, the more similar the two strings are.
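A small sketch of how FuzzyWuzzy can be used here (the exact scores you get back depend on the scorer you pick):

```python
from fuzzywuzzy import fuzz, process   # pip install fuzzywuzzy

# a ratio close to 100 means the two strings are nearly identical
print(fuzz.ratio("baluchistan", "balochistan"))

# find the values in the column that look most like a given string
candidates = attacks["Province"].dropna().unique()
matches = process.extract("baluchistan", candidates,
                          limit=5, scorer=fuzz.token_sort_ratio)
print(matches)
```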

3. Handling Dates and Times: We are often faced with dates in our datasets, and, as beginners, this can prove daunting. It is even worse when a column mixes inconsistent date formats, for instance dd/mm/yy and mm/dd/yy in the same column. Also, our date values might not be stored in the right data type, which stops us from effectively manipulating them and getting insight from them. Fear not though, as we will solve all of that here.

As usual, we will first import all relevant libraries and read in our data.

The data to be used here is a record of the date, time, location, depth, magnitude, and source of every earthquake with a reported magnitude of 5.5 or higher since 1965, compiled by the National Earthquake Information Center (NEIC).

If we take a look at the NumPy documentation, the dtype ‘O’ corresponds to the generic Python object data type, which is not the datetime data type.
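A quick check makes this concrete (the file name is an assumption about the Kaggle copy of the NEIC data):

```python
import pandas as pd

# read in the earthquake data
earthquakes = pd.read_csv("earthquake_database.csv")

print(earthquakes["Date"].head())
print(earthquakes["Date"].dtype)   # object -- shown as dtype('O')
```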

We proceed to convert the date column from the object data type to the datetime data type. pandas provides the to_datetime function to handle this.
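A hedged sketch of the conversion (note that newer pandas versions deprecate the infer_datetime_format argument and accept format="mixed" instead for columns that mix formats):

```python
# ask pandas to infer the format of each date
earthquakes["date_parsed"] = pd.to_datetime(earthquakes["Date"],
                                            infer_datetime_format=True)

print(earthquakes["date_parsed"].dtype)   # datetime64[ns]
```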

As we can see, the data type has correctly changed to the datetime data type, and now we can perform various date time manipulations on the column because of its data type.

If all your dates are in one known format, you can pass that format to to_datetime when converting. However, our dates are not all in a single format, so we asked the function to infer the format automatically.
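For comparison, this is roughly what the explicit version would look like if every entry really followed a single pattern such as MM/DD/YYYY (errors="coerce" turns any entry that does not match into NaT instead of raising an error):

```python
dates_strict = pd.to_datetime(earthquakes["Date"], format="%m/%d/%Y", errors="coerce")
print(dates_strict.isnull().sum())   # rows that did not match the format are now NaT
```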

Hence, we need to be sure all our dates converted well since we asked the function to infer the format. A simple mistake in data entry could lead to a wrong inference.

To do this, we will plot a bar chart of the months and make sure the highest month that appears is 12.
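Something along these lines:

```python
import matplotlib.pyplot as plt

# count how many earthquakes fall in each month of the year
months = earthquakes["date_parsed"].dt.month
months.value_counts().sort_index().plot(kind="bar")
plt.xlabel("Month")
plt.ylabel("Number of earthquakes")
plt.show()
```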

As you can see, the highest month is 12, so we are good to go.

We can also do the same for the days of the month. The highest value should be 31, and there should be a dip at 31 since not all months have 31 days.
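And the same idea for the day of the month:

```python
# count how many earthquakes fall on each day of the month
days = earthquakes["date_parsed"].dt.day
days.value_counts().sort_index().plot(kind="bar", figsize=(10, 4))
plt.xlabel("Day of the month")
plt.ylabel("Number of earthquakes")
plt.show()
```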

Nice, our hypothesis is correct. Seems like we got this one right.

4. Scaling and Normalization: Let us take them one by one:

Scaling puts your numerical features on comparable ranges. It matters most for algorithms that treat a change of one unit in one numerical feature as equivalent to a change of one unit in another. Scaling ensures that a feature is not used as the main predictor just because its values are big. For instance, if a person's age and salary are both used for prediction, some algorithms will pay more attention to the salary simply because its numbers are larger, and an increase of 1 in age will be treated the same as an increase of 1 in salary. That does not make sense, so it makes sense to scale your data.
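As a tiny illustration of the idea, here is min-max scaling with scikit-learn on a made-up age/salary table (the numbers are invented purely for illustration, and min-max scaling is just one of several possible scaling methods):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# tiny made-up frame: salary dwarfs age before scaling
people = pd.DataFrame({"age":    [23, 35, 47, 59],
                       "salary": [40_000, 55_000, 90_000, 120_000]})

scaler = MinMaxScaler()                       # rescales every column to the 0-1 range
scaled = pd.DataFrame(scaler.fit_transform(people), columns=people.columns)
print(scaled)
```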

Normalization involves transforming your data so that its distribution looks more like a normal distribution. This is useful when you want to use algorithms that work well with normally distributed data. Some algorithms, like SVMs, converge far faster on normalized data, so it makes sense to normalize your data to get better results.

We will normalize the data and visualize it before and after normalization.

The data to be used here contains information about over 300,000 Kickstarter projects.
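One common way to do this is a Box-Cox transformation of the campaign goal amounts. A hedged sketch (the file name ks-projects-201801.csv and the column name usd_goal_real are assumptions about the Kaggle copy of the data, and Box-Cox is just one of several normalization techniques):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

kickstarters = pd.read_csv("ks-projects-201801.csv")

# Box-Cox needs strictly positive values
goals = kickstarters["usd_goal_real"]
positive_goals = goals[goals > 0]

normalized, _ = stats.boxcox(positive_goals)

# compare the distributions before and after normalization
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(positive_goals, bins=50)
axes[0].set_title("Original goal amounts")
axes[1].hist(normalized, bins=50)
axes[1].set_title("After Box-Cox normalization")
plt.show()
```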

Nice, we can see that our data now resembles a normal distribution

Conclusion

There are several other forms of data cleaning that might need to be done on data. However, this is a good primer for absolute beginners. The need to properly arrange and tidy up your data before the formulation of any model cannot be overstated. I hope this article has helped spark that interest in you! :)

Well-explained code examples are in this GitHub repo.

PS: For further reading on handling missing data, you can read this brilliant article.
