Importance of Data Scrubbing

Saranya Mandava
Sep 18, 2018


In recent times, the most popular word across the web is “Data”. The immediate questions that come to mind are: What does data look like? What should be done to our data to make it useful?

Data is information in raw or unorganized form that can come from a variety of sources.

Without cleaning, our data looks ugly and any analysis built on it might be incomplete. “Dirty” data is sometimes hard to find and eradicate, but data cleaning techniques help us produce a spotless analysis.

What is Data Cleaning?

Data cleaning, also known as data scrubbing, is the process of detecting and correcting or removing inaccurate records. This involves identifying incomplete or irrelevant parts of the data and then replacing, modifying, or deleting the coarse data while ensuring that the data stays consistent with a common set of rules.

Data cleaning is especially useful when we have multiple sources of data that do not communicate with each other and we need to merge them in order to draw conclusions.
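To see why merging needs clean data, here is a minimal sketch with pandas. The two tables, their column names, and the `normalize` helper are all made up for illustration; they are not part of the Titanic walkthrough that follows.

```python
import pandas as pd

# Two hypothetical sources that spell the same port differently.
bookings = pd.DataFrame({"port": ["Southampton ", "cherbourg"],
                         "passengers": [3, 2]})
ports = pd.DataFrame({"port": ["southampton", "Cherbourg"],
                      "country": ["England", "France"]})


def normalize(series):
    """Trim surrounding whitespace and lowercase every value."""
    return series.str.strip().str.lower()


# Merging the raw keys finds no exact matches, so the result is empty.
raw_merge = bookings.merge(ports, on="port")

# After applying one common rule to both sources, the keys line up.
bookings["port"] = normalize(bookings["port"])
ports["port"] = normalize(ports["port"])
clean_merge = bookings.merge(ports, on="port")
```

Here `raw_merge` has no rows, while `clean_merge` has one row per port.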

Now, let us walk through a manual data cleaning process with the Titanic dataset, which has 15 columns and 891 rows.

Load the Titanic dataset.
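The loading code is not shown here, so the following is only a sketch. It assumes the data is the example copy bundled with seaborn, whose column names (alive, who, embark_town and so on) match the ones discussed in this post. The helper name `load_titanic` and its optional `path` argument are hypothetical.

```python
import pandas as pd


def load_titanic(path=None):
    """Return the Titanic data as a DataFrame.

    If `path` is given, read a local CSV copy; otherwise fall back to
    seaborn's example dataset (assumed source, fetched on first use).
    """
    if path is not None:
        return pd.read_csv(path)
    import seaborn as sns  # imported lazily so a local CSV needs only pandas
    return sns.load_dataset("titanic")
```

Calling `titanic = load_titanic()` and then inspecting `titanic.shape` is a quick way to confirm the row and column counts before cleaning.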

The first step involves identifying and deleting duplicate records.

There are 107 duplicate records, so let’s go ahead and drop them.
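One way to count and drop duplicates with pandas is sketched below. It uses a tiny hand-made table as a stand-in for the Titanic data, so the count it produces is for the toy rows only.

```python
import pandas as pd

# Toy stand-in for the Titanic table: rows 0 and 2 are identical.
passengers = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": [22.0, 38.0, 22.0, 26.0],
    "fare": [7.25, 71.28, 7.25, 7.92],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = passengers.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = passengers.drop_duplicates().reset_index(drop=True)
```

On the real dataset, the same two calls would be applied to the full DataFrame instead of `passengers`.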

Next, check whether there are any redundant columns.

Upon analyzing the column names and the associated data, we find that survived is an alias of alive, pclass is an alias of class, adult_male and who are aliases for sex, and embarked is an alias for embark_town.
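A rough way to test whether two columns carry the same information is to check that their values map one-to-one. The sketch below does this on a toy table; the `is_alias` helper is hypothetical, and which column of each pair gets dropped is a choice made here for illustration.

```python
import pandas as pd

# Toy stand-in with two alias pairs and one unrelated column.
passengers = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "alive": ["no", "yes", "yes", "no"],
    "embarked": ["S", "C", "S", "Q"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Queenstown"],
    "age": [22.0, 38.0, 26.0, 35.0],
})


def is_alias(df, col_a, col_b):
    """True if the values of col_a and col_b map one-to-one onto each other."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return pairs[col_a].is_unique and pairs[col_b].is_unique


# Candidate pairs as (column to keep, column to drop).
candidates = [("survived", "alive"), ("embark_town", "embarked")]
redundant = [drop for keep, drop in candidates
             if is_alias(passengers, keep, drop)]

reduced = passengers.drop(columns=redundant)
```

After this, `reduced` keeps only survived, embark_town and age.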

After removing the redundant columns from our dataset, we are left with fewer columns to analyze. The next step in the process is to identify the columns with null values.

So, there are only three columns with null values: age, deck and embark_town.

Next, check how many null values each of these three columns has.
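Both checks, which columns contain nulls and how many each has, can be done with `isnull().sum()`. This is a sketch on a toy table with made-up values, so the counts are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in: three columns with gaps and one complete column.
passengers = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [None, "C", None, "C"],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() flags each missing cell; sum() counts the flags per column.
null_counts = passengers.isnull().sum()

# Keep only the columns that actually have missing values.
columns_with_nulls = null_counts[null_counts > 0]
```

Here `columns_with_nulls` lists age, deck and embark_town with their counts, and leaves out fare because it has no gaps.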

What should we do to these null values?

We should fill them with relevant values by analyzing the dataset. In this case, let’s fill the null values for deck and embark_town with the values from the rows above them. This makes sense, since family members travel together.
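Filling a null with the value from the row above is a forward fill, which pandas exposes as `ffill()`. A minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in: gaps in deck and embark_town.
passengers = pd.DataFrame({
    "deck": ["C", None, "E", None],
    "embark_town": ["Southampton", None, "Cherbourg", "Cherbourg"],
})

filled = passengers.copy()

# ffill() copies the last seen non-null value down into each gap.
filled[["deck", "embark_town"]] = filled[["deck", "embark_town"]].ffill()
```

One thing to watch for: a null in the very first row has no row above it, so a forward fill leaves it empty.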

That is not the case with the age column, so let’s fill the null values of this column with the mean value.
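Mean imputation for age can be sketched as follows, again on a toy column with made-up ages:

```python
import numpy as np
import pandas as pd

# Toy stand-in: two missing ages.
passengers = pd.DataFrame({"age": [20.0, np.nan, 30.0, np.nan, 40.0]})

# mean() skips NaN values by default, so only known ages are averaged.
mean_age = passengers["age"].mean()

# Replace every missing age with that average.
passengers["age"] = passengers["age"].fillna(mean_age)
```

With these toy values the known ages average to 30, so both gaps are filled with 30.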

Now our data is ready for merging with other data sources or for analysis.

References:

https://www.cc.gatech.edu/~xchu33/SIGMOD2016Tutorial.pdf
