Scrub your data until it speaks out stories!

According to a big data article published in Forbes by Press (2016), data scientists spend 60% of their time on pre-processing and organizing data. Well, that's a bit too much! Aren't they supposed to be working on researching and developing new machine-learning algorithms? So, why spend all that time on pre-processing, and what exactly does it mean?

https://goo.gl/T42f6h

It simply means scrubbing the dirt off the data and reshaping it so that it is easier to visualize, easier to read, and easier to derive meaning from.

Ahh! Sounds simple. Erm, but what kind of dirt is a data set associated with? What does it look like? How do we know it's dirt and that it needs to be removed? And what kind of changes need to be made?

Well, to answer all these questions, let's dive a bit deeper into the data. Traditionally, most data scientists follow these steps in pre-processing (that's what Wikipedia says!):

  • Data Cleaning
  • Instance selection
  • Normalization
  • Transformation
  • Feature extraction

That's a lot of buzzwords! I thought it was just cleaning the data and making it look good. Let's take baby steps, start with the first one, and explore what data cleaning is in this blog.

Identify the 3 I's in the data and keep an eye on them!

Types of dirt that need to be scrubbed off the data

For instance, consider the table below. The survey answer to the question "How would you describe your life" is incomplete and inconsistent, so it conveys no meaning or useful inference. In the second column, "Age of your grandmother" has a response of 10 years, which is obviously inaccurate; this leads to a flawed analysis, since the mean of the age column is skewed by the wrong value. In the third column, the annual pay rate of an employee in a state contributes nothing to an analysis of diseases in those states, so such data can be ignored because it is irrelevant.

An instance of the 3 I’s lurking in the dataset

Let us consider an actual data set and identify the dirt settled on it.

Dataset Source : https://think.cs.vt.edu/corgis/csv/music/music.html

Loading the data

We shall first explore the attributes of the music data set by printing all the columns in the data frame.
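As a minimal sketch of this step (the file name music.csv is an assumption, based on downloading the CSV from the CORGIS page above):

```python
import pandas as pd

# File name is an assumption -- point it at the CSV downloaded from the CORGIS music page
music_df = pd.read_csv("music.csv")

# Print every attribute (column) present in the data frame
print(music_df.columns.tolist())

# A quick peek at the first few rows never hurts
print(music_df.head())
```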

Checking for Null values

Python (pandas, really) has this incredible power of spotting the presence or absence of NULL values in the data set; a quick sketch follows the figures below.

Printing the count of the null values
Null value distribution in the music dataset
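Here is a rough sketch of that check, assuming the music_df data frame from the loading step above:

```python
# Count the null values in each column, worst offenders first
null_counts = music_df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# The same information expressed as a percentage of all rows
null_percent = (music_df.isnull().mean() * 100).round(2)
print(null_percent.sort_values(ascending=False))
```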
1. Missing Values

As we can observe, the artist_mbtags attribute has the highest number of null values in the dataset compared to the other attributes. Around 63% of its values are NaN, which means more than half of the dataset is null-valued for that attribute. The best course of action here is to eliminate the attribute altogether.
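A one-line sketch of dropping it, assuming the artist_mbtags spelling matches the CSV header:

```python
# artist_mbtags is ~63% null, so dropping the whole column loses little information;
# errors="ignore" keeps this from failing if the column name differs or is already gone
music_df = music_df.drop(columns=["artist_mbtags"], errors="ignore")
```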

Attributes which contain null values

Let's take one of these attributes, dig deeper, and see what kind of values it contains.

Selection of a single attribute for checking null value
Statistics for the song hotness attribute

As we can see from the previous null value distribution table, the song hotness attribute has around 43% null values. This is less than half of the data set, so the missing values can be replaced with the mean, median, mode, or zeros (see the sketch after the plots below). For a better understanding of the data, it is always good practice to perform exploratory data analysis first.

Song hotness measure of each song (10,000 values)
Song hotness measure of songs (sampled)

As we can see in the second graph, the drilled-down, sampled view of the song hotness of each song clearly shows the presence of NULL values.
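Here is a rough sketch of that exploratory look followed by the imputation. The column name song.hotttnesss is only a guess at how "song hotness" appears in the CSV, so check music_df.columns before running it:

```python
import matplotlib.pyplot as plt

hotness_col = "song.hotttnesss"  # assumed column name -- adjust to the real header

# Plot a small sample of the raw values; missing entries simply do not get a marker
sample = music_df[hotness_col].head(200)
plt.plot(sample.index, sample.values, marker=".", linestyle="none")
plt.xlabel("Song index")
plt.ylabel("Song hotness")
plt.title("Song hotness (sampled, before imputation)")
plt.show()

# Roughly 43% of the column is null -- impute with the mean
# (the median, mode, or zero are equally simple alternatives)
music_df[hotness_col] = music_df[hotness_col].fillna(music_df[hotness_col].mean())
```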

2. Inaccurate values

Inaccurate text values are relatively easy to notice in an Excel sheet, and corrective actions can be implemented using Python scripting.

Inaccurate song name with special characters

NLP (Natural Language Processing) can be applied to any text-related data to make the analysis simpler, but text data like the one shown in the figure above can be handled with regular Python scripting.
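As a rough sketch, assuming the song names live in a column such as song.title (a hypothetical name here), plain Python with a regular expression is enough to strip the stray characters:

```python
import re

import pandas as pd


def clean_title(value):
    """Remove stray special characters from a song name."""
    if pd.isna(value):
        return value
    # Keep letters, digits, spaces and a few common punctuation marks
    cleaned = re.sub(r"[^A-Za-z0-9 '&().\-]", "", str(value))
    # Collapse any double spaces left behind by the removed characters
    return re.sub(r"\s+", " ", cleaned).strip()


# "song.title" is an assumed column name -- replace it with the real header
music_df["song.title"] = music_df["song.title"].apply(clean_title)
```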

3. Irrelevant attributes

The main idea behind using this data set is to understand and analyze the factors affecting the popularity of the music created by the artists. This scenario does not call for the release.id attribute, since the popularity of the music does not depend on release.id, nor can any conclusions be drawn from it.
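A final one-liner sketch, again assuming the music_df data frame from earlier and the release.id column name used above:

```python
# release.id tells us nothing about popularity, so drop it;
# errors="ignore" keeps this from failing if the column is already gone
music_df = music_df.drop(columns=["release.id"], errors="ignore")
```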

That's it for now, folks; more to come in my next article! I enjoyed writing my first article on Medium :) Hope you got something out of it too!

Thanks to Udacity Bertelsmann’s program and to women_of_code for giving me a chance to express my thoughts!

If you enjoyed the content, please leave a 👏
