Data Cleaning: The First Step in Working with a Data Set Professionally
Those who take an interest in data science, machine learning, and artificial intelligence are often keen to learn how algorithms are built, how models are trained, whether programming knowledge is required to become a data scientist, where to find valid data sets for projects, how to learn statistics, and so on. Given how fascinating these fields can be, it's natural for a curious mind to want all of this. Thousands of resources cover the different areas of machine learning, data science, and AI, and with the surge of attention these topics have received, many colleges and universities have created courses and programs around them in the last 10–15 years.
While the buzz is what it is, some of the basics, often considered boring or not challenging enough, are far less talked about. One of those steps happens to be what's called "data cleaning". Catering to curious minds and hoping to help young job seekers, many resources focus on topics related to the buzzwords: deep learning, regression models, reinforcement learning, big data analytics, natural language processing, and so on. While learning all this can be fun, and sometimes intimidating, we must not forget that when dealing with any kind of data, the first step after data collection is to clean it.
What happens when you don’t do data cleaning?
In the world of data science, there's a popular, commonly used phrase: "garbage in, garbage out". It means that if you feed your machine data containing false, irrelevant, or insufficient information, the machine will either fail when it encounters that information, leading to errors, or produce unreliable results.
Why is Data Cleaning important?
In an ideal, dream world, you'd get a data set that's just perfect for training your machine. In the real world, almost every data set you deal with will have missing or unexpected values and will need what's called data cleaning.
What’s Data Cleaning anyway?
You could consider this the first step in preparing your data for a machine learning project. Imagine a child sent to a school that doesn't care what its syllabus includes or excludes, as long as there's some content to be taught. How bad would that be for the child's education? That's how important data cleaning is. In fact, when the data cleaning is done to the best of your abilities, and presuming the other steps in the process are also carried out well, you can expect impressive results.
Data Cleaning is a compulsory part of Data Analysis and Training a Model.
Note that there's no one-size-fits-all method of data cleaning for all data sets and projects. Just as your data sets vary from project to project and sample to sample, so will your data cleaning needs.
Missing Values: Missing values are one of the most common problems you'll face while dealing with data sets. If you've worked on projects yourself, you've probably seen blank cells in CSV files. In such cases, you would either remove those observations or replace the missing values with the most applicable value for the column, such as the median.
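As a minimal sketch of both options (the column name and scores here are made up for illustration), here is how this looks with pandas:

```python
import pandas as pd

# Hypothetical feedback scores; None becomes NaN, i.e. a missing value.
df = pd.DataFrame({"score": [4.0, 5.0, None, 3.0]})

# Option 1: drop the observations with missing values entirely.
dropped = df.dropna()

# Option 2: impute the missing value with the column median (4.0 here).
filled = df.fillna({"score": df["score"].median()})
```

Dropping loses information but is safe when few rows are affected; imputing keeps the row at the cost of introducing an assumed value.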
Duplicate Observations: Another commonly encountered problem would be dealing with duplicates. Let’s say you’re dealing with a customers’ feedback data set. What’s the point of having one customer’s level of satisfaction stated 4 times, each entry being the same as the rest? Imagine you’re the class teacher and you’re checking students’ attendance for a workshop. What do you gain by having multiple entries of whether a student is present or absent? That’s what duplicate values are — they offer no real value, help the machine in no way, and just occupy space. These observations are better removed.
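A short sketch of the feedback example above, with invented customer IDs and scores:

```python
import pandas as pd

# Hypothetical feedback table: customer 1's answer appears three times.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "satisfaction": [5, 5, 5, 3],
})

# Keep only the first copy of each fully identical row.
deduped = df.drop_duplicates()
```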
Low-variation data: In the last example, imagine that there’s a column in the data set for students’ presence or absence in a workshop, and that there’s a column which specifies that every entry in the data set plays the role of “student”. This again, just occupies space; what’s the point of knowing that all attendees are students every time? This could rather be removed as well.
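One simple way to detect such columns is to count unique values; a sketch with a made-up attendance sheet:

```python
import pandas as pd

# Hypothetical attendance sheet; the "role" column never varies.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Cai"],
    "present": [1, 0, 1],
    "role": ["student", "student", "student"],
})

# Drop every column that holds only a single unique value.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
trimmed = df.drop(columns=constant_cols)
```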
Irrelevant Data: As the name implies, anything that’s not necessary or relevant to your subject or project can be considered “irrelevant data”. Consider this situation: you have a data set of employees’ satisfaction, and you’re asked to analyse women employees’ satisfaction for a project that hopes to make the workplace feel safer, more welcome, and satisfying for female employees as well. In this case, all the male employees’ answers can be removed from the data set for this project.
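For the employee-satisfaction scenario above, filtering out the irrelevant rows is a one-liner; the column names and values here are invented:

```python
import pandas as pd

# Hypothetical employee-satisfaction survey.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "satisfaction": [4, 5, 3, 2],
})

# For this particular project, only the women's answers are relevant.
women = df[df["gender"] == "F"]
```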
Incorrect data: One has to look through data sets carefully to avoid feeding the machine incorrect data. Say you've collected housing preferences from locals in a city. While most observations may look alike, what if you spot a few responses from people who don't actually live in that city? What if the ingredients and nutritional values of dairy products end up in a data set meant for vegan milk products? This is incorrect data, and if fed to the machine, it will clearly hurt the machine's accuracy.
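When the validity rule can be stated as a condition, such rows can be filtered out directly; a sketch of the housing-survey case, with made-up respondents and cities:

```python
import pandas as pd

# Hypothetical housing survey for one city; "city" records where the
# respondent actually lives.
df = pd.DataFrame({
    "respondent": ["a", "b", "c"],
    "city": ["Pune", "Pune", "Mumbai"],
})

# Keep only respondents who live in the surveyed city.
local = df[df["city"] == "Pune"]
```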
Quantifying Data: Generally speaking, machines are fed numerical data, which is easier to work with, so if you have columns with non-numerical data, convert them to numerical equivalents so the data becomes easier to analyse. In machine learning, this is done with encoding techniques such as one-hot or label encoding.
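A minimal one-hot encoding sketch (the "colour" column is a made-up example): each category becomes its own indicator column.

```python
import pandas as pd

# Hypothetical categorical column to be turned into numbers.
df = pd.DataFrame({"colour": ["red", "blue", "red"]})

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df, columns=["colour"])
```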
Dealing with Outliers: Imagine that you're dealing with a data set of bank customers' financial transactions. What if, on a rare occasion or two, a transaction involves a sudden, huge amount? Does that automatically mean fraudulent activity? What if a customer's account shows no activity for a very long time? Should those observations be ignored, or replaced by the median value? Would that be right? There is no universal answer: whether to keep, cap, or remove an outlier depends on the project, and in fraud detection, for instance, the outliers may be exactly the signal you care about.
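One common way to at least flag candidates is the 1.5 × IQR rule; a sketch with invented transaction amounts:

```python
import pandas as pd

# Hypothetical transaction amounts with one suspiciously large value.
amounts = pd.Series([120, 90, 110, 100, 95, 10_000])

# Flag values outside the common 1.5 * IQR fences.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]
```

Flagging is not the same as removing: what you do with a flagged row is a project decision, not a mechanical one.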
These are all common problems you'll see when working with real-world data sets of different kinds. It is important to learn data cleaning methods before you start working professionally. Problematic observations are easy to overlook, which is why we need to pay attention and take the time to clean our data; otherwise, the data we feed the machine will be unreliable, hurting the model's accuracy and wasting time, effort, and resources.
Keep in mind, “garbage in, garbage out”! :)