Get rid of the dirt from your data — Data Cleaning techniques

Caston Fernandes
7 min read · May 11, 2018


Raw data refers to data extracted from the Internet by web scraping or collected through channels such as Google Forms, surveys or similar data gathering methods. Why do we refer to this data as “dirty data”? Because this data is at times a) incomplete, b) noisy or c) inconsistent.

When we say incomplete, we mean missing data. For example, you might fill out a survey but skip personal information such as your age.

When we say noisy, we mean meaningless data: out-of-range values such as a person entering -679 in the salary field, or some negative four-digit random number in the age field. Impossible data combinations, like Gender: Male, Pregnant: Yes, also add to the noise in the data. The term noisy data is often used interchangeably with corrupt data.

Lastly, inconsistent data is data that fails to match. Let’s say the user entered a birthday of May 07, 1993, yet the age attribute displays 50. Or, over time, the ratings of a movie have changed from the numeric ratings 1, 2, 3 to the letters A, B, C. Thus, data in the same column is not consistent.

The following example should make this clearer.

Figure: Dirty Data
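Since the figure itself is not reproduced here, below is a small hypothetical reconstruction in pandas of the kind of table it shows, built only from values this article itself mentions (a missing Student ID, a suspicious 5.0 GPA, out-of-range ages such as 200 and -22, and a stray Graduate row). The specific IDs and rows are made up purely for illustration.

```python
import pandas as pd

# Toy "dirty" student data, mirroring the issues described in the text.
df = pd.DataFrame({
    "StudentID":      ["2016-0142", None, "2017-0388", "2016-0071", "2015-0904"],  # hypothetical year-prefixed IDs
    "Age":            [21, 200, 19, -22, 27],       # 200 and -22 are out-of-range noise
    "GPA":            [3.5, 2.8, 5.0, None, 3.1],   # 5.0 is suspicious, one value is missing
    "Classification": ["Sophomore", "Sophomore", "Freshman", "Sophomore", "Graduate"],
})

print(df)
print(df.isna().sum())   # quick count of missing values per column
```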

Cause of this dirt: Dirty data is produced by human errors (typos, for instance), form design issues (say, the missing ID field above was not mandatory) and many other factors. In addition, data gathering methods are often loosely controlled, resulting in missing and out-of-range values. Dirty data is inevitable.

Data Pre-processing: Thus, in order to make the best use of the data you have, it is very important to get rid of the dirt, and this is what data pre-processing does. Incorrect data always hampers analysis and deteriorates its quality. Performing data mining on incorrect data may give us misleading results.

Therefore, in this article the focus is on the first stage of data pre-processing, which is Data Cleaning.

The initial stage of handling missing values and noisy data, which is what Data Cleaning actually is, is fairly simple and easy for a beginner who needs to clean a dataset.

When we say “get rid of the dirty data”, we do not mean you should go ahead and just delete that data or that row, because by eliminating a cell we lose the other data present in that row too. Instead, we treat the dirty data using some of the techniques explained below in order to make it clean and add more value to the dataset. If you fail to process this dirty data, then Garbage In, Garbage Out is what you will mostly get. Data pre-processing is vital and occupies forty to seventy percent of the entire workflow. It is definitely the most important stage.

Please Note: Before you move ahead, it is very IMPORTANT to UNDERSTAND THE COMPLETE ENVIRONMENT of the DATA. In this case, you should know how the numbers in the Student ID make sense. The moment you see a 5.0 GPA, you should question whether the GPA is calculated out of 4 or out of 10. It should strike you that this is university data and the Classification column might include graduate students as well as freshmen. You will only know this by scrolling through the data, and even if just one cell out of thousands of rows contains Graduate, you need to account for it.

Try visualizing the data before dealing with the missing values. Visualizations often reveal correlations between attributes, for example between age and classification: the higher the age, the more likely the student is a graduate doing a master’s or a PhD. It is also through data visualization that you encounter noisy data. After plotting the above data (say, Age versus GPA), you will see ages ranging between 200 and -22 in the plot, which you might miss in a manual check. I may in future write on Data Visualization using Tableau. Always question the data for its correctness: feeding incorrect data to your models is risky and may in turn lead to incorrect business decisions based on incorrect model results.
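As a quick illustration of the plotting point above, here is a minimal matplotlib sketch using the toy data from the earlier snippet (the column names and values are assumptions, not the article’s actual dataset). The out-of-range ages immediately stand out as isolated points.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Re-create the relevant slice of the toy data so this snippet runs on its own.
df = pd.DataFrame({
    "Age": [21, 200, 19, -22, 27],
    "GPA": [3.5, 2.8, 5.0, None, 3.1],
})

df.plot.scatter(x="Age", y="GPA")
plt.title("Age vs GPA: out-of-range ages show up as isolated points")
plt.show()

# A plain range check catches the same noise without a plot.
print(df[(df["Age"] < 15) | (df["Age"] > 80)])
```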

Ways to handle missing data:

1. Data Deletion: This is the last thing you would do, and only if there is no other option available to handle the missing data. Last, because if you delete a row with any missing values you lose the other data present in that row too. Delete data if and only if your dataset is large and dropping a few rows does not hurt its size; you should not lose more than seven to ten percent of your data. Secondly, while deleting, keep an eye on the other attributes of that row too. Cross-check that, while deleting all missing GPAs, you are not getting rid of all the sophomores in your dataset. In other words, deletion of data should not result in the loss of a class of data. (A short pandas sketch after this list shows deletion alongside the imputation options below.)

2. Imputation: Imputation refers to substituting the missing values through guessing. Again, this is the second-last thing you should do.

a. Hot Deck or Educated Guessing: This technique refers to guessing the missing value manually by carefully studying the other values in the dataset. For example, if a GPA is missing in the above dataset, the classification is Sophomore, and all other sophomores have a 3.5 GPA, then you may go ahead and impute 3.5 in place of the missing value. In some cases you might just select a random sophomore and impute his GPA in place of the missing one (Random Hot Deck), or you may combine two or more other attributes, picking sophomores whose age and ID (carefully study the ID, since it has the year in it) are close to those of the student whose GPA is missing. This is deterministic hot deck, also referred to as Nearest Neighbor at times.

b. Cold Deck: If you happen to have some other dataset containing the same student, you may try grabbing the missing GPA from that dataset.

c. Average: If the other sophomores have GPAs in a specific range (say, between 2.7 and 4.0), then you may impute the missing GPA as the average of those. This is not always recommended, since it can artificially reduce the variance in the data, but it makes sense in some cases.

d. Mean: You might impute the mean of all other GPAs to substitute the missing GPA.

e. Mode: You might replace the missing GPA with the most frequently occurring GPA in the GPA column. It is again your choice whether to consider the most frequent value among only the sophomores or across the entire dataset.

f. Median: You might consider sorting the other GPAs and selecting the central one.

g. Regression: You may try regressing GPA on the other attributes and replacing the missing value with the predicted outcome.

h. Expectation Maximization: This algorithm determines the maximum likelihood estimates for model parameters when the data has missing points. It is an iterative approach to approximating the maximum likelihood function.

i. Constant: You may replace the missing values of a column with a constant such as “Unknown” or “∞”.
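Here is a minimal pandas/scikit-learn sketch of several of the options above: deletion, mean, median and mode imputation, a hot-deck-style group average, a constant fill, and regression imputation. The column names and values are toy assumptions carried over from the earlier snippets, and the specific choices (for example, LinearRegression on Age for item g) are just one possible way to realize each technique.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: two GPAs are missing.
df = pd.DataFrame({
    "Age":            [19, 20, 20, 21, 21, 22],
    "GPA":            [3.2, None, 3.6, 3.4, None, 3.9],
    "Classification": ["Freshman", "Sophomore", "Sophomore", "Sophomore", "Sophomore", "Senior"],
})

# 1. Deletion: drop any row with a missing value (use sparingly).
dropped = df.dropna()

# d/e/f. Mean, mode and median imputation over the whole GPA column.
mean_filled   = df["GPA"].fillna(df["GPA"].mean())
mode_filled   = df["GPA"].fillna(df["GPA"].mode()[0])
median_filled = df["GPA"].fillna(df["GPA"].median())

# a. Hot-deck-style guess: fill each missing GPA with the average GPA of
#    students in the same Classification (sophomores, in this toy data).
group_filled = df["GPA"].fillna(df.groupby("Classification")["GPA"].transform("mean"))

# i. Constant fill, e.g. a sentinel value the rest of the pipeline knows about.
constant_filled = df["GPA"].fillna(-1)

# g. Regression imputation: predict the missing GPAs from Age.
known, missing = df[df["GPA"].notna()], df[df["GPA"].isna()]
model = LinearRegression().fit(known[["Age"]], known["GPA"])
regression_filled = df["GPA"].copy()
regression_filled.loc[missing.index] = model.predict(missing[["Age"]])

print(regression_filled)
```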

Ways to handle noisy data:

The process of removing noise from a data set is termed data smoothing. The following methods can be used for smoothing:

1. Binning: Binning is a technique where we sort the data and then partition it into equal-frequency bins. You may then replace the noisy data with the bin mean, bin median or the bin boundary. (A simple sketch of data binning follows this list.)

2. Regression: To perform regression, your data must first be numeric. If a column is categorical, you may encode it; for example, encode the Classification column as 1 for Sophomore, 2 for Junior and 3 for Senior. Linear regression refers to finding the best line to fit two variables so that one can be used to predict the other. Multiple linear regression involves more than two variables. Using regression to find a mathematical equation that fits the data helps smooth out the noise.

3. Outlier Analysis: Outliers may be detected by clustering, where similar or close values are organized into the same groups or clusters. Values that fall far apart from any cluster may be considered noise or outliers. Here is a brief overview of this technique.
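To make binning (item 1) and clustering-based outlier analysis (item 3) concrete, here is a short sketch. The GPA values are invented for illustration, and DBSCAN is just one choice of clustering algorithm for spotting values that belong to no group of close neighbours.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

gpa = pd.Series([2.1, 2.4, 2.5, 2.8, 3.0, 3.1, 3.3, 3.6, 3.8, 9.9])  # 9.9 is noise

# 1. Binning: partition into equal-frequency bins, then replace each value
#    with its bin mean to smooth out the noise.
bins = pd.qcut(gpa, q=5)                        # 5 equal-frequency bins
smoothed = gpa.groupby(bins).transform("mean")  # bin-mean smoothing
print(smoothed)

# 3. Outlier analysis via clustering: points DBSCAN assigns to no cluster
#    (label -1) lie far from every group of close values.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(gpa.to_numpy().reshape(-1, 1))
print(gpa[labels == -1])   # the 9.9 entry is flagged
```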

Additionally, an exhaustive list of smoothing algorithms can be found here.

Data cleaning is an important stage. After all, your results are based on your data. The more the dirt, the more inaccurate your results will be.

Data Cleaning, that is, eliminating noise and missing values, is just the first of the many steps of data pre-processing. In addition to the above, data pre-processing includes Aggregation, Feature Construction, Normalization, Discretization and Concept Hierarchy Generation, which mostly deal with making the data consistent. Data pre-processing, at times, comprises up to 90% of the entire process.
