When Quality > Quantity

Athira Lonappan · Published in Fields Data · Mar 17, 2022 · 6 min read

You are looking forward to baking a cake this weekend. All the ingredients, including flour, sugar, eggs and baking soda, are set. You bake the favourite chocolate cake you have been craving. You take the cake out of the oven and oops! The cake hasn't risen much. What went wrong? You go over the process again only to realise that the baking soda had expired. Hours of patience resulted in a flat, dense cake. Isn't this relatable to many of us? It could have been avoided. All we had to do was check the quality of the baking soda before using it. One simple step could have saved hours.

Similarly, when it comes to analysing data, it is very important that we check the quality of the data before we dive into finding patterns. This process is called Data Preprocessing.

Data Preprocessing

Data Preprocessing is the process of transforming collected raw data into clean, understandable data in order to improve the results of analysis. The data collection process often leads to poorly handled data, resulting in missing values and out-of-range values. Analysing such data can lead to false results, which makes this an important step.

Data preprocessing can be broken down into 4 major tasks:

  1. Data Cleaning — Process of removing or correcting inaccurate and incomplete data in a dataset.
  2. Data Integration — Process of combining multiple datasets into a single dataset.
  3. Data Reduction — Process of reducing the volume of data while preserving its information.
  4. Data Transformation — Process of changing the structure of the data for better analysis.

In this blog, we will focus mainly on the data cleaning process.

Data Cleaning Techniques

There are a few techniques that can be used to clean the raw data, which are as follows:

  1. Handling Missing Values
  • The traditional approach of filling in the missing values manually is not recommended, since it becomes tedious in a big dataset.
  • In huge datasets, the missing values could be replaced with 'NA' or the affected records dropped. In small datasets, however, every single record has an impact on the analysis, so dropping records is not ideal.
  • Apart from these solutions, a statistical approach to filling in the missing values is a good way of handling them, as the examples below show.

Mean: Filling in the missing values with the mean of all the values in that specific column is one solution. This is only advisable when the data is roughly normally distributed.

Example:

Consider a dataset of the salaries of all the employees in the sales department of an organisation.

Suppose the values are spread within a range of $2,000 to $4,000. All records have similar values and there are no extreme values or outliers, meaning there are no values that lie far outside the rest. In such a scenario, we calculate the mean of all the values to fill in the missing salary value.
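As a minimal sketch (the salary figures below are made up for illustration), mean imputation with pandas could look like this:

```python
import pandas as pd

# Hypothetical salaries in USD; None marks the missing record
df = pd.DataFrame({"salary": [2000, 2500, 3000, None, 3500, 4000]})

# Replace the missing value with the column mean (3000.0 here)
df["salary"] = df["salary"].fillna(df["salary"].mean())
```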

Median: In situations where the records are not within a narrow range or there exist outliers, the median of the feature column could be used to fill in the missing value.

Example:

Continuing with the previous example, suppose the salary of the Department Head ($5,500) is also added as a record to the existing dataset. The mean will then be pulled comparatively higher for obvious reasons. Hence, we calculate the median salary instead of the mean; the median is the middle value in the sorted list of data.
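A minimal sketch of median imputation, extending the hypothetical salaries above with the Department Head's record:

```python
import pandas as pd

# Same salaries plus the Department Head's $5,500, which pulls the mean up
df = pd.DataFrame({"salary": [2000, 2500, 3000, None, 3500, 4000, 5500]})

# The median (3250.0 here) resists the outlier better than the mean (~3416.7)
df["salary"] = df["salary"].fillna(df["salary"].median())
```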

Mode: Replacing missing values with the mode can be done for both numerical and categorical data. Here, we replace the missing data with the most frequent value in the entire column.

Example:

Numerical data — In the above example of non-normally distributed data, the mode can be used as an alternative to the median, where the most frequent salary value replaces the missing value.

Categorical data — In a dataset of activity types, to fill in missing values we find the mode, i.e. the most common activity type, which in this case is 'Humanitarian'.
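A minimal sketch of mode imputation for categorical data (the activity types below are made up):

```python
import pandas as pd

# Hypothetical activity types; None marks a missing record
df = pd.DataFrame(
    {"activity": ["Humanitarian", "Education", None, "Humanitarian", "Health"]}
)

# mode() returns the most frequent value(s); take the first one
df["activity"] = df["activity"].fillna(df["activity"].mode()[0])
```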

2. Getting rid of noisy data

Just as you wish broccoli didn't ruin your pizza, you do not wish to have additional, meaningless information in your data. The following techniques can be used to remove noise from your data.

Binning: The binning method smooths data by consulting the neighbours of each value. Data smoothing helps to find patterns and predict trends by dampening the effect of outliers. In binning, the sorted values are first distributed into a number of buckets, or bins, of equal frequency. Then one of the following three approaches is used to perform the smoothing.

  1. Smoothing by bin means: Each value in a bin is replaced with the mean value of that bin.
  2. Smoothing by bin medians: Each value in a bin is replaced with the median value of that bin.
  3. Smoothing by bin boundaries: Each value in a bin is replaced with whichever of the bin's minimum or maximum values it is closest to.

Example:
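A minimal sketch of equal-frequency binning with smoothing by bin means (the numbers are made up for illustration):

```python
import numpy as np

# Sorted, hypothetical values split into 3 equal-frequency bins
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26])
bins = data.reshape(3, 3)  # [4, 8, 9], [15, 21, 21], [24, 25, 26]

# Smoothing by bin means: every value becomes its bin's mean
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)  # [ 7.  7.  7. 19. 19. 19. 25. 25. 25.]
```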

Regression: Regression is a technique that takes into consideration the variables used to predict an output and finds a mathematical relation between them. Values that deviate strongly from the fitted relation can then be smoothed towards it or flagged as noise.

a. Linear Regression revolves around finding the best straight line that fits two attributes.

It is represented by the equation: y = mx + c

where,

y: dependent variable

x: independent variable

m: slope of the line

c: intercept

Example: One of the features the price of a house depends on is its area. Hence, Area is the independent variable (x) and Price is the dependent variable (y).
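A minimal sketch of fitting y = mx + c with NumPy (the areas and prices below are made up):

```python
import numpy as np

# Hypothetical house areas (sq ft) and prices (USD)
area = np.array([1000, 1500, 2000, 2500])
price = np.array([200000, 290000, 410000, 500000])

# Least-squares fit; np.polyfit returns [m, c] for degree 1
m, c = np.polyfit(area, price, deg=1)
print(f"price ≈ {m:.0f} * area + {c:.0f}")
```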

b. Multiple regression, which is an extension of linear regression, involves more than two attributes.

It is represented by the equation: y = m1x1 + m2x2 + m3x3 + … + c

where,

y: dependent variable

x1, x2, x3: independent variables

Example: Predicting the price of a house (y) given the number of bedrooms (x1), bathrooms (x2) and parking spots (x3).
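A minimal sketch using scikit-learn's LinearRegression (the rows below are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [bedrooms, bathrooms, parking spots] -> price (USD)
X = np.array([[2, 1, 1], [3, 2, 1], [3, 2, 2], [4, 3, 2]])
y = np.array([250000, 340000, 370000, 460000])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # the m1..m3 coefficients and c
print(model.predict([[3, 2, 1]]))     # predicted price for a new house
```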

Clustering: Clustering is the technique of dividing the data points into groups, assigning similar values to the same cluster. It is closely tied to outlier analysis: values that fall outside every cluster can be treated as outliers, i.e. noise.

Example: Clustering organisations based on the similarity of the sectors they work in.
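A minimal sketch using k-means from scikit-learn (the 2-D feature vectors are made up; the last point should end up relatively far from its cluster centre, flagging it as a candidate outlier):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D feature vectors describing organisations
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 0.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each organisation

# Distance of each point from its cluster centre; large = candidate outlier
dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
print(dists)
```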

Thus, just as a burger is incomplete without cheese, so is data with missing values. Handling those missing values is important for a true analysis. Similarly, a dataset with noisy data is like a pizza with your least favourite toppings: getting rid of the noise helps uncover insightful patterns and trends for further analysis and better decision making.

So next time, make sure to do a quality check on your data before jumping to conclusions.

Athira Lonappan

Penning down my learnings as I start out as a Data Engineer