The Imperative of Data Cleansing

Better data beats fancier algorithms

Wael Samuel
Analytics Vidhya
12 min read · May 27, 2020



Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. [Wikipedia]

Why Data Cleansing/Cleaning?

If you talk to a Data Scientist or Data Analyst with lots of experience building machine learning models, they will tell you that preparing data takes a very long time and is very important.

Machine learning models are designed to take in huge amounts of data and find patterns in that data so they can make intelligent decisions.

Let's assume you are building an ML model to classify images of apples and oranges. If all of your input data consists of orange images only, the model will not be able to recognize apples because it does not have enough data to learn and define patterns for them.

This example tells us: "Garbage in, garbage out."


If the data fed into an ML model is of poor quality, the model will be of poor quality.

Problems with Data

The most common data problems are: insufficient data, too much data, non-representative data, duplicate data, missing data, and outliers.

How to tackle each one of the above problems?


Insufficient data

[Sherlock Holmes:] I had come to an entirely erroneous conclusion which shows, my dear Watson, how dangerous it always is to reason from insufficient data.
- The Adventure of the Speckled Band


Models trained with insufficient data perform poorly in prediction. If you have just a few records for your ML model, you will run into one of the two well-known issues in ML modeling:

Overfitting: reading too much into too little data.

Underfitting: building an overly simplistic model from the available data.

In the real world, insufficient data is a common struggle for a project. The relevant data may not be available, and even when it is, the actual process of collecting it can be very difficult and time-consuming.

The truth is that there is no great solution for insufficient data; you simply need to find more data sources and wait until you have collected enough relevant data.

Still, there are things you can do to work around this problem, though note that the techniques we will discuss are not applicable to every use case.

So, what can we do if we have a small dataset?

Model Complexity: if you have little data, you can choose to work with a simpler model; simpler models work better with less data.

Transfer Learning: if you are working with neural networks (deep learning), you can use transfer learning.

Data Augmentation: you can increase the amount of data by using data augmentation techniques, which are usually applied to image data.

Synthetic Data: understand the kind of data you need to build your model, then use the statistical properties of that data to generate synthetic (artificial) data.

Model Complexity

Every machine learning algorithm has its own set of parameters; compare, for example, simple linear regression with decision tree regression.

If you have less data, choose a simpler model with fewer parameters. A simpler model is less susceptible to overfitting your data and memorizing its patterns.

Some machine learning models, such as the Naïve Bayes classifier or logistic regression, are simple, with few parameters. Decision trees have many more parameters and are considered complex models.

Another option is to train your data using ensemble techniques.

Ensemble Learning: a machine learning technique in which several learners are combined to obtain better performance than any of the individual learners.

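As a quick illustration, here is a minimal sketch of an ensemble using scikit-learn's VotingClassifier; the library, dataset, and choice of learners are my own assumptions, not prescribed by the article.

```python
# Combine three simple learners; the ensemble votes on each prediction.
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3)),
])

# The combined model is evaluated the same way as any single learner.
print(cross_val_score(ensemble, X, y, cv=5).mean())
```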

Transfer Learning

If you are working with neural networks and you don't have enough data to train your model, transfer learning may solve the problem.

Transfer Learning: the practice of re-using a trained neural network that solves a problem similar to yours, usually leaving the network architecture unchanged and re-using some or all of the model weights.


Transferred knowledge is especially useful when the new dataset is small and not sufficient to train a model from scratch.
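Here is a minimal sketch of transfer learning assuming PyTorch/torchvision (the article does not prescribe a framework): reuse a ResNet pretrained on ImageNet, freeze its weights, and retrain only a new final layer on your small dataset.

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet
# (older torchvision versions use pretrained=True instead of weights=...).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained weights so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one sized for your task,
# e.g. 2 classes (apples vs. oranges).
model.fc = nn.Linear(model.fc.in_features, 2)
# Now train only model.fc on your small dataset as usual.
```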

Data Augmentation


Data augmentation techniques allow you to increase the number of training samples. They are typically used with image data: you take the images you are working with and perturb them in some way to generate new images.

You can perturb these images by applying scaling, rotation, and affine transforms. These image processing operations are often used as preprocessing steps to make image classification models built with CNNs (convolutional neural networks) more robust, and they can also be used to generate additional samples to work with.
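As a sketch, assuming torchvision (one of several libraries offering this), an augmentation pipeline that applies rotation, shifts, scaling, and flips might look like this:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=20),                      # small random rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # random shifts
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # random scaling/cropping
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# Applying `augment` to each training image yields a slightly different
# sample every epoch, effectively enlarging the training set.
```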

Synthetic Data

Synthetic data comes with its own set of problems. Essentially, you artificially generate samples that mimic real-world data, so you need to understand the characteristics of the data you need.

You can oversample existing data points to generate new data points, or you can use other techniques to generate artificial data, but doing so can introduce bias into your data.
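One widely used oversampling technique is SMOTE from the imbalanced-learn library (my choice of tool; the article does not name one), which interpolates between existing minority-class points to create new ones. A minimal sketch:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# An artificial, imbalanced dataset (90% / 10%) standing in for real data.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE generates synthetic minority samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```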

Too Much Data


It might seem strange that too much data can be a problem, but what is the use of data if it is not the right data? Data can be excessive in two ways:

1- Outdated Historical Data: Too many rows.

Working with historical data is important, but if you have too much historical data that is no longer significant, you might run into something called 'concept drift'.

Concept Drift: the relationship between features (X variables) and labels (Y variables) changes over time; ML models fail to keep up, and consequently their performance suffers.

Concept drift means that the ML model continues to reflect a state of the world that is outdated and no longer significant or relevant.

So, if you are working with historical data, take the following into consideration:

  • If outdated data is not eliminated, it leads to concept drift.
  • Outdated historical data is a serious issue especially when you are working with ML models on financial data, particularly if you are modeling the stock market.
  • Deciding which rows to leave out usually requires human experts.

2- Curse of dimensionality: Too many Columns.

Every sample you use to train your ML model might have too many columns, that is, too many features. In its simplest form, the curse of dimensionality means you might end up using irrelevant features that don't really help your model improve.

The Curse of dimensionality is a huge topic that has been studied in detail by data scientists.

Two specific problems arise when too much data is available:

  • Deciding which data is actually relevant.
  • Aggregating very low-level data into useful features.

Historical data is a fairly hard problem to deal with, but the curse of dimensionality is easier to solve.

How?

You can use Feature Selection to decide which data is relevant.

You can use Feature Engineering to aggregate low-level data into useful features.

You can perform Dimensionality Reduction to reduce the complexity without losing much information.
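A minimal sketch of the first and third remedies (feature selection and dimensionality reduction), assuming scikit-learn and one of its built-in datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)   # 30 columns

# Feature selection: keep the 10 features most associated with the label.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Dimensionality reduction: project onto components explaining 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)

print(X.shape, X_selected.shape, X_reduced.shape)
```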

Non-Representative Data


There are several manifestations of non-representative data. One is feeding the wrong features to your model, but there are others as well. It is possible that the data you collected is inaccurate in some way, and even small errors can significantly impact your model.

Another manifestation of non-representative data is biased data. For example, suppose you are collecting data from 5 sensors in 5 different locations and one of those sensors does not work all the time. Your data is biased because you don't have proportional data from one of the sensors. Working with biased data leads to biased models that perform poorly in practice.

You can mitigate this by oversampling and undersampling. If you have less data from one of the sensors, you can oversample from the data that you do have, so that you end up with a representative sample.
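A minimal sketch of oversampling an under-represented sensor with scikit-learn's resample helper; the sensor DataFrame here is a made-up example.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "sensor": ["A"] * 100 + ["B"] * 20,   # sensor B reported far fewer rows
    "reading": range(120),
})

majority = df[df["sensor"] == "A"]
minority = df[df["sensor"] == "B"]

# Sample sensor B's rows with replacement until the groups are proportional.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["sensor"].value_counts())
```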

Duplicate Data


If you are collecting data, there might be duplicates. If the data can be flagged as duplicate, the problem is relatively easy to solve: simply de-duplicate the data before you feed it to your model.

But the world isn't that simple; duplicates can be hard to identify in some applications, such as real-time streaming. In those cases you may just have to live with them and account for them.
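When duplicates can be flagged, de-duplication is a one-liner in pandas (assuming your data is loaded into a DataFrame; the example rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": [10, 20, 20, 30],
})

# Drop rows that are exact copies of an earlier row.
deduped = df.drop_duplicates()

# Or treat rows with the same key as duplicates, keeping the first occurrence.
deduped_by_id = df.drop_duplicates(subset=["id"], keep="first")
```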

Missing Data


You will often be working with data that has values missing from the records you use to train your model. There are 2 approaches you can follow to deal with this:

1- Deletion

2- Imputation

Deletion, a.k.a. Listwise Deletion
Delete entire records (rows) if a single value (column) is missing.

This is a simple, hassle-free technique for dealing with missing values, but it can lead to bias because you might delete very relevant data.

Listwise deletion is often the most common approach in practice because it is easy, but it can lead to several problems. It can significantly reduce the sample size you are working with; if you don't have many records and you apply this approach, you might end up with insufficient data to train your ML model.
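Part of why it is so common is that, with pandas (an assumed setup; the column values are made up), listwise deletion is a one-liner:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "temperature": [21.0, np.nan, 19.5, 22.1],
    "humidity":    [0.40, 0.55, np.nan, 0.48],
})

# Drop any row that contains at least one missing value.
complete_rows = df.dropna()
print(len(df), "->", len(complete_rows))   # 4 -> 2
```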

Another thing to worry about: if values are not missing at random, removing them can introduce significant bias. For example, if you are collecting data from sensors and one sensor has missing values in a particular field, and you drop all records from that sensor, that can lead to significant bias.

So it is pretty clear that simply dropping entire records that have a few missing fields is not a great option, which is why we move on to imputation.

Imputation
Fill in missing column values rather than deleting records with missing values. Missing values are inferred from known data.

Once you have decided to use imputation to fill in missing values, there are a number of methods you can choose from, ranging from very simple to very complex.

The simplest possible method is to use the column average: you assume the missing values are essentially equal to the mean value of that column, or feature. Very similar options are to use the median or the mode of that column.
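A minimal sketch of these simple strategies, assuming scikit-learn's SimpleImputer and a made-up column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

mean_filled   = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled   = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_filled.ravel())   # the NaN becomes 2.33...
```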

Another way to impute missing values is to interpolate from other nearby values. This technique is useful when your records are arranged in some kind of inherent order.

Imputation to fill in missing values can also be very complex; in fact, you can build an entire ML model to predict the missing values in your data.

You might want to perform imputation in either a univariate or a multivariate way.


Univariate Imputation: rely on known values in the same feature or column.

Multivariate Imputation: use all known data to infer the missing data.

For example, for multivariate imputation, you might build regression models from the other columns to predict a particular column, and then repeat this for every column that contains missing values.
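scikit-learn ships an implementation of this idea, IterativeImputer (my tool choice; it is still flagged as experimental). A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable it)
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],   # the second column is roughly 2x the first, so expect ~6
    [4.0, 8.0],
])

# Each column with missing values is regressed on the others, iteratively.
print(IterativeImputer(random_state=0).fit_transform(X))
```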

There are other techniques that you can apply to fill missing values as well.

Hot-deck Imputation

You sort all the records you have based on some criteria; then, for each missing value, you use the immediately prior available value.

Filling missing values with the previous value assumes your records have been ordered. This is especially useful for filling in time series data, where progression in time has a useful meaning.

For time series, this is equivalent to assuming that nothing has changed since the last measurement.
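A minimal sketch of this forward-fill style of hot-deck imputation with pandas, on made-up time series data:

```python
import pandas as pd
import numpy as np

ts = pd.DataFrame({
    "time":  pd.date_range("2020-01-01", periods=4, freq="D"),
    "value": [10.0, np.nan, np.nan, 13.0],
}).sort_values("time")

# Carry the last known measurement forward into the missing days.
ts["value"] = ts["value"].ffill()   # the two NaNs become 10.0
```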

Another technique that is often used is mean substitution.

Mean Substitution

For each missing value, substitute the mean of all available values. Mean substitution has the effect of weakening the correlations between columns that exist in your data.
When you essentially say "this is an average data point, there is nothing special about it," you weaken correlations, and this can be problematic when you perform bivariate analysis, that is, analysis to determine the relationship between 2 columns.

If you want to intelligently predict the missing values in your data, you might want to use machine learning.

Regression

Fit a model to predict a missing column based on the other column values. Applying this technique tends to strengthen the correlations that exist in your data, because you are essentially saying that this column depends on the other columns.

Regression and mean substitution have complementary strengths when used to fill missing values.

Outliers


Outlier: a data point that differs significantly from other data points in the same dataset.

If you visualize your data, you might spot a few points that lie far away from the rest.


It might be that an entire record is an outlier in some way, or that only certain fields have outlier values.

Dealing with outliers is a 2-step process: the first step is identifying the outliers that exist in your data, and the second is using techniques to cope with them.

There are specific ML algorithms built for outlier detection, but at a very basic level you can identify outliers by looking at the distance of a data point from the mean of your data, or its distance from a line fitted to your data.

Once you identify outliers, you can cope with them using three techniques: drop records with outlier data, cap/floor the outliers, or set outliers to the mean value of that feature.


Identifying Outliers: Distance from the mean

If you have a data point whose value is far from the mean, it can be considered an outlier. Alternatively, you can perform regression analysis to find a line or curve that follows the pattern in your data; a point that is far from this fitted line is an outlier.

When you want to quickly summarize any set of data you are working with, the first measure you will compute is the mean of that data.

The mean or average is the one number that best represents all of these data points.

Mean or Average of any set of data is the sum of all of the numbers divided by the count of the numbers.

mean = (x₁ + x₂ + … + xₙ) / n

However, along with the mean, the variation that exists in your data is also important.

Variation: a measure of how much the numbers in your dataset jump around.

One important measure of the variation in your dataset is the range.
The range completely ignores the mean and is swayed by any outliers present in your dataset; that's where variance comes in.

Variance is the second most important number to summarize any set of points you are working with.

Variance: the average of the squared differences from the mean, i.e. variance = ((x₁ − mean)² + … + (xₙ − mean)²) / n

Mean and variance succinctly summarize any set of numbers.

Along with variance, another term you might encounter is the standard deviation.

Standard Deviation: the square root of the variance; it is a measure of the variation in your data.

The standard deviation helps you express how far a particular data point lies from the mean.
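Putting these pieces together, here is a minimal sketch (with made-up numbers) that flags any point lying more than 2 standard deviations from the mean:

```python
import numpy as np

values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.0])   # 25.0 looks suspicious

mean = values.mean()
std = values.std()
z_scores = np.abs(values - mean) / std   # distance from the mean, in standard deviations

outliers = values[z_scores > 2]
print(outliers)   # -> [25.]
```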

Identifying Outliers: Distance from the fitted line


Outliers might also be data points that do not fit into the same relationship as the rest of the data.
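A minimal sketch of this idea (on made-up data): fit a straight line with numpy and flag points whose residual, i.e. distance from the fitted line, is unusually large.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.0, 30.0, 12.1, 14.0, 15.9, 18.1, 20.0])  # x=5 breaks the y ≈ 2x pattern

# Fit a line and measure how far each point is from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

outliers = x[np.abs(residuals) > 2 * residuals.std()]
print(outliers)   # -> [5.]
```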

Coping with Outliers

Once you identify the outliers, you need to figure out how to deal with them: how do you want to cope with them?

Always start by scrutinizing the outliers that exist in your data.

If an outlier is an erroneous observation:

  • Drop the entire record if all attributes of that point are erroneous.
  • If a row or record has one attribute that you believe is erroneous, set it to the mean.

It is quite possible that your outlier is not an incorrect observation at all, but a genuine, legitimate outlier. You have 2 approaches to deal with this kind of outlier:

  • Leave it as is if the model is not distorted.
  • Cap/floor it if the model is distorted, but first you need to standardize your data (see the sketch below).
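A minimal sketch of capping/flooring with pandas, clipping values to the 1st and 99th percentiles; the data and thresholds are illustrative choices.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(50, 5, 999), 500.0))   # one extreme value

# Cap at the 99th percentile and floor at the 1st percentile.
low, high = s.quantile(0.01), s.quantile(0.99)
capped = s.clip(lower=low, upper=high)   # 500.0 is pulled down to the 99th percentile

print(s.max(), "->", capped.max())
```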

Final Words…

Since you need to use data for making critical business decisions, it is safe to say that cleansing and enriching your data periodically is a must.

Thank you for reading!
I hope you enjoyed my article about data cleaning.

