The Imperative of Data Cleansing — Part 1

The importance of data cleaning, the common data problems, and how to tackle them.

Wael Samuel
The Startup

--

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. [Wikipedia]

Why Data Cleansing/Cleaning?

If you talk to a data scientist or data analyst with a lot of experience building machine learning models, they will tell you that preparing data takes a very long time and is very important.

Machine learning models are designed to take in huge amounts of data and find patterns in that data so that they can make intelligent decisions.

Let’s assume you are building an ML model to classify images of apples and oranges. If all the data you feed it contains only orange images, the model will not be able to predict apples, because it does not have enough data to learn and define patterns for apples.

This example tells us: “Garbage in, garbage out.”


If the data fed into an ML model is of poor quality, the model will be of poor quality.

Problems with Data

The main data problems we will look at are: insufficient data, too much data, non-representative data, and duplicate data (with missing data and outliers covered in Part 2).

How do we tackle each of these problems?


Insufficient Data

[Sherlock Holmes:] I had come to an entirely erroneous conclusion, which shows, my dear Watson, how dangerous it always is to reason from insufficient data.
- The Adventure of the Speckled Band


Models trained with insufficient data perform poorly in prediction. If you have only a few records for your ML model, it will lead you to one of the two well-known issues in ML modeling below:

Overfitting: reading too much into too little data.

Underfitting: building an overly simplistic model from the available data.

In the real world, insufficient data is a common struggle for projects. You might find that the relevant data is not available, and even when it is, the actual process of collecting it is difficult and time-consuming.

The truth is that there is no great solution for insufficient data: you simply need to find more data sources and wait until you have collected enough relevant data.

Still, there are ways to work around the problem. Note that the techniques we will discuss are not applicable to every use case.

So, what can we do if we have a small dataset?

Model Complexity: if you have little data, you can choose to work with a simpler model; a simpler model works better with less data.

Transfer Learning: if you are working with neural networks and deep learning techniques, you can use transfer learning.

Data Augmentation: you can increase the amount of data by using data augmentation techniques, which are usually applied to image data.

Synthetic Data: understand the kind of data you need to build your model, then use the statistical properties of that data to generate artificial, synthetic data.

Model Complexity

Every machine learning algorithm has its own set of parameters; compare, for example, simple linear regression with decision tree regression.

If you have less data, choose a simpler model with fewer parameters. A simpler model is less susceptible to overfitting and memorizing patterns in your data.

Some machine learning models, like the Naïve Bayes classifier or logistic regression, are simple with few parameters. Decision trees have many more parameters and are considered complex models.

Another option is to train your model using ensemble techniques.

Ensemble Learning: a machine learning technique in which several learners are combined to obtain better performance than any of the individual learners.
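As a rough illustration, here is a minimal scikit-learn sketch that compares a simple logistic regression model with a voting ensemble of several learners; the built-in toy dataset and the specific estimators are illustrative choices only, not a recommendation.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A simple model with few parameters: logistic regression.
simple_model = LogisticRegression(max_iter=5000)

# An ensemble: several different learners combined by majority vote.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=5000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3)),
])

for name, model in [("simple", simple_model), ("ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")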


Transfer Learning

If you are working with neural networks and you don’t have enough data to train your model, transfer learning may solve this problem.

Transfer Learning: the practice of re-using a trained neural network that solves a problem similar to yours, usually leaving the network architecture unchanged and re-using some or all of the model weights.


Transferred knowledge is especially useful when the new dataset is small and not sufficient to train a model from scratch.
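As a minimal sketch of what this looks like in practice (assuming TensorFlow/Keras and an arbitrary two-class image task), you can re-use MobileNetV2 weights trained on ImageNet, freeze them, and add a small task-specific head:

import tensorflow as tf

# Re-use a network trained on ImageNet, without its original classifier head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3),
    include_top=False,
    weights="imagenet",
)
base.trainable = False  # freeze the transferred weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new head for the new task
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # train_ds/val_ds: your own small dataset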

Data Augmentation


Data augmentation techniques allow you to increase the number of training samples; they are typically used with image data. You take the images you are working with and perturb them in some way to generate new images.

You can perturb images by applying scaling, rotation, and affine transforms. These image processing operations are often used as preprocessing steps to make image classification models built with CNNs (convolutional neural networks) more robust, and they can also be used to generate additional samples to work with.
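For instance, a minimal sketch with Keras preprocessing layers (assuming TensorFlow/Keras; the transformation factors are arbitrary examples) could look like this:

import tensorflow as tf

# Random perturbations applied on the fly to each batch of images.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),          # rotate by up to ~10% of a full turn
    tf.keras.layers.RandomZoom(0.1),              # scale up/down by up to 10%
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift height/width by up to 10%
])

images = tf.random.uniform((8, 160, 160, 3))  # stand-in for a real batch of images
new_samples = augment(images, training=True)  # a new, randomly perturbed batch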

Synthetic Data

Synthetic data comes with its own set of problems. Basically, you artificially generate samples that mimic real-world data, so you need to understand the characteristics of the data you need.

You can oversample existing data points to generate new data points, or you can use other techniques to generate artificial data, but this can introduce bias into the existing data.
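As a simple (and deliberately simplistic) sketch of that idea: estimate the statistical properties of a small sample and draw new points from them. Real projects often use techniques such as SMOTE or domain-specific generators instead; the numbers below are made up.

import numpy as np

rng = np.random.default_rng(42)

# Pretend this is the small, real dataset for one class (30 samples, 2 features).
real = rng.normal(loc=[5.0, 2.0], scale=[1.0, 0.5], size=(30, 2))

# Estimate its statistical properties...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and generate synthetic samples that mimic them.
synthetic = rng.multivariate_normal(mean, cov, size=200)
print(real.shape, synthetic.shape)  # (30, 2) (200, 2)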

Too Much Data


It might seem strange that too much data can be a problem, but what is the use of data if it is not the right data? Data might be excessive in two ways:

1- Outdated Historical Data: too many rows.

Working with historical data is important, but if you have too much historical data that is no longer really significant, you might end up with something called ‘Concept Drift’.

Concept Drift: The relationship between features (X-variables) and labels (Y-variables) changes over time; ML models fail to keep up, and consequently their performance suffers.

Concept Drift means that the ML model continues to look at a state of the world that is outdated and no longer significant or relevant.

So, if you are working with historical data, take the following into consideration:

  • If outdated data is not eliminated, it leads to concept drift.
  • Outdated historical data is an especially serious issue when your ML models work with financial data, particularly if you are modeling the stock market.
  • It usually requires human experts to judge which rows to leave out; a simple date-based cutoff is sketched below.
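Here is the date-based cutoff sketch mentioned above, using pandas; the column names, values, and cutoff date are made up, and in a real project the cutoff should come from domain experts rather than a hard-coded constant.

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-01", "2018-06-01", "2023-03-15"]),
    "price": [10.0, 42.0, 55.0],
})

cutoff = pd.Timestamp("2015-01-01")
recent = df[df["date"] >= cutoff]  # keep only rows recent enough to still be relevant
print(recent)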

2- Curse of Dimensionality: too many columns.

Every sample you use to train your ML model might have too many columns, that is, too many features. In its simplest form, the curse of dimensionality means you may end up using irrelevant features that don’t really help your model improve.

The Curse of dimensionality is a huge topic that has been studied in detail by data scientists.

Two specific problems arise when too much data is available:

  • Deciding which data is actually relevant.
  • Aggregating very low-level data into useful features.

Outdated historical data is a fairly hard problem to deal with, but the curse of dimensionality is easier to solve.

How?

You can use Feature Selection to decide which data is relevant.

You can use Feature Engineering to aggregate the low-level data into useful features.

You can perform Dimensionality Reduction to reduce the complexity while retaining most of the information.
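To make these options concrete, here is a small scikit-learn sketch on a built-in toy dataset; the number of selected features and the variance threshold are arbitrary illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
print("original shape:", X.shape)  # (569, 30)

# Feature selection: keep the 10 columns most related to the label.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
print("after feature selection:", X_selected.shape)

# Dimensionality reduction: project onto components keeping ~95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print("after PCA:", X_reduced.shape)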

Non-Representative Data


There are several manifestations of non-representative data. One is feeding the wrong features to your model, but there are others as well. It is possible that the data you collected is inaccurate in some way, and even small errors can significantly impact your model.

Another manifestation of non-representative data is biased data. For example, suppose you are collecting data from 5 sensors in 5 different locations and one of those sensors does not work all the time; your data is biased because you don’t have proportional data from that sensor. Working with biased data leads to biased models that perform poorly in practice.

You can mitigate this by oversampling and undersampling. If you have less data from one of the sensors, you can oversample from the data that you do have, so that you end up with a representative sample.
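A minimal sketch of that oversampling idea, assuming pandas and scikit-learn; the sensor readings below are made up for illustration.

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "sensor": ["A"] * 100 + ["B"] * 20,  # sensor B was offline most of the time
    "reading": list(range(100)) + list(range(20)),
})

majority = df[df["sensor"] == "A"]
minority = df[df["sensor"] == "B"]

# Oversample sensor B (sampling with replacement) until the sources are balanced.
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["sensor"].value_counts())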

Duplicate Data


When you collect data, some records might be duplicates. If the data can be flagged as duplicate, the problem is relatively easy to solve: simply de-duplicate the data before you feed it to your model.
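For duplicates that can be flagged, a minimal pandas sketch (with made-up records) looks like this:

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "purchase": ["apple", "orange", "orange", "apple"],
})

print(df.duplicated().sum())    # how many rows are exact duplicates
deduped = df.drop_duplicates()  # keep only the first occurrence of each record
print(deduped)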

But the world isn’t that simple; duplicates can be hard to identify in some applications, such as real-time streaming. In those cases you may just have to live with the duplicates and account for them.

Data cleaning procedures can significantly mitigate the effects of:
- Missing data
- Outliers

This is what we will talk about in Part 2: The Imperative of Data Cleansing — Part 2.
