
Data Cleaning/Preprocessing Cheatsheet

This post summarizes data cleaning and emphasizes its importance. It covers manual and automatic approaches such as RANSAC, then morphological operations for cleaning binary images, and finishes with concepts like normalization, standardization, and regularization. To keep this post brief, dimensionality reduction techniques like PCA are left for future posts.

The importance of clean data

Data is the lifeblood of the machine learning programming paradigm. Consider regression, for example: we fit a curve or line to our data, then use the formula of that curve or line as a predictor for unseen data. In the following figure, the data are clean.

However, for the following figure, the data contain outliers that influence the model adversely.

An outlier is an observation that lies at an abnormal distance from other values in a random sample of a population. Good machine learning practitioners keep a famous saying in mind at all times:

Garbage In Garbage Out

In data cleaning, outliers are not our only challenge. Noise and imbalanced datasets make life hard too. We also usually scale our data so that the model learns from the variation (changes) in the features rather than from their absolute values.

The following figure shows how noise can affect what we learn. So, denoising should be considered in data prone to noise.

When we have imbalanced data, our model will learn imbalanced patterns. The following figure shows how this happens: in the imbalanced case, the learned classification boundaries are not correct. The figure also contains the word “synthetic”. It refers to generating synthetic samples, a way to increase the amount of data and avoid imbalance issues by filling in the space that the data spans.

The way data are chosen for the training, validation, and testing subsets is essential for trustworthy predictions. If we do not partition them in a balanced way, the evaluation is misleading and the subsets are not representative. The following figure shows how imbalanced data can affect the learned classification boundaries.

Splitting Data into Training, Validation, and Testing sets

The reason for splitting data into two groups, training and testing, is straightforward: a model trained on some data should not be tested on the same data, because it has already seen it. But why do we have a validation set? The validation set is used to test the model between epochs, to decide when to stop training, and to make sure the model is not overfitting. Overfitting happens when the model memorizes the training points; the typical sign is that validation loss increases while training loss keeps decreasing. Once training is finished, the test set is used to evaluate the model.

It is very important that these partitions are also balanced and representative. If our data span some space, the samples in each partition should be scattered evenly over that space. We should not draw a partition from just one specific region of the data space.
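As a minimal sketch of a balanced split, the following splits each class separately so that the training, validation, and testing subsets keep the same class proportions. The function name `stratified_split`, the 70/15/15 fractions, and the toy labels are assumptions made for illustration, not something prescribed by this post.

```python
import numpy as np

def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return train/validation/test index arrays that keep class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = ([], [], [])
    for cls in np.unique(labels):
        # Shuffle the indices of this class only, then cut them into three pieces
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        parts[0].extend(idx[:n_train])
        parts[1].extend(idx[n_train:n_train + n_val])
        parts[2].extend(idx[n_train + n_val:])
    return tuple(np.array(sorted(p)) for p in parts)

# Toy imbalanced labels: 80 samples of class 0 and 20 samples of class 1
labels = np.array([0] * 80 + [1] * 20)
train_idx, val_idx, test_idx = stratified_split(labels)
```

Each subset here ends up with the same 80/20 class ratio as the full dataset, which is the property a purely random cut does not guarantee.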

Manual Data Cleaning/ Processing

In this method, the data scientist responsible for the data sits down, looks at the data, gets to know it, visualizes it, and then decides on specific actions based on its defects. For example, suppose we have an Excel file of data in which some samples (rows) lack some column values. We can exclude those rows, fill the gaps with the average of that column, or fill them with the average of the same column in the neighboring rows. Obviously, this becomes very time-consuming as the size of the data grows.
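To make the first two options concrete, here is a small sketch with NumPy. The toy table and the variable names are made up for illustration, and missing values are assumed to be stored as NaN.

```python
import numpy as np

# Toy table: rows are samples, columns are features, NaN marks a missing value
data = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
])

# Option 1: exclude every row that has at least one missing value
complete_rows = data[~np.isnan(data).any(axis=1)]

# Option 2: fill each gap with the mean of its column, ignoring the NaNs
col_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), col_means, data)
```

Dropping rows is the simplest choice but throws away the values that were present in those rows, while mean filling keeps every row at the cost of inventing values.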

Automatic Data Cleaning/ Processing

In these approaches, we rely on algorithms at the heart of computer programs to process the data for us, for example imputing missing values or removing outliers.

RANdom SAmple Consensus (RANSAC)

This approach aims at excluding outliers from the dataset using a chosen model. The algorithm can be viewed as a loop of three steps:

  1. Randomly sample the number of points required to fit the model
  2. Solve for model parameters using samples
  3. Score by the fraction of inliers within a preset threshold of the model
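The three steps above can be sketched as follows. To keep the example short, it fits a straight line rather than an ellipse, so the model, the function name `ransac_line`, the threshold, the iteration count, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def ransac_line(points, n_iters=100, threshold=0.5, seed=0):
    """Fit y = a*x + b with a basic RANSAC loop; returns (a, b) and an inlier mask."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_params = None
    for _ in range(n_iters):
        # Step 1: randomly sample the minimum number of points (two for a line)
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a*x + b
        # Step 2: solve for the model parameters using the samples
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Step 3: score by the number of points within the threshold
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_params = inliers, (a, b)
    return best_params, best_inliers

# Toy data: y = 2x + 1 with small noise, plus five gross outliers
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=50)
y[::10] += 20
points = np.column_stack([x, y])

(a, b), inliers = ransac_line(points)
```

The returned mask marks which points agree with the best model, so the outliers can be dropped before a final fit on the inliers only.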

Consider the following snippet, which runs RANSAC on a model and data. The data are points that mark the border of a pupil in an image, extracted by applying image processing techniques. As pupils look like ellipses, an ellipse is used as the model for testing RANSAC. This particular version runs for 10 iterations. In each one, 60% of the data is chosen randomly, and the ellipse model’s parameters are estimated from that subset (see model.estimate). Then we check how far the points deviate from the model fitted on that subset. The distance threshold we set is the criterion for scoring: the best subset is the one whose model has the fewest points deviating beyond that distance.

The snippet shows the idea at a lower level. However, RANSAC is also available ready-made in skimage.measure [link].

Techniques for cleaning binary images

In traditional machine learning, as opposed to deep learning with convolutional neural networks (CNNs), special image processing techniques are used to extract the information that is then used for training models and making estimations.

To get a binary image, a color image (with red, green, and blue matrices) is first transformed to grayscale so that there is just one matrix of data. Thresholding that grayscale image then gives a binary image: every pixel is compared to a threshold and, depending on the result, set to zero (black) or to the maximum value (white).

pixel value = 0 : black | pixel value = 255 : white
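Here is a minimal NumPy sketch of those two steps. The luma weights are the common ITU-R BT.601 ones, and the threshold of 128, the function names, and the toy image are assumptions made for illustration.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse the red, green, and blue matrices into a single matrix."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luma weights
    return (rgb[..., :3] @ weights).astype(np.uint8)

def to_binary(gray, threshold=128):
    """Pixels above the threshold become 255 (white), the rest 0 (black)."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Toy 4x4 color image: black background with a bright 2x2 patch in the middle
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[1:3, 1:3] = (250, 250, 250)

binary = to_binary(to_grayscale(rgb))
```

In practice the threshold depends on the lighting and content of the images, so it usually has to be tuned.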

The techniques in question are called “morphological image processing techniques.” The basic morphological operations are dilation and erosion, which are based on the concepts of hit and fit. As a structuring element (think of a filter on images) moves over the image, a fit happens when all of its pixels are covered by the shape, and a hit happens when at least one of its pixels is covered. In the following example, A is a fit, B is a hit, and C is a miss.

Erosion. The output of erosion is one only if the structuring element fits the image. So, eroding a binary image with a structuring element such as a square makes the shape in the binary image smaller.

Erosion can be useful for splitting apart joined objects and stripping away extrusions.

Erosion effect on shapes

Dilation. The output of dilation is one when the structuring element hits the image. So, it enlarges the shape in the binary image.

Dilation is used for repairing breaks in shapes.

Dilation effect on shapes

There are also compound operations: opening and closing. Opening is erosion followed by dilation. It removes small objects while largely keeping the original shape.

Closing is dilation followed by erosion. It fills small holes and gaps while largely keeping the original shape.
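To see the four operations side by side, here is a plain NumPy sketch with a square structuring element. It is meant to show the fit and hit logic, not to replace a library implementation; the function names, the odd kernel size, the choice to treat everything outside the image as background, and the toy image are assumptions.

```python
import numpy as np

def erode(image, size=3):
    """1 only where the size x size square fits entirely inside the shape."""
    pad = size // 2
    padded = np.pad(image, pad, constant_values=0)  # outside counts as background
    out = np.ones_like(image)
    for dy in range(size):
        for dx in range(size):
            out &= padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

def dilate(image, size=3):
    """1 wherever the size x size square hits at least one shape pixel."""
    pad = size // 2
    padded = np.pad(image, pad, constant_values=0)
    out = np.zeros_like(image)
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

def opening(image, size=3):
    return dilate(erode(image, size), size)  # erosion, then dilation

def closing(image, size=3):
    return erode(dilate(image, size), size)  # dilation, then erosion

# Toy binary image (values 0 and 1): an 8x8 square with a one-pixel hole
# and an isolated one-pixel speck
image = np.zeros((15, 15), dtype=np.uint8)
image[3:11, 3:11] = 1
image[6, 6] = 0
image[1, 13] = 1

opened = opening(image)  # the speck disappears, the hole stays
closed = closing(image)  # the hole is filled, the speck stays
```

Because the outside of the image is treated as background here, shapes touching the border would also be eroded; libraries differ in how they handle the border.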

The following example shows how we can extract the cat from the image. We assume that all our images will work with these thresholds. You may argue that this only works for this specific image, and that is true. However, the point of the example is to show how to use morphological operators. Imagine a fixed camera pointed at a specific part of a yard, counting how many times white cats pass by.

First, we change it to a grayscale image with the following snippet.

Second, we threshold the grayscale image:

We should get rid of those dots and make the cat stand out more. So, we erode the image with an ellipse kernel:

Then, to fill the eyes and join the neck and other parts, let’s use the dilate function:


Data Normalization, Standardization, Regularization?!

Suppose we use our data’s absolute values, meaning the raw values that we get from observations. Different features would then have different ranges. Features with a larger range and larger values can change how the machine learning model learns patterns in the data (that is, how its parameters change). In gradient descent, for example, the feature value x appears directly in the parameter update.
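For reference, here is one common form of that update, assuming linear regression trained with a mean squared error loss. The model, the loss, and the symbols are assumptions chosen for illustration: θ_j is the j-th parameter, α the learning rate, m the number of samples, h_θ the model's prediction, and x_j^(i) the value of feature j in sample i.

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
  \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```

The factor x_j^(i) multiplies the error term, so a feature with large values produces large updates for its parameter.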

So, this affects the step size of gradient descent: differences in the ranges of features cause different step sizes for each feature. Another good example is distance-based algorithms such as KNN, K-means, and SVM, which are the most affected by the range of features because, behind the scenes, they use distances between data points to determine their similarity.

On the other hand, some other models like tree-based ones are insensitive to the range.

Normalization. This is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling [scikit-learn link for this goal].
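As a small sketch of what Min-Max scaling computes, here it is written out by hand with NumPy for each feature (column). The function name `min_max_scale` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to the [0, 1] range: (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return (X - col_min) / col_range

# Two features with very different ranges
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])
X_norm = min_max_scale(X)
```

After scaling, both columns span exactly 0 to 1, so neither dominates because of its units.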

Standardization. This is another scaling technique, in which the values are centered around the mean with a unit standard deviation. That is, the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation [scikit-learn link for doing this task].

It is recommended to train our models on raw, normalized, and standardized versions of our datasets and see which one gives better results. As a rule of thumb, standardization is a good fit when the data has a Gaussian (normal) distribution.
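Written out by hand with NumPy, standardization of each feature (column) can be sketched as below. The function name `standardize` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Center each column on zero and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return (X - mean) / std

# Two features with very different ranges
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])
X_std = standardize(X)
```

In a real pipeline, the mean and standard deviation are usually computed on the training subset only and then reused for the validation and testing subsets.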

Regularization. This is not a mechanism for doing anything with the data. It is a way of changing the model, or how it is trained, to avoid overfitting. In other words, this technique discourages learning a more complex or flexible model, to reduce the risk of overfitting [reference].
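One concrete instance is L2 regularization (ridge regression), which adds a penalty on the size of the weights to the loss. The sketch below uses its closed-form solution for a linear model without an intercept; the choice of ridge, the function name `ridge_fit`, the penalty strength, and the toy data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy regression problem: 30 samples, 5 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

w_plain = ridge_fit(X, y, lam=0.0)   # ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # weights are shrunk toward zero
```

A larger penalty pulls the weights toward zero, which is how the model is discouraged from becoming too flexible. Note that the data itself is left untouched.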

References and Reading More Material


