DATA PREPROCESSING TECHNIQUES

Keerthana · Published in AlmaBetter · Jun 6, 2021

Data preprocessing is a Data Mining technique that converts raw data into a format that can be understood. Real-world data is frequently incomplete, inconsistent, lacking certain attributes or trends, and full of inaccuracies. This can result in low-quality data and, as a result, low-quality models built on that data. Preprocessing the data is a way of resolving such problems.

Machines do not comprehend free text, image, or video data; instead, they comprehend 1s and 0s. So putting on a slideshow of all our photographs and expecting our machine learning model to learn from it is probably not going to be adequate.

Data Preprocessing is the step in any Machine Learning process in which the data is changed, or encoded, to make it easier for the machine to parse it. In other words, the algorithm can now easily interpret the data’s features.

Data Preprocessing can be done in four different ways: data cleaning/cleansing, data integration, data transformation, and data reduction.

Data Cleaning:

Data in the real world is frequently incomplete, noisy, and inconsistent. Many parts of the data may be irrelevant or missing. Data cleaning is carried out to handle this. Data cleaning methods aim to fill in missing values, smooth out noise while identifying outliers, and fix data discrepancies. Unclean data can confuse both the analysis and the model, so running the data through various data cleaning/cleansing methods is an important Data Preprocessing step.

(a) Missing Data:

It’s fairly common for your dataset to contain missing values. It could have happened during data collection or as a result of a data validation rule, but missing values must be considered anyway.

  1. Dropping rows/columns: If an entire row consists of NaN values, it carries no information and should be dropped. The same goes for a row or column that is mostly missing, say more than about 65% of its values.
  2. Checking for duplicates: If the same row or column is repeated, drop the duplicates and keep only the first instance, so that the machine learning algorithm does not give that particular data object extra weight or bias.
  3. Estimate missing values: If only a small percentage of the values are missing, basic interpolation methods can fill in the gaps. The most common approach, however, is to fill them in with the feature’s mean, median, or mode (a pandas sketch of all three strategies follows this list).
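
Below is a minimal pandas sketch of these three strategies; the DataFrame and its columns are invented purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 32, 32, np.nan],
    "salary": [50_000, 60_000, np.nan, np.nan, np.nan],
})

# 1. Drop rows where every value is NaN, and columns that are mostly (>65%) missing
df = df.dropna(how="all")
df = df.loc[:, df.isna().mean() <= 0.65]

# 2. Drop duplicate rows, keeping only the first occurrence
df = df.drop_duplicates(keep="first")

# 3. Estimate the remaining missing values with the column mean
#    (median or mode work the same way)
df = df.fillna(df.mean(numeric_only=True))
print(df)
```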

(b) Noisy Data:

Noisy data is meaningless data that machines cannot interpret. It can be caused by poor data collection, data entry errors, and so on. It can be dealt with in the following ways:

  1. Binning Method: This method smooths sorted data. The data is divided into equal-sized segments, and each segment is handled independently. All values in a segment can be replaced by the segment’s mean, or by its boundary values (a sketch of smoothing by bin means follows this list).
  2. Clustering: In this method, related data points are grouped into clusters; values that fall outside any cluster can then be flagged as outliers, although some outliers may still go unnoticed.
  3. Regression: Data can be smoothed by fitting it to a regression function. The regression model may be linear (one independent variable) or multiple (several independent variables).
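
As a rough illustration of smoothing by bin means, the sketch below sorts some invented values, cuts them into equal-frequency bins with pandas, and replaces each value with the mean of its bin.

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]).sort_values()

# Split the sorted values into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: every value becomes the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))
```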

Data Integration

Data integration combines data from multiple sources, such as several databases, into a coherent data store for the analysis task. How can the data be matched up? In one database an analyst finds Customer_ID and in another cust_id; how can he be sure that the two columns refer to the same entity? Databases and data warehouses keep metadata (data about the data), which helps to resolve such conflicts and avoid errors.
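
A small pandas sketch of the Customer_ID / cust_id situation above: assuming the metadata confirms that the two columns describe the same entity, one column is renamed and the two sources are merged into a single coherent table. The tables here are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({"Customer_ID": [1, 2, 3], "order_total": [250, 90, 40]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Delhi", "Chennai"]})

# Align the schemas, then combine the two sources into one coherent table
profiles = profiles.rename(columns={"cust_id": "Customer_ID"})
combined = orders.merge(profiles, on="Customer_ID", how="inner")
print(combined)
```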

Data Transformation

This stage is used to convert the data into a format that can be used in the mining process. This is done in the following ways:

1. Normalization:

It scales the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
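
A minimal sketch of min-max normalization to the 0.0 to 1.0 range, shown once by hand and once with scikit-learn's MinMaxScaler; the salary values are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

salaries = np.array([[20000.0], [35000.0], [50000.0], [90000.0]])

# By hand: (x - min) / (max - min)
manual = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# With scikit-learn
scaled = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(salaries)
print(manual.ravel(), scaled.ravel())
```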

2. Concept Hierarchy Generation:

Using concept hierarchies, low-level or primitive/raw data is replaced with higher-level concepts during data generalization. A categorical attribute such as street, for example, can be generalized to higher-level concepts like city and country. Similarly, numeric attribute values such as age can be mapped to higher-level concepts like youthful, middle-aged, or elderly.
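
A small sketch of concept hierarchy generation for a numeric attribute: raw ages are replaced with the higher-level concepts youthful, middle-aged, and elderly using pandas. The cut points chosen here are arbitrary assumptions.

```python
import pandas as pd

ages = pd.Series([12, 23, 37, 45, 61, 78])

# Map raw ages to higher-level concepts
age_concepts = pd.cut(ages,
                      bins=[0, 30, 60, 120],
                      labels=["youthful", "middle-aged", "elderly"])
print(pd.DataFrame({"age": ages, "concept": age_concepts}))
```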

3. Smoothing

Smoothing works to remove the noise from the data. Such techniques include binning, clustering, and regression.

4. Aggregation

Aggregation is the process of applying summary or aggregation operations to the data. Daily sales data, for example, can be rolled up to calculate monthly and annual totals. Feature aggregation is also worth doing when features are highly correlated or can be combined into a single feature. For example, if the dataset has the height and width of an object, they can be combined into a single feature, area. This decreases dimensionality, and because such features are highly correlated, it also reduces multicollinearity (a small sketch of both ideas follows).
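
The sketch below illustrates both ideas with pandas: daily sales are resampled into monthly totals, and height and width are collapsed into a single area feature. All values are invented.

```python
import numpy as np
import pandas as pd

# Daily sales rolled up to monthly totals
daily = pd.DataFrame(
    {"sales": np.random.randint(100, 500, size=90)},
    index=pd.date_range("2021-01-01", periods=90, freq="D"),
)
monthly = daily.resample("M").sum()

# Feature aggregation: height and width become a single feature, area
items = pd.DataFrame({"height": [2.0, 3.5, 1.2], "width": [4.0, 2.0, 5.0]})
items["area"] = items["height"] * items["width"]
items = items.drop(columns=["height", "width"])

print(monthly.head(), items, sep="\n")
```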

Data Reduction:

Data mining deals with large volumes of data, and analysis becomes harder as that volume grows. Data reduction techniques address this: the goal is to obtain a more compact representation of the data, improving storage efficiency and lowering data storage and analysis costs.

1. Dimensionality Reduction :

A huge number of features may be found in most real-world datasets. Consider an image processing problem: there could be hundreds of features, also known as dimensions, to deal with. As the name suggests, dimensionality reduction seeks to reduce the number of features, and not simply by picking a subset of the existing features (that is a different technique, Feature Subset Selection or feature selection); instead it derives a smaller set of new features that capture most of the information.
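
A minimal sketch of dimensionality reduction with scikit-learn's PCA, projecting ten random, purely illustrative features down to two components.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)          # 100 samples, 10 original features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)     # shape (100, 2)

print(X_reduced.shape, pca.explained_variance_ratio_)
```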

2. Numerosity Reduction:

Data is replaced or estimated using alternative and smaller data representations such as parametric models (which store only the model parameters rather than the actual data, such as Regression and Log-Linear Models) or non-parametric approaches (e.g. Clustering, Sampling, and the use of histograms).
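
A short sketch of two of the non-parametric numerosity-reduction ideas above, on invented data: keeping only a random sample of the values, or keeping only histogram bin counts instead of the raw values.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.normal(loc=50, scale=10, size=10_000))

# Sampling: represent the data by a 5% random sample
sample = values.sample(frac=0.05, random_state=42)

# Histogram: represent the data by 10 bin counts instead of 10,000 raw numbers
counts, bin_edges = np.histogram(values, bins=10)

print(len(sample), counts)
```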

If the data contains text or images, the preprocessing is a little different. The sections below cover the preprocessing of text and image data.

Preprocessing of Text data:

Preprocessing text data is a very important step because the text ultimately has to be converted into features to feed into the model. The objective is to get rid of the characters and words that add no value to us: punctuation, stop words, URLs, HTML code, and so on, and to correct spelling errors. We also want to apply stemming and lemmatization so that the features do not contain duplicated words that convey almost the same meaning.

Steps to perform for text pre-processing

  • Read the text— Read the text data and store it in a variable
  • Store in the list — Using df.tolist() store the sentences in a list.
  • Initialize the Preprocess object and pass the techniques to apply
  • Iterate through the list to get the processed text.

After reading the text data, we apply the Preprocess object from the preprocessing module to each sentence.
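
The Preprocess object mentioned above comes from a preprocessing module that is not shown in this post, so the sketch below reproduces the same kind of cleaning with only the standard library; the sentences and the tiny stop-word set are illustrative assumptions.

```python
import re
import string

# A tiny illustrative stop-word set; a real one would be much larger
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "to", "of", "in"}

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = re.sub(r"<.*?>", " ", text)             # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

sentences = ["Visit https://example.com to read the <b>full</b> post!",
             "Data preprocessing is an important step."]
processed = [preprocess(s) for s in sentences]     # iterate through the list
print(processed)
```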

Preprocessing of Image data:

The term “image pre-processing” refers to operations on images at the most basic level. Measured by entropy (the degree of randomness) as an information metric, these methods do not increase an image’s information content; they generally decrease it. Pre-processing instead aims to improve the image data by suppressing unwanted distortions or enhancing the visual properties that are important for subsequent processing and analysis.

Steps to perform for image pre-processing

  • Read image — Read the images from disk.
  • Resize image — Captured images come in different sizes, while the model expects a fixed input size, so it is good to establish a base size and resize every image to it.
  • Remove noise (denoise) — Using a Gaussian blur (for example inside a processing() helper function) we can smooth the image to remove unwanted noise.
  • Segmentation & Morphology (smoothing edges) — Segment the image, separating the background from the foreground objects, then refine the segmentation with further noise removal (a sketch of these steps follows this list).
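
A sketch of these steps using OpenCV; the file path, target size, and thresholding/morphology settings are illustrative assumptions rather than the post's exact code.

```python
import cv2
import numpy as np

image = cv2.imread("sample.jpg")                       # read image (path is an assumption)
image = cv2.resize(image, (224, 224))                  # resize to a common base size

blurred = cv2.GaussianBlur(image, (5, 5), 0)           # denoise / smooth

# Segmentation: separate foreground from background with Otsu thresholding,
# then clean the mask up with a morphological opening
gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = np.ones((3, 3), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

cv2.imwrite("preprocessed.jpg", mask)
```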

There are 4 different types of Image Pre-Processing techniques and they are listed below.

  1. Pixel brightness transformations/ Brightness corrections
  2. Geometric Transformations
  3. Image Filtering and Segmentation
  4. Fourier transform and Image restoration

In this blog I will talk about Pixel brightness transformations:

Pixel brightness transformations:

The most common Pixel brightness transforms operations are

  1. Gamma correction or Power Law Transform
  2. Histogram equalization
  3. Sigmoid stretching

The basic pixel brightness transform is g(x) = αf(x) + β, where α (alpha) controls the contrast and β (beta) controls the brightness of the image.

1. Gamma correction — It is a non-linear, power-law adjustment applied to individual pixel values.

2. Histogram equalization — It is a contrast enhancement technique.

3. Sigmoid Stretching — It uses a contrast factor ‘C’ and a threshold value, through which the overall contrast enhancement can be controlled by lightening and darkening the image (a NumPy sketch of the linear transform and gamma correction follows).
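
The sketch below applies the linear transform g(x) = αf(x) + β and a gamma correction to a stand-in 8-bit image using NumPy; the α, β, and γ values are arbitrary assumptions.

```python
import numpy as np

image = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)  # stand-in 8-bit image

# Linear transform: alpha controls contrast, beta controls brightness
alpha, beta = 1.3, 20
linear = np.clip(alpha * image.astype(np.float32) + beta, 0, 255).astype(np.uint8)

# Gamma correction (power-law transform): s = 255 * (r / 255) ** gamma
gamma = 0.5                                   # gamma < 1 brightens, gamma > 1 darkens
corrected = (255 * (image / 255.0) ** gamma).astype(np.uint8)

print(linear, corrected, sep="\n")
```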

That’s all for this blog !! Hope the blog helped to picture the data preprocessing techniques. Do support if it helped. See you next time with another blog!!!
