Preprocessing the data


In this blog, let's discuss the common data preprocessing methods. It may look long, but each method is short and practical.

The quality of the data determines the performance of a machine learning algorithm, and quality here means well-preprocessed data. Hence preprocessing is essential for building a model.

Here are the various methods of preprocessing, covered one by one below.

1.Handling missing values:

Most machine learning algorithms don't support data with null values, so there is a need to handle them.

1.1-Imputation

Imputation is the process of replacing missing data with substituted values.

If the substituted value is the mean, it's mean imputation; if it's the median, it's median imputation; and if it's the mode, it's mode imputation.
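As a minimal sketch (the column names here are made up for illustration), all three imputations are one-liners in pandas:

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 31, 40], "city": ["A", "B", None, "A"]})

    df["age"] = df["age"].fillna(df["age"].mean())        # mean imputation
    # df["age"] = df["age"].fillna(df["age"].median())    # median imputation instead
    df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation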

1.2-Prediction of columns with missing values as target:

Consider the attribute with missing values as the target and the remaining attributes as features, then predict the target and fill in the missing values with the predictions.

For example, if the height column has missing values, we can predict them by making height the target attribute.
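A hedged sketch of this idea, assuming a toy table where height is partially missing and weight and age are complete:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "weight": [60, 72, 80, 55, 68],
        "age":    [22, 30, 41, 19, 27],
        "height": [165, 178, None, 160, None],
    })

    known = df[df["height"].notna()]   # rows where the target is present
    missing = df["height"].isna()

    model = LinearRegression().fit(known[["weight", "age"]], known["height"])
    df.loc[missing, "height"] = model.predict(df.loc[missing, ["weight", "age"]])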

1.3-Omitting the columns:

We can omit a particular column when more than 50% of its values are missing.
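In pandas, for instance (the 50% threshold follows the rule of thumb above):

    import pandas as pd

    df = pd.DataFrame({"a": [1, None, None, None], "b": [1, 2, 3, None]})

    # keep only the columns where at most 50% of the values are missing
    df = df.loc[:, df.isna().mean() <= 0.5]   # drops "a" (75% missing)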

1.4-Creating a category:

The most common and popular approach is to model the missing values in a categorical column as a new category called "unknown".

All the missing entries then simply fall into the "unknown" category instead of being dropped or guessed.
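In pandas this is a one-liner (the column name is illustrative):

    import pandas as pd

    df = pd.DataFrame({"city": ["A", None, "B", None]})
    df["city"] = df["city"].fillna("unknown")   # missing values become their own category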

1.5-Go on with algorithms that support missing values:

We can use certain machine learning algorithms that support missing values in the data natively; well-known examples are gradient-boosting libraries such as XGBoost and LightGBM.
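As one concrete example, scikit-learn's histogram-based gradient boosting accepts NaN values directly (the tiny arrays here are just toy data):

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
    y = np.array([0, 1, 0, 1])

    # no imputation step needed: NaNs are handled inside the tree splits
    clf = HistGradientBoostingClassifier().fit(X, y)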

2.Feature selection:

Feature selection is the process of selecting the features that contribute most to the prediction variable or output you are interested in.

Having irrelevant features in your data can decrease the accuracy of the models and make your model learn based on irrelevant features.

If a dataset contains 100 features, we can't process them all; it would consume too much time. So there is a need to select the important features that contribute most to the prediction variable or output.

2.1-Filter method:

Filter methods measure the relevance of features (independent variables) and are faster compared to wrapper methods, as they do not involve training models.

2.1.1-Correlation:

For example, as an employee's designation rises, the yearly and monthly income rise with it; correlation captures such relationships between a feature and the target (or between two features).

I discussed correlation in a separate blog.
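As a minimal sketch of correlation-based filtering (column names, toy values and the 0.3 cutoff are all made-up choices):

    import pandas as pd

    df = pd.DataFrame({
        "designation":    [1, 2, 3, 4, 5],
        "monthly_income": [30, 45, 60, 80, 100],
        "noise":          [5, 1, 4, 2, 3],
        "target":         [35, 50, 65, 85, 105],
    })

    corr = df.corr()["target"].abs().sort_values(ascending=False)
    selected = corr[corr > 0.3].index.drop("target")  # keep well-correlated features
    print(list(selected))   # ['designation', 'monthly_income']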

2.1.2-Chi-Square:

Correlation is about linear relationship between two continuous variables. Chi-square is usually about the independence of two categorical variables.

(I cover the types of variables in a separate blog.)

It tells whether the values of one categorical variable depend on the values of the other categorical variable, so chi-square gives us a measure of that relationship.
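A small sketch using SciPy's chi-square test of independence on a made-up contingency table:

    import pandas as pd
    from scipy.stats import chi2_contingency

    gender = pd.Series(["M", "F", "M", "F", "M", "F"], name="gender")
    pref   = pd.Series(["A", "A", "B", "B", "A", "B"], name="preference")

    table = pd.crosstab(gender, pref)   # 2x2 contingency table of counts
    stat, p_value, dof, expected = chi2_contingency(table)
    print(p_value)   # a small p-value suggests the two variables are dependent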

2.1.3-LDA:

LDA (Linear Discriminant Analysis) is a dimensionality reduction technique. It reduces the number of dimensions (i.e. variables) in the dataset while retaining as much information as possible.

It is more beneficial for classification problems.
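A minimal sketch on the iris dataset, where four features are projected down to two discriminant components:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    X_reduced = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
    print(X_reduced.shape)   # (150, 2): 4 features reduced to 2 components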

2.1.4-ANOVA:

ANOVA — Analysis of variance

The variance of a feature variable indicates how much it impacts the response variable; more precisely, the ANOVA F-test compares the variance between the target classes with the variance within them for each feature.

If the variance is low, it implies there is little impact of this feature on the response, and vice versa.
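A short sketch with scikit-learn's ANOVA F-test scorer (k=2 is an arbitrary choice):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)
    selector = SelectKBest(f_classif, k=2).fit(X, y)
    print(selector.scores_)   # higher F-score = stronger effect on the response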

2.2-Wrapper method:

2.2.1-Forward selection:

It is an iterative method that starts with no features. In each iteration we add the feature that best improves our model; if adding a new variable does not improve the performance of the model, we drop that variable.

For example, suppose we add one variable per iteration: C1, C2 and C3 each improve the model, but on the 4th iteration adding C4 reduces performance, so we skip C4 and leave it out.
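A hedged sketch with scikit-learn's SequentialFeatureSelector; the estimator and the number of features to keep are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=2,
        direction="forward",        # start empty, greedily add features
    ).fit(X, y)
    print(sfs.get_support())        # boolean mask of the selected features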

2.2.2-Backward elimination:

Backward elimination is the reverse of forward selection. It is an iterative method that starts with all the features.

At each iteration, the least significant variable, the one contributing least to the performance of the model, is removed.

This is repeated until no improvement is observed on removing a feature.
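The same SequentialFeatureSelector sketch works here with the direction flipped:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=2,
        direction="backward",       # start with all features, greedily drop
    ).fit(X, y)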

2.2.3-Recursive feature elimination:

It is a greedy optimization algorithm which aims to find the best performing feature subset.

It repeatedly creates models, setting aside the best or the worst performing feature at each iteration, and constructs the next model with the remaining features until all the features are exhausted.

This method then ranks the features based on the order of their elimination.
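A minimal sketch with scikit-learn's RFE (again with illustrative choices of estimator and subset size):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
    print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier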

2.3-Embedded method:

Lasso and Ridge regression:

These are simple techniques to reduce model complexity and prevent over-fitting. The regularization penalty shrinks the coefficients of irrelevant features (all the way to exactly zero in lasso), so feature selection happens inside the model training itself.

Let's discuss ridge and lasso in a separate blog.
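Still, as a hedged sketch of lasso-based embedded selection (the dataset and alpha are illustrative):

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)
    selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
    print(selector.get_support())   # True = coefficient survived the L1 penalty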

3.Removing duplicates:

While training a supervised algorithm, the usual assumptions are that:

  1. Data points are independent and identically distributed.
  2. Training and testing data is sampled from the same distribution.

In light of these assumptions, you should not blindly throw out the identical data points: if they are genuine independent samples from the underlying distribution, removing them distorts that distribution. Only duplicates introduced by errors in data collection should be removed.
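If you do conclude that the duplicates are collection errors, pandas makes inspecting and dropping them straightforward:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
    print(df.duplicated().sum())      # count exact duplicate rows: 1
    df_clean = df.drop_duplicates()   # drop only if the duplicates are erroneous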

4.Data wrangling:

Data wrangling or munging is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.

Hope this blog gives you a good idea of preprocessing the data. I will try to explain each and every topic in a new blog.

Thank you! :-)
