Feature Engineering

Snekhasuresh · Published in featurepreneur · Oct 8, 2022

Are you a data scientist ready to build a model but stuck with raw data?

Well, feature engineering is the process of extracting features from raw data that benefit the model and make it more efficient.

Feature engineering is about creating new input features from your existing ones. In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition.

It involves various steps, depending on the model and the data you are working with.

Some of the important steps are discussed below:

1) Dimensionality Reduction (PCA):

First, compute the correlation between variables.

If the correlation between two variables is above 0.9, they carry almost the same information, so not all of them are required for prediction.

Those similar columns can therefore be dropped or even clubbed together.

e.g., in real estate analysis:

Number of bedrooms and number of rooms are directly related.

If one increases, the other automatically increases as well; therefore, one of the two columns is enough for the model.

Principal Component Analysis

1) An unsupervised statistical technique used to reduce the dimensionality of the dataset.

2) Involves transforming the variables in the dataset into a new set of variables called principal components (PCs).
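
Here's a minimal sketch of both ideas using pandas and scikit-learn (the housing-style columns and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy dataset (column names and values are illustrative)
df = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 5],
    "rooms":    [4, 5, 6, 7, 9],    # strongly correlated with bedrooms
    "area":     [70, 120, 85, 150, 110],
})

# Drop one column from every pair whose absolute correlation exceeds 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)  # ['rooms']

# Alternatively, let PCA club the variables into principal components
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```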

2) Preprocessing/Scaling: (Mainly for neural models)

This can be done in 3 ways (a short sketch of all three follows the list):

1) StandardScaler from sklearn

Values are converted so that each feature has zero mean and unit variance, bringing all features into a closer range.

The scaling is applied and the transformed data is saved.

2) Normalizer (rescales each sample, i.e. each row, to unit norm)

3) MinMaxScaler (rescales each feature into the [0, 1] range by default)
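
Here's a minimal sketch of all three scikit-learn options on a toy array (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])  # toy data with very different column scales

# StandardScaler: zero mean, unit variance per column
print(StandardScaler().fit_transform(X))

# Normalizer: scales each row (sample) to unit norm
print(Normalizer().fit_transform(X))

# MinMaxScaler: rescales each column into the [0, 1] range
print(MinMaxScaler().fit_transform(X))
```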

3) Categorical Encoding (Dummy/One-Hot):

Encoding is the process of applying a specific code, such as letters, symbols or numbers, to convert data into an equivalent numeric representation.

When working with datasets, we often find that some features are categorical. If we pass such a feature directly to our model, it can't understand it: machines require all independent and dependent variables, i.e. input and output features, to be numeric. This means that if our data contains a categorical variable, we must encode it as numbers before we fit our data to the model.
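
Here's a minimal sketch of dummy/one-hot encoding with pandas (the 'city' feature and its values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Chennai", "Mumbai", "Chennai", "Delhi"]})

# One-hot/dummy encoding: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```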

4) Binning:

Binning is the process of grouping or aggregating continuous data into discrete intervals (bins).
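
Here's a minimal sketch using pandas (the age values, bin edges and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 34, 52, 71])

# Group continuous ages into labelled, discrete bins
bins = pd.cut(ages,
              bins=[0, 18, 40, 65, 100],
              labels=["child", "young_adult", "adult", "senior"])
print(bins)
```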

5) Clustering:

Data scientists use clustering to gain important insights from data by observing which groups (or clusters) the data points fall into when a clustering algorithm is applied. Clustering is a form of unsupervised learning: a type of machine learning that searches for patterns in a data set with no pre-existing labels and a minimum of human intervention. Clustering can also be used for anomaly detection, to find data points that are not part of any cluster, i.e. outliers.
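
Here's a minimal sketch with scikit-learn's KMeans on made-up points; the resulting cluster label can itself be appended to the dataset as a new feature:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])  # two obvious groups

# Fit two clusters and read off each point's cluster label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)  # e.g. [1 1 1 0 0 0]
```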

6) Feature Selection:

Different combinations and subsets of features can be tried, keeping only those that make the data more informative for the model.
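
Here's a minimal sketch of one common approach, univariate selection with scikit-learn's SelectKBest (the iris dataset and k=2 are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest statistical relationship to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("Kept feature indices:", selector.get_support(indices=True))
```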

These are a few of the many techniques used in feature engineering.

Feel free to add these to your pipeline so that your model is more accurate and less prone to false positives.

Happy coding!
