Feature Engineering Mastery: Elevate Your Machine Learning Models.

Table of contents:

Paresh Patil
6 min read · Jul 3, 2023

Types of feature engineering
· 1. Feature Transformation:
I) Missing value imputation:
II) Handling categorical values:-
III) Outlier Detection:
IV) Feature Scaling:
· 2. Feature Construction:
· 3. Feature Selection:
· 4. Feature Extraction:

Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of ML algorithms.

Having data does not mean you can feed it directly to a machine learning algorithm; you need to prepare it for the ML model first. Only then do you get good results.

When you create such model-ready data from raw data, the process is called feature engineering.

Feature engineering is an important part of the machine learning development life cycle (MLDLC). It is not an exact science; there is no fixed procedure to follow. It is as much an art.

Types of feature engineering

  1. Feature Transformation
  2. Feature Construction
  3. Feature Selection
  4. Feature Extraction

1. Feature Transformation:

Imagine you have a box of different toys, and you want to find out which ones are the most fun to play with. But just looking at the toys might not give you all the information you need. So, you decide to transform or change the toys in some way to make them easier to understand.

For example, you might measure how big or small each toy is, or how many colors it has. These measurements are called features. But sometimes, these measurements alone may not be enough to tell you which toys are the best. So, you decide to do some tricks with the features to make them more helpful.

One trick you can do is change the size of the toys so they all fit in the same range. This way, you can compare them more easily. Another trick is to change the colors of the toys into numbers, like saying red is 1 and blue is 2. This makes it easier for the computer to understand the colors.

You can also do other tricks, like combining different features together, or finding patterns in the features that are not so obvious. All these tricks help you understand the toys better and figure out which ones are the most fun to play with.

So, feature transformation is like doing magic tricks with the information about the toys to make it easier for us to find out which ones are the best. It helps us see the patterns and relationships between the toys, so we can make better decisions.

I) Missing value imputation:

When you work with real-life data, you will realize that it often has missing values. This can be due to:

  • Data not being filled in intentionally, especially if it is an optional field.
  • Data being corrupted.
  • Human error.
  • If it was a survey, participants might quit the survey halfway.
  • If data is processed automatically by computer applications, a malfunction could cause missing data, e.g., a sensor recording logs is malfunctioning.
  • Fraudulent behavior, such as intentionally deleting data.

The Scikit-Learn library does not accept missing values while training. Before training a model, you must either remove the missing values or fill them in.

This is where the feature engineering pipeline usually starts: filling in the missing values. There are different ways to do it.

If there are only a few missing values, you can simply remove those rows without affecting the data much. If there are many, you fill them in, replacing numerical values with the mean or median. If the data is categorical, you replace missing entries with the most frequent category.
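As a minimal sketch, missing values can be filled with Scikit-Learn's SimpleImputer; the column names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing entries (hypothetical columns)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "salary": [50000, 62000, np.nan, 48000],
    "city": ["Pune", np.nan, "Mumbai", "Pune"],
})

# Numerical columns: replace NaN with the median
num_imputer = SimpleImputer(strategy="median")
df[["age", "salary"]] = num_imputer.fit_transform(df[["age", "salary"]])

# Categorical column: replace NaN with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)
```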

II) Handling categorical values:

Consider the following dataset, which has a column called “Animal”. In this column, there are some categorical values: dog, cat, sheep, horse, lion.

The problem with this kind of data is that Scikit-Learn can handle only numerical data, so you need to convert categorical values to numerical values.
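A minimal sketch of one common approach, one-hot encoding with pandas; the tiny “Animal” frame below is invented for illustration.

```python
import pandas as pd

# Toy dataset with a categorical "Animal" column
df = pd.DataFrame({"Animal": ["dog", "cat", "sheep", "horse", "lion"]})

# One-hot encode: each category becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["Animal"])
print(encoded)
```

For categories with an inherent order (e.g., low/medium/high), an ordinal encoding that maps categories to integers is often used instead.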

III) Outlier Detection:

Outliers are observations situated at an abnormal distance from the other values in the same dataset. Outliers are risky: they can distort the results of a model. The sketch below shows one way to detect them.
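A minimal sketch of one common detection approach, the IQR rule; the salary values below are invented for illustration.

```python
import numpy as np

# Toy salary data with one obvious outlier
salaries = np.array([42000, 45000, 47000, 50000, 52000, 250000])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers)  # [250000]
```

Detected outliers can then be removed, capped to the boundary values, or treated separately, depending on the problem.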

IV) Feature Scaling:

Sometimes your input features have different scales. Consider the following dataset and focus on age and salary.

Age is in the range of tens, while salary is in the range of thousands. Imagine you are working with an algorithm that works by calculating the Euclidean distance between two points, such as KNN.

Feature scaling is important in KNN to prevent features with larger magnitudes from dominating the distance calculation and to ensure fair comparisons between features. (For gradient-based models, it also speeds up convergence.)

Generally, before feeding data to the model, you scale the features, which means bringing them onto the same scale, typically a range such as -1 to 1 or 0 to 1.
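A minimal sketch using Scikit-Learn's StandardScaler; the age and salary values below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: age (tens) and salary (thousands) are on very different scales
X = np.array([
    [25, 48000],
    [32, 54000],
    [40, 61000],
    [51, 90000],
], dtype=float)

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

MinMaxScaler is a common alternative when you specifically want values squeezed into a fixed range such as 0 to 1.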

2. Feature Construction:

The best example of feature construction is in the Titanic dataset.

You must have observed that it has two columns, “sibsp” and “parch.” In feature construction, you typically create new columns based on your domain knowledge.

“sibsp” tells how many siblings and spouses you are traveling with, and “parch” tells how many parents and children you are traveling with.

In this scenario, you can add them up and create a new column called “family,” which represents the number of family members a passenger is traveling with, as in the sketch below.
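A minimal sketch of this construction with pandas; the rows below are invented, and the column names in your copy of the Titanic dataset may be capitalized differently (e.g., “SibSp” and “Parch”).

```python
import pandas as pd

# Toy Titanic-like rows (invented values)
df = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 2, 1],
})

# Construct a new feature: total family members traveling with the passenger
df["family"] = df["sibsp"] + df["parch"]
print(df)
```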

3. Feature Selection:

The best example of feature selection is the MNIST dataset. The MNIST dataset is a collection of 70,000 labeled images of handwritten digits (0–9). It is commonly used as a benchmark dataset for machine learning and computer vision tasks. Each grayscale image has a resolution of 28x28 pixels and is used for training and testing models in tasks such as digit recognition and image classification.

The images have been converted to a table where every row is one image, so the dataset has 784 features (28 × 28 pixels). A model trained on such a high-dimensional dataset takes a long time to train and run.

Instead of considering all the pixels, I can consider only the important ones, for example the pixels near the centre, where the digits actually appear.
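The article does not name a specific method, but one simple way to drop uninformative pixels is Scikit-Learn's VarianceThreshold, which removes near-constant features. The toy array below stands in for MNIST so the sketch runs without downloading anything.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy stand-in for MNIST: 100 "images" of 784 pixels where the border
# pixels are always 0 and only a central 12x12 block varies
rng = np.random.default_rng(0)
X = np.zeros((100, 784))
center = np.arange(784).reshape(28, 28)[8:20, 8:20].ravel()
X[:, center] = rng.integers(0, 256, size=(100, center.size))

# Keep only pixels whose variance exceeds a small threshold,
# i.e. drop the near-constant border pixels
selector = VarianceThreshold(threshold=1.0)
X_selected = selector.fit_transform(X)
print(X.shape, "->", X_selected.shape)  # (100, 784) -> (100, 144)
```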

4. Feature Extraction:

Consider that you have a dataset called “Real estate dataset.”

If you want to remove columns from this dataset, you cannot do it easily. The price of a house depends on both the number of rooms and the number of bathrooms, and both are important.

Rather than taking both rooms and washrooms separately, take the area in sq. ft., because together the rooms and washrooms make up the area of the flat. Instead of two features, you end up with one feature: the area in sq. ft.
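The article does not name a technique here, but a standard way to combine correlated features into fewer new ones is principal component analysis (PCA). A minimal sketch with invented rooms/washrooms data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy real-estate data (invented): rooms and washrooms are strongly correlated
X = np.array([
    [2, 1],
    [3, 2],
    [4, 2],
    [5, 3],
    [6, 3],
], dtype=float)

# Extract a single new feature that captures most of the joint variation
pca = PCA(n_components=1)
X_new = pca.fit_transform(X)
print(X_new.ravel())
print(pca.explained_variance_ratio_)  # close to 1.0 for strongly correlated features
```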

Thank you for joining us on this feature engineering journey. We hope you found valuable insights and practical techniques to enhance your machine learning projects.

Connect with me:

LinkedIn: https://www.linkedin.com/in/pareshpatil122/

GitHub: https://github.com/paresh122

Portfolio: https://pareshpatil-portfolio.netlify.app/

Happy feature engineering!
