Feature Engineering for Machine Learning: A Step-by-Step Guide (Part 1)

Elakiya Sekar
Published in kgxperience
8 min read · Jul 26, 2023

“You can have Data without Information but You cannot have Information without Data” — Daniel Keys Moran

👩‍🔬 Data scientists, assemble! 💻

We’re talking about feature engineering today, which is the most important step in the data science process after data collection. 📊

Why is it so important? 🤔 Well, think of your data as a big, beautiful cake. 🎂 If you don’t clean and process it properly, it’s going to taste like crap💩.

But if you take the time to engineer the features, you can create a delicious and nutritious cake that everyone will love🍰.

Feature Engineering (credits: Serokell)

But first, what exactly is Feature Engineering?

  • Feature engineering is like baking a cake🎂. You need to start with good ingredients (data) 📊, and then you need to mix them together in the right way (feature engineering) 🔀. If you do it right, you’ll have a delicious cake (a good model) 🤩

Looks yummy, right!!? 🤤

OKAY FOCUS 🧠

“Feature engineering is the process of extracting meaningful features from raw data”

We can experiment with different features based on our domain knowledge or our understanding of the data. However, before we can begin, we need to familiarize ourselves with the data.📚🔍

So how do you engineer features? 🤔

Well, there are a few different ways to do it.

  • Feature Transformation🔄
  • Feature Construction🏗️
  • Feature Selection🎯
  • Feature Extraction🔍

Feature Transformation🔄

Feature transformation is the process of making your data more attractive to machine learning algorithms. It’s like taking your data out on a date, and trying to make it look its best. 💃

“Feature transformation is the process of modifying features to make them more suitable for machine learning algorithms”

This includes handling missing values🔍, converting categorical features to numerical values🔢, detecting outliers🎯, and scaling features to a common range📏.

1. Handling Missing Values🔍

Now let’s talk about missing values, which are like the uninvited guests at a party.🎉 They can crash your data and ruin your model if you’re not careful.

So what do you do about them? Well, there are two main approaches:

  • Imputation: This is like filling in the blanks with estimates. You can use the mean, median, or mode of the other values, or apply domain logic to fill in the gaps.
df.fillna(0)
  • Deletion: This is like kicking the uninvited guests out. You can remove the rows or columns that contain missing values.
df.dropna(inplace=True)

Which approach is best?🤔 It depends on the data📊. If there are a lot of missing values, you might want to use imputation🛠. But if there are only a few missing values, you might want to delete them. 🧹
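To make the two approaches concrete, here is a minimal pandas sketch; the DataFrame and its column names are made up purely for illustration:

import pandas as pd

# Toy data with missing values (hypothetical columns)
df = pd.DataFrame({"age": [22, None, 35, 29], "city": ["NY", "LA", None, "NY"]})

# Imputation: fill numeric gaps with the column mean, categorical gaps with the mode
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Deletion: drop any rows that still contain missing values
df = df.dropna()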

2. Handling categorical values🗂️

Data can be divided into two types: numerical and categorical🗂️. Categorical data can be divided into nominal and ordinal data 📝. There are different ways to convert categorical data to numerical data, depending on the type of data. This process is called encoding.

“Encoding refers to the process of converting categorical data into a numerical format”

  • Nominal data is categorical data without any inherent order, such as states or departments. It can be converted to numerical data using one-hot encoding, for example with scikit-learn’s OneHotEncoder() class.
  • Ordinal data is data that has an order, such as grade levels (A+, A, B+, B, C). It can be encoded using ordinal encoding, for example with scikit-learn’s OrdinalEncoder() class.

The above two encoders are used for the explanatory variables (x). For the target variable (y), we should use LabelEncoder(), which is specifically designed for output variables.
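Here is a minimal scikit-learn sketch of all three encoders; the columns and categories below are hypothetical:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# Toy data: a nominal column, an ordinal column, and a target (all made up)
df = pd.DataFrame({
    "state": ["TN", "KA", "TN"],
    "grade": ["A+", "B", "A"],
    "churn": ["yes", "no", "no"],
})

# One-hot encode the nominal column (no inherent order)
state_encoded = OneHotEncoder().fit_transform(df[["state"]]).toarray()

# Ordinal encode the ordered column, listing categories from worst to best
grade_encoded = OrdinalEncoder(categories=[["C", "B", "B+", "A", "A+"]]).fit_transform(df[["grade"]])

# Label encode the target variable
y = LabelEncoder().fit_transform(df["churn"])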

3. Handling Outliers🎯

Outliers are like the weird kids in school🤪. They’re different from the rest of the data, and they can make your model look bad.

Outliers are data points that are significantly different from the rest of the data set. They can affect the accuracy of our model.

Detecting outliers with the IQR (Interquartile Range)

There are two main ways to treat outliers:

  • Trimming✂️: This is like cutting the weird kids out of the school photo. You can remove the outliers from the data set.
  • Capping🎞️: This is like putting the weird kids at the back of the photo. You can replace the outliers with values that are within the range of the rest of the data.

If there are only a few outliers, you might want to trim them✂️. But if there are many, you might want to cap them instead🎞️.

There are a number of methods that can be used to detect and remove outliers, including z-score, IQR, percentile, and Winsorization.
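As a rough sketch, here is IQR-based detection plus both treatments in pandas; the values are made up:

import pandas as pd

# Toy column with one obvious outlier (hypothetical values)
s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: drop values outside the IQR fences
trimmed = s[(s >= lower) & (s <= upper)]

# Capping (Winsorization): clip values to the fences instead of dropping them
capped = s.clip(lower=lower, upper=upper)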

4. Feature scaling📏

Feature scaling is like putting everyone on the same playing field. 🏟️🎯 In a machine learning model, all features are not created equal. Some features may have a much larger range than others, which can give them an unfair advantage. 😲📈 Feature scaling helps to level the playing field by ensuring that all features have the same scale. 🎚️📊

“Feature scaling is a process of transforming the features in a dataset so that they have a common scale.”

This helps to prevent certain features from dominating the model.

There are two main types of feature scaling:

  • Standardization⚖️: Standardization subtracts the mean from each feature and then divides by the standard deviation, so that each feature ends up with a mean of 0 and a standard deviation of 1. It is often used when the data roughly follows a Gaussian distribution, with algorithms such as linear regression and logistic regression.
  • Normalization📏: Normalization rescales the features so that they lie within a fixed range, such as 0 to 1. It is often used when the data does not follow a Gaussian distribution, with algorithms such as decision trees and support vector machines.

Feature scaling is an important step in the machine-learning process. By scaling the features, you can help to improve the performance of your model and make sure that all features are given a fair chance🤝.
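Here is a minimal scikit-learn sketch of both techniques; the toy matrix is made up:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two toy features on very different scales (hypothetical values)
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardization: each feature gets mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Normalization: each feature is rescaled to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)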

Feature Construction🏗️

Feature construction is like decorating a cake🍰. You start with a basic cake, but then you can add all sorts of decorations to make it more delicious and interesting🎉. For example, you might add chocolate chips, nuts, or fruit🍪.

But be careful not to go overboard! 🚫 Too many decorations can make the cake look messy and unappealing🤯. The same is true for feature construction. If you add too many features to your data set, it can become difficult to interpret and analyze.

So, when you’re feature constructing, remember the golden rule of cake decorating: less is more🏆. A few well-chosen features can make your data set more informative and relevant, while too many features can make it less useful.

“The process of developing new features from existing features or upon our domain knowledge is known as feature construction.”

Making the features more informative and relevant to the task at hand helps machine learning models perform better.

There are numerous ways to build features, but some typical techniques include:

  • Repurposing existing features: This is like remixing old songs🎵. You can combine, alter, or derive new features from existing ones to create something new and interesting🎶. For example, you could combine the features “sibsp” and “parch” in the Titanic dataset to create a new feature called “family” (see the sketch after this list).
  • Using domain expertise: This is like consulting a chef👩‍🍳. You can use your understanding of the domain to create new features that matter for the task at hand🍴. For example, if you are building a model to predict customer churn, you could add a feature called “number of months since last purchase” if you know that customers who haven’t made a purchase in a while are more likely to churn.
  • Using feature selection algorithms: This is like hiring a personal shopper👗. You can use these algorithms to determine the most important features in a data set, and then build new features on top of them. It’s like curating the best attributes to create powerful new ones! 💪🌟
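For instance, the “family” feature mentioned above could be built like this (a sketch assuming lower-case Titanic column names):

import pandas as pd

# A Titanic-style DataFrame; only the two relevant columns are sketched here
titanic = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})

# New feature: total family members on board, including the passenger themselves
titanic["family"] = titanic["sibsp"] + titanic["parch"] + 1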

CONCLUSION🧁

In conclusion, feature engineering is a crucial step in the data science process that involves transforming, constructing, selecting, and extracting meaningful features from raw data. Just like baking a delicious cake, we start with good ingredients (data) and use our expertise to create new and relevant features that enhance the performance of our machine learning models. 🍰

In the first part of this blog, we explored the importance of feature transformation, handling missing values, encoding categorical data, dealing with outliers, and the significance of feature scaling. We also discussed the art of feature construction, where we creatively combined existing features and leveraged domain knowledge to build informative attributes.

Remember, the key to successful feature engineering is finding the right balance. 🎯 Adding too many features can make the data messy, while a few well-chosen features can greatly improve model performance.

Stay tuned for the second part of this blog, where we will delve further into advanced feature engineering techniques and real-world examples. 💻🚀

So keep your data scientist hats on, and get ready for more exciting insights and knowledge in the upcoming sequel! 🎩📚

Feel free to Connect🎯

➡️https://www.linkedin.com/in/elakiya-sekar-28465b220/

➡️https://www.instagram.com/elakiya__sekar/

➡️meelakiya24@gmail.com
