Topic 5: Encodings

Brijesh Soni
6 min read · Feb 4, 2023

Encoding is an important part of Feature Engineering, but what is it?

Encoding in feature engineering refers to the process of converting categorical variables (i.e., variables that can take on a limited number of values) into numerical variables that can be used in machine learning models. This is done because most machine learning models are designed to work with numerical data and cannot handle categorical data directly. There are several methods for encoding categorical variables, including One-Hot Encoding, Ordinal Encoding, and Binary Encoding.

There are various types of encoding in Feature Engineering.

There are several types of encoding used in feature engineering for converting categorical variables into numerical variables:

One-Hot Encoding: This method creates a new binary variable for each unique category in a categorical variable. The new variables take on a value of 1 when the original variable is equal to that category, and 0 otherwise.

Ordinal Encoding: This method assigns an integer value to each category in a categorical variable. The integers should follow the natural order of the categories (for example, small < medium < large); when no such order exists, they are sometimes assigned by frequency or alphabetically (a minimal sketch follows these three items).

Binary Encoding: This method encodes categorical variables as binary numbers. It is similar to ordinal encoding, but instead of using integers, it uses the binary representation of those integers.
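
Ordinal Encoding is easy to get wrong if the integers do not follow the real ranking, so here is a minimal sketch using scikit-learn’s OrdinalEncoder; the 'size' column and its ordering are invented for illustration.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal data: small < medium < large
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Pass the categories explicitly so the integers follow the real ranking
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()

print(df)  # small -> 0, medium -> 1, large -> 2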

Other techniques

  1. Count Encoding: This method replaces each category with the number of times it appears in the dataset.
  2. Target Encoding: This method replaces each category with the mean of the target variable for that category (see the sketch after this list).
  3. Helmert Encoding: This method is used for categorical variables that have an ordinal relationship. It creates new variables that capture the difference between the mean of the target at each level and the mean of the target at the previous levels.
  4. Leave-One-Out Encoding: This method replaces each category with the mean of the target for all samples in that category except the current observation.
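
To make Count and Target Encoding concrete, here is a minimal pandas sketch; the 'city' and 'price' columns are invented for illustration, and in practice the target means should be computed on the training data only to avoid leakage.

import pandas as pd

# Hypothetical data: a categorical feature and a numeric target
df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF', 'LA', 'NY'],
    'price': [10, 20, 12, 30, 22, 11],
})

# Count Encoding: replace each category with its frequency in the dataset
df['city_count'] = df.groupby('city')['city'].transform('count')

# Target Encoding: replace each category with the mean target for that category
df['city_target'] = df.groupby('city')['price'].transform('mean')

print(df)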

It’s important to note that the best encoding method depends on the specific dataset and the machine learning model being used.

How can Categorical & Numerical data be Encoded?

Categorical data can be encoded using various methods such as One-Hot Encoding, Ordinal Encoding, and Binary Encoding. Numerical data can be encoded using techniques such as Normalization and Standardization.

  1. In One-Hot Encoding, a categorical feature is transformed into multiple binary features, with each feature representing a unique category value.
  2. In Ordinal Encoding, a categorical feature is assigned numerical values based on a categorical order or ranking.
  3. In Binary Encoding, categorical features are transformed into multiple binary features by encoding each digit of a category value as a separate feature.
  4. Normalization involves scaling the numerical data to a specific range, typically between 0 and 1, so that all features share the same scale (both techniques are sketched after this list).
  5. Standardization involves transforming the numerical data so that it has a mean of 0 and a standard deviation of 1. This is useful when features are on different scales or when an algorithm assumes roughly zero-centered inputs.
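
Here is a minimal sketch of both techniques using scikit-learn’s MinMaxScaler and StandardScaler; the 'age' column is invented for illustration.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numerical feature
df = pd.DataFrame({'age': [18, 25, 40, 60, 33]})

# Normalization: rescale values into the [0, 1] range
df['age_norm'] = MinMaxScaler().fit_transform(df[['age']]).ravel()

# Standardization: rescale values to mean 0 and standard deviation 1
df['age_std'] = StandardScaler().fit_transform(df[['age']]).ravel()

print(df)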

Can Machine Learning Models be improved by Encoding?

In Machine Learning, encoding is an important step in the preprocessing of data as it helps the model to better understand the relationship between features and the target variable. Encoding is particularly important for categorical and ordinal features, as most machine learning algorithms are designed to work with numerical data.

By encoding categorical and ordinal features, the model can treat each category as a separate entity and capture any patterns or relationships between the categories and the target variable. This, in turn, helps to improve the accuracy and performance of the model.

In addition to improving the performance of the model, encoding also helps to prevent bias. For example, One-Hot Encoding treats each category as a separate entity, so the model does not assume an artificial ordering or a linear relationship between categories, which a naive integer encoding can introduce.

Overall, encoding plays a crucial role in the preparation of data for use in a machine-learning model and can significantly impact the accuracy and performance of the model.

How to implement One-Hot and Binary Encoding in code?

Here is sample Python code using scikit-learn and the category_encoders library to perform One-Hot and Binary Encoding on a categorical feature:

One-Hot Encoder

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data
data = {'color': ['red', 'green', 'blue', 'yellow']}
df = pd.DataFrame(data)

# Create an instance of the OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the OneHotEncoder on the 'color' feature
one_hot = encoder.fit_transform(df[['color']])

# The resulting sparse matrix can be converted into a dataframe for easier analysis
one_hot_df = pd.DataFrame(one_hot.toarray(), columns=encoder.get_feature_names_out(['color']))

# The original dataframe can be concatenated with the one_hot_df for use in a machine learning model
df = pd.concat([df, one_hot_df], axis=1)

print(df)
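
If this runs as expected, the new columns should be color_blue, color_green, color_red, and color_yellow (scikit-learn orders categories alphabetically), each holding a 0/1 indicator alongside the original color column.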

Binary Encoder

import pandas as pd
import category_encoders as ce

# Sample data
data = {'color': ['red', 'green', 'blue', 'yellow']}
df = pd.DataFrame(data)

# Create an instance of the BinaryEncoder
encoder = ce.BinaryEncoder(cols=['color'])

# Fit and transform the BinaryEncoder on the 'color' feature
df = encoder.fit_transform(df)

print(df)
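
Here the four categories are first mapped to ordinal indices and then written out as binary digits, so the output typically replaces the color column with columns such as color_0, color_1, and color_2.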

What is the major difference between One-Hot Encoding and Binary Encoding?

One-Hot Encoding and Binary Encoding are two common techniques used to represent categorical data numerically in a machine-learning model. The main difference between the two is in the number of columns they create for each categorical feature.

One-Hot Encoding creates one column for each unique category value, where each column represents a binary value indicating the presence or absence of a specific category. This results in a sparse matrix with many columns, but each column represents a unique and separate category.

Binary Encoding, on the other hand, represents each category value as a binary code, where each digit of the binary code is represented as a separate column. This results in a much smaller number of columns compared to One-Hot Encoding, but the columns can be harder to interpret.

In general, One-Hot Encoding is preferred when there are a small number of unique categories, as it creates a separate and distinct column for each category. Binary Encoding is preferred when there are a large number of categories, as it reduces the number of columns created and helps to reduce dimensionality.
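
As a rough sanity check of those column counts, here is a sketch comparing the two encoders on an invented feature with eight categories; exact column names and counts may vary slightly between library versions.

import pandas as pd
import category_encoders as ce

# Hypothetical feature with 8 distinct categories
df = pd.DataFrame({'cat': [f'c{i}' for i in range(8)]})

one_hot = pd.get_dummies(df['cat'])                 # one column per category
binary = ce.BinaryEncoder(cols=['cat']).fit_transform(df)

print(one_hot.shape[1])  # 8 columns
print(binary.shape[1])   # about 4 columns (the binary digits of 8 ordinals)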

Ultimately, the choice between One-Hot Encoding and Binary Encoding will depend on the specific use case and the nature of the categorical data.

Summary

Encoding is a crucial step in the feature engineering process of a machine learning model. It is used to represent categorical and ordinal data numerically so that the model can better understand the relationship between the features and the target variable. There are several encoding techniques available, including One-Hot Encoding, Binary Encoding, and others, and the choice of which technique to use will depend on the specific use case and the nature of the data.

It is important to keep in mind that encoding is just one step in the feature engineering process and that other preprocessing steps, such as normalization, scaling, and feature selection, may also be necessary to prepare the data for use in a machine learning model.

If you like my notes, please support me so I can keep creating more of them.

Stay tuned for a new topic, coming soon.

Find me here:

👉 GitHub: https://github.com/Birjesh786

👉 Linkedin: https://www.linkedin.com/in/brijeshsoni007/

👉 Profile Summary: https://sonibri786.wixsite.com/brijeshsoni

Brijesh Soni

🤖 Deep Learning Researcher 🤖 | Data Science volunteer at @ds_chat_bot 👉👉 https://www.instagram.com/ds_chat_bot/