Category Encoders: A Powerful Tool for Data Scientists

Ferdi
7 min read · Aug 5, 2023


Definition

In this article we will discuss encoders for machine learning from the scikit-learn and category_encoders libraries. These encoders convert categorical data into numerical representations, allowing models to work effectively with non-numeric data. Each encoder has its own approach to handling categorical data and is chosen based on the characteristics of the data and the machine learning algorithm being used.

Types

There are many types of encoders; here we are going to discuss four of them:

  1. Binary Encoder
  2. Base N Encoder
  3. One Hot Encoder
  4. Ordinal Encoder

Binary Encoder

In machine learning, a binary encoder converts categorical data into binary code, where binary means a representation using only two symbols, 0 and 1. Each category is first assigned an ordinal integer, and that integer is then spread across several binary columns, so the encoder produces a compact numeric representation of the categories.

This type of encoding is usually used when a categorical feature has many distinct values, because it needs only about log2(n) columns for n categories. It sits between ordinal encoding (one column, but an artificial order) and one-hot encoding (one column per category).
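To make the idea concrete, here is a small sketch in plain Python (an illustration of the principle, not the category_encoders API) of how a handful of categories can be packed into a few binary columns:

```python
import math

# Toy illustration of binary encoding: each category gets an ordinal
# code, and that code is written out as binary digits.
categories = ["red", "green", "blue", "yellow", "purple"]

# +1 because ordinal codes here start at 1, not 0
n_bits = math.ceil(math.log2(len(categories) + 1))

def binary_encode(category):
    code = categories.index(category) + 1  # 1-based ordinal code
    # extract the bits from most significant to least significant
    return [(code >> bit) & 1 for bit in reversed(range(n_bits))]

print(binary_encode("red"))     # [0, 0, 1]
print(binary_encode("purple"))  # [1, 0, 1]
```

Five categories fit into three binary columns, whereas one-hot encoding would need five.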

Advantages of using a binary encoder :

It simplifies the representation and manipulation of information: many categories can be represented with only a few binary columns.

Disadvantages of using a binary encoder :

Reduced interpretability: each binary column mixes several categories together, so individual encoded columns are hard to interpret, and the encoding depends on an arbitrary ordinal assignment of the categories.

Using Binary Encoders :

#1. Install and import the libraries used for encoding.
!pip install category-encoders
import pandas as pd
import numpy as np
from category_encoders import BinaryEncoder

#2. Load the data
df = pd.read_csv('https://raw.githubusercontent.com/ediashta/laptop-price-regression/main/laptops.csv')
df

#3. Do a feature selection.
X = df.drop('Final Price', axis=1)
y = df['Final Price']
X.drop(columns='Laptop', inplace=True)
X

#4. Select the categorical columns and define the encoder
cat_cols = X.select_dtypes(include='object').columns
BE = BinaryEncoder()

#5. Fit_transform the encoder to the data
X_binary = BE.fit_transform(X[cat_cols])
##* remember to only fit on the train data.

#6. Change to DataFrame
X_binary = pd.DataFrame(X_binary)
X_binary

The data after being encoded with the Binary Encoder

Conclusion :

The Binary Encoder changes data into binary numbers, 0s and 1s. It is a simple and fast process on smaller data, but on complex, high-cardinality data it takes more time and increases the number of columns of the original data, depending on your data.

Base N Encoder

A base-n encoder converts a number from its decimal (base-10) representation into base-n, a positional numeral system with n distinct symbols or digits. For categorical data, the BaseNEncoder in category_encoders first assigns each category an ordinal integer and then writes that integer in base n across several columns.

In a decimal system, we use ten distinct digits (0 to 9) to represent numbers. However, other bases are also commonly used in computing and mathematics. The most common examples are binary (base-2), octal (base-8), and hexadecimal (base-16).

Advantages of using Base N Encoder :

Base N encoding can represent categorical data in a more compact form compared to one-hot encoding, especially when the number of categories is relatively high. One-hot encoding creates a binary vector for each category, resulting in a sparse matrix. In contrast, base N encoding can represent multiple categories with a smaller number of digits.
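As a rough sketch of that compactness, the following compares how many columns one-hot versus base-5 encoding needs for k categories (assuming ordinal codes start at 1, as category_encoders does; the exact column count from the library may differ slightly):

```python
import math

def onehot_cols(k):
    return k  # one-hot: one column per category

def base_n_cols(k, base):
    # number of base-n digits needed to write the largest code, k
    return math.ceil(math.log(k + 1, base))

for k in (10, 100, 1000):
    print(k, "categories -> one-hot:", onehot_cols(k),
          "columns, base-5:", base_n_cols(k, 5), "columns")
```

For 1000 categories, one-hot needs 1000 columns while base-5 needs only 5.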

Disadvantages of using Base N Encoder :

When using Base N encoding for categorical data, the resulting encoded values do not directly represent the original categories, which makes the encoded data harder to interpret. In addition, categorical data usually has no natural order, but Base N encoding introduces an artificial numerical order. This may wrongly imply that some categories are “bigger” or “better” than others, leading to wrong conclusions when analyzing the data.

Using Base N Encoder :

#1. Import the library from category-encoders
import pandas as pd
import numpy as np
from category_encoders import BaseNEncoder

#2. The Base N Encoder works on categorical data; here we demonstrate it on the categorical columns of the same data as the previous example
cat_cols = X.select_dtypes(include='object').columns
baseN_data = pd.DataFrame(X[cat_cols])
baseN_data

#3. Define the encoder, here we will use 5 as the base
baseN = BaseNEncoder(base=5)

#4. Fit_transform the encoder to the data
X_baseN = baseN.fit_transform(baseN_data)

The data after being encoded with the Base N Encoder

Conclusion

In summary, base-n encoding involves representing a decimal number using a different set of symbols (digits) based on the desired base. The process includes dividing the decimal number by the base and recording the remainders in reverse order until the quotient becomes zero. The remainders are then combined to form the base-n representation of the original number.
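The division-and-remainder procedure described above can be sketched in a few lines of Python:

```python
def to_base_n(number, base):
    """Convert a non-negative decimal integer to a list of base-n digits."""
    if number == 0:
        return [0]
    digits = []
    while number > 0:
        # divmod gives the quotient and the remainder in one step
        number, remainder = divmod(number, base)
        digits.append(remainder)
    # remainders are collected least-significant first, so reverse them
    return digits[::-1]

print(to_base_n(23, 5))  # [4, 3]  because 4*5 + 3 = 23
print(to_base_n(23, 2))  # [1, 0, 1, 1, 1]
```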

One Hot Encoder

Definition

One-hot encoding is a technique used in machine learning and data preprocessing to convert categorical variables into a numerical format that can be processed by algorithms. It is commonly used when dealing with categorical data, where each value in a categorical variable is represented as a binary vector. Let’s delve into how one-hot encoding works, its advantages, and its disadvantages.
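As a minimal illustration, here is what those binary vectors look like; pandas.get_dummies is used as a convenient stand-in for scikit-learn's OneHotEncoder:

```python
import pandas as pd

# One column per distinct category; a 1 marks the row's category.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df["color"])
print(encoded)
```

Each row has exactly one "hot" entry, and the number of columns equals the number of distinct categories.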

Advantages of One-Hot Encoding:

  • Preservation of Categorical Information: One-hot encoding retains the categorical nature of the data, allowing the algorithm to distinguish between different categories without imposing any ordinality.
  • Compatibility with Algorithms: Many machine learning algorithms require numerical input, so one-hot encoding enables the use of categorical data in these algorithms, such as logistic regression or neural networks.
  • Avoiding Numerical Assumptions: Converting categories into numerical values directly (e.g., assigning 1 to red, 2 to green) may lead the algorithm to assume an inherent order, which might be misleading. One-hot encoding avoids this problem.

Disadvantages of One-Hot Encoding:

  • High Dimensionality: If the categorical variable has many distinct categories, one-hot encoding can lead to a significant increase in the feature space’s dimensionality. This could result in a sparse matrix with many zero entries, which can be computationally expensive and require more memory.
  • Curse of Dimensionality: The curse of dimensionality refers to the increase in the volume of the feature space as the number of dimensions increases. This can lead to overfitting, especially if the dataset is small or the number of categories is large.
  • Collinearity: With k categories, the k one-hot columns always sum to 1, so together with an intercept they are linearly dependent (the “dummy-variable trap”). This can cause multicollinearity issues in certain algorithms like linear regression.
  • Information Loss: One-hot encoding discards information about the relationships between categories, treating them as completely independent, which may not always reflect the true underlying relationships.
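A common mitigation for the collinearity issue is to drop one column per encoded feature; with pandas this is the drop_first option (scikit-learn's OneHotEncoder offers a similar drop parameter):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# k columns: together they always sum to 1, hence the collinearity
full = pd.get_dummies(df["color"])
# k-1 columns: the dropped category becomes the implicit baseline
dropped = pd.get_dummies(df["color"], drop_first=True)

print(full.shape, dropped.shape)  # (4, 3) (4, 2)
```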

Using OneHot Encoder :

#1. Import all the libraries used for encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#2. Define the Encoder
OneHot = OneHotEncoder()

#3. Fit_transform the encoder to the data and assign to a dataframe
cat_cols = X.select_dtypes(include='object').columns
X_onehot = OneHot.fit_transform(X[cat_cols]).toarray()
X_onehot = pd.DataFrame(X_onehot)

#4. Print Output
X_onehot
The data after being encoded with the One-Hot Encoder

Conclusion:

One-hot encoding is a useful technique to convert categorical variables into a numerical format for machine learning. It preserves categorical information, ensures algorithm compatibility, and avoids numerical assumptions. However, it can lead to high dimensionality, potential overfitting, collinearity issues, and information loss about underlying relationships between categories. Careful consideration of the dataset’s characteristics is essential when using one-hot encoding.

Ordinal Encoder

The ordinal encoder is a data preprocessing technique used in machine learning to transform categorical variables with ordinal relationships into numerical values. It assigns a unique integer to each category, based on their order or ranking, while preserving the information about their relative order.
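A minimal sketch with scikit-learn's OrdinalEncoder, passing an explicit category order so the integers reflect the real ranking:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order: small < medium < large
sizes = [["small"], ["large"], ["medium"], ["small"]]
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())  # [0. 2. 1. 0.]
```

Without the categories argument, scikit-learn sorts the categories alphabetically, which is rarely the intended ranking.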

Advantages of ordinal encoder:

  1. Simplicity: Ordinal encoding is straightforward to implement and does not require extensive computational resources.
  2. Preserves ordinal relationships: It retains the ordinal information of the categories, which can be beneficial for certain algorithms that can leverage this order in their learning process.
  3. Compact representation: The encoded values are represented as integers, which reduces the memory footprint compared to one-hot encoding.
  4. Handling new categories: with an explicit setting such as scikit-learn’s handle_unknown='use_encoded_value', an ordinal encoder can map unseen categories at inference time to a chosen sentinel value instead of raising an error.
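For example, with scikit-learn's OrdinalEncoder, unseen categories can be routed to a sentinel value via handle_unknown (the choice of -1 here is just an illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

train = [["small"], ["medium"], ["large"]]
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]],
                         handle_unknown="use_encoded_value",
                         unknown_value=-1)
encoder.fit(train)

# "xl" was never seen during fit, so it maps to the sentinel -1
print(encoder.transform([["xl"]]))  # [[-1.]]
```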

Disadvantages of ordinal encoder:

  1. Not suitable for nominal variables: Ordinal encoding assumes an inherent order among the categories, making it unsuitable for nominal variables without any meaningful ranking.
  2. Arbitrary numerical assignments: The assignment of integers to categories is arbitrary and may lead to unintended consequences, such as implying an incorrect magnitude or distance between the categories.
  3. Impact on algorithms: Some machine learning algorithms might misinterpret the encoded values as continuous or ordinal data, leading to potentially incorrect model results.
  4. Information loss: Ordinal encoding discards the original category labels and their individual information, which might be crucial for certain analyses.

Using Ordinal Encoders :

#1. Import all the libraries used for encoding.
from sklearn.preprocessing import OrdinalEncoder

#2. Sort the values from the ordinal column
screen = [10.1, 10.5, 10.95, 11.6, 12.0, 12.3, 12.4, 12.5, 13.0, 13.3,
          13.4, 13.5, 13.6, 13.9, 14.0, 14.1, 14.2, 14.4, 14.5, 15.0,
          15.3, 15.4, 15.6, 16.0, 16.1, 16.2, 17.0, 17.3, 18.0]

#3. Define the ordinal encoder
ordinal_encoder = OrdinalEncoder(categories=[screen])

#4. Fit and transform the corresponding column
X_ordinal = ordinal_encoder.fit_transform(X[['Screen']])

#5. Print Output
X_ordinal
The data after being encoded with the Ordinal Encoder

Conclusion:

Ordinal encoding is a simple and efficient technique for preprocessing categorical variables with ordinal relationships. It preserves the ordinal information and reduces memory usage compared to one-hot encoding. However, it is not suitable for nominal variables and may lead to arbitrary numerical assignments. It can also impact the interpretation of algorithms and result in information loss. Therefore, it is important to carefully consider the nature of the data and the requirements of the machine-learning task before applying ordinal encoding.

Writers :

  1. Dwi Putra Satria Utama
  2. Ediashta Revindra Amirussalam @ediashtar
  3. Ferdiansyah Ersatiyo @fersatyo
  4. Zidny Yasrah Sallum @zidnyyasrah
