Encoding Techniques for Categorical Attributes

Saksham Saxena
9 min read · Nov 3, 2022


When performing a classification analysis, the dependent variable is commonly affected by both qualitative (nominal scale) and ratio scale variables. Because most machine learning algorithms only accept numerical inputs, these categorical variables must be encoded into numerical values. This blog describes eleven categorical variable encoding methods that can be applied to a categorical dataset.

The different types of encoding techniques are:

  1. One Hot Encoding
  2. Label Encoding
  3. Label Binarizer
  4. Leave One Out Encoding
  5. Hashing Encoding
  6. Weight of Evidence
  7. Helmert Encoding
  8. CatBoost Encoding
  9. James-Stein Encoding
  10. M Estimator Encoding
  11. Sum Encoder

1. One Hot Encoding

The most popular encoding approach is One Hot Encoding. Every level of the categorical variable is compared to a fixed reference level. A single variable with n observations and d distinct values is converted into d binary variables, each with n observations. Each observation indicates the presence (1) or absence (0) of the corresponding binary variable.

Dataframe:

After One Hot encoding

A 0 or 1 marks the absence or presence, respectively, of a particular gender in each row of the DataFrame.

Code:

new_df = pd.get_dummies(columns=['Sex'], data=df)

‘get_dummies’ is the function that converts the ‘Sex’ column into dummy variables

2. Label Encoding:

In the label encoding technique, each label of a categorical variable is assigned an integer value. Although this approach is quite straightforward, it can be hard to justify a particular assignment for a given problem (especially with categorical variables representing unordered data). Label encoding transforms the data into machine-readable form, but it also gives each class a distinct number (beginning at 0), which can create priority issues during training: a label with a high value may be given more weight than one with a low value.

Dataframe:

After Label Encoding:

0, 1 and 2 represent Female, Male and Non-binary respectively

Code:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['Sex'] = le.fit_transform(df['Sex'])

‘LabelEncoder’ is the class that encodes the categorical column ‘Sex’ into numerical values.

3. Label Binarizer

LabelBinarizer converts each variable to binary in a matrix where each unique value is represented as a column. In other words, it converts a list into a matrix with exactly as many columns as there are unique values in the input collection.
For an input with the labels [1, 10, 12], the output is a three-column matrix with one column per label; each row then indicates, in binary, whether the corresponding observation is a 1, a 10 or a 12.
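
As a quick illustration of the [1, 10, 12] example above, here is a minimal sketch (not part of the original walkthrough):

from sklearn.preprocessing import LabelBinarizer

# Each unique input value becomes its own binary column
lb = LabelBinarizer()
print(lb.fit_transform([1, 10, 12]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]
print(lb.classes_)  # [ 1 10 12]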

Dataframe:

After Label Binarizer

Label Binarizer is similar to One Hot Encoding, but the result is an np.array rather than a DataFrame

Code:

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
new_df = lb.fit_transform(df['Sex'])

‘LabelBinarizer’ is the class that encodes the categorical column ‘Sex’ into an np.array.

4. Leave One Out Encoding

When using Leave One Out encoding, the target values of all records that share the same value of the categorical feature are averaged to determine the encoding, and the record currently being considered is excluded from that average, hence the name “Leave One Out.” The encoding differs slightly between the training data set and the test data set: for validation or prediction data the current record does not need to be excluded, and the randomization factor is not applied.

The encoding for a specific value of a specific categorical variable is as follows.

ci = (Σ j≠i tj / (n − 1 + R)) × (1 + εi), where
ci = encoded value for the ith record
tj = target variable value for the jth record
n = number of records with the same categorical variable value
R = regularization factor
εi = zero-mean random variable with normal distribution N(0, s)
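
As a quick worked example with R = 0 and no noise term (using the Dept / Yearly Salary data built in the code below): Dept ‘1’ appears in three rows with salaries 120, 100 and 100, so the encoding for the first of those rows is the mean of the other two, (100 + 100) / 2 = 100.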

Dataframe:

After Leave one out Encoding:

The ‘Value’ column holds the encoded value for each row, i.e. the mean yearly salary of the other rows in the same department

Code:

Creating the Dataframe

import pandas as pd
data = [['1', 120], ['2', 120], ['3', 140],
['2', 100], ['3', 70], ['1', 100], ['2', 60],
['3', 110], ['1', 100], ['3', 70]]
df = pd.DataFrame(data, columns=['Dept', 'Yearly Salary'])

Encoding the department column

import category_encoders as ce
looenc = ce.LeaveOneOutEncoder()
df_dep = looenc.fit_transform(df['Dept'], df['Yearly Salary'])
df_dep = df_dep.rename({'Dept': 'Value'}, axis=1)

Joining the old and new dataframe

df_new = df.join(df_dep)

‘category_encoders’ is the library that provides the encoder used on the categorical column

5. Hashing Encoding

A hash function converts a string of characters into a fixed-size hash value. This method is very useful because it requires little memory and can handle high-cardinality categorical data. Feature hashing is a powerful approach for managing sparse, high-dimensional features in machine learning. It is fast, simple, memory-efficient and well suited to online learning scenarios. It is an approximation, but in many machine learning problems the accuracy tradeoff is surprisingly modest.

The fundamental concept behind feature hashing is straightforward: rather than keeping a one-to-one mapping of categorical feature values to positions in the feature vector, we use a hash function to map each feature value to a position in a vector of fixed (and usually much smaller) dimension.

One-hot encoding, for instance, gives each possible feature value a unique index in the feature vector, so a feature with one million possible values is translated into a vector of size one million (one index per feature value). With feature hashing, the same feature can be hashed into a vector of substantially smaller size, such as 100,000 or even 10,000.

Dataframe:

After hashing:

Code:

from sklearn.feature_extraction import FeatureHasher
# n_features is the number of columns in the hashed output
h = FeatureHasher(n_features=3, input_type='string')
# transform the column; each value is wrapped in a list so the whole string
# is hashed as a single feature rather than character by character
hashed_Feature = h.fit_transform(df['nom_0'].apply(lambda v: [v]))
hashed_Feature = hashed_Feature.toarray()
df = pd.concat([df, pd.DataFrame(hashed_Feature)], axis=1)
df.head(10)

‘FeatureHasher’ is the class that hash-encodes the attribute

6. Weight of Evidence Encoding

Weight of Evidence (WoE) measures the “strength” of a grouping technique in separating good outcomes from bad ones. The technique was originally developed to build predictive models for assessing the risk of loan default in the credit and financial industries. The weight of evidence tells us how much the evidence supports or undermines a hypothesis.

WoE is 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods) in a group, the odds ratio is less than 1 and the WoE is negative; if P(Goods) > P(Bads), the WoE is greater than 0.

Because the Logit transformation is simply the log of the odds, ln(P(Goods)/P(Bads)), WoE is particularly well suited to Logistic Regression. When WoE-coded predictors are used in Logistic Regression, all predictors are prepared and coded on the same scale, so their coefficients in the linear logistic regression equation can be compared directly.
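
For each category, the weight of evidence is commonly computed as shown below (a standard formulation; the category_encoders implementation additionally applies a regularization term to avoid divide-by-zero issues):

WoE = ln( (Goods in the category / Total Goods) / (Bads in the category / Total Bads) )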

Dataframe:

After Weight of evidence encoding:

Code:

from category_encoders import WOEEncoder
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'c', 'c'], 'target': [1, 0, 0, 1, 0, 0, 1, 1, 0]})
woe = WOEEncoder(cols=['cat'], random_state=42)
X = df['cat']
y = df.target
encoded_df = woe.fit_transform(X, y)

‘WOEEncoder’ is the class that calculates the weight of evidence for the different categories

7. Helmert Encoding:

In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all of the previous levels.

The variant implemented in category_encoders is also called Reverse Helmert Coding: each level’s mean of the dependent variable is compared to the mean over the preceding levels.

Dataframe:

After Helmert Encoding over Dept:

Code:

import category_encoders as ce
encoder = ce.HelmertEncoder(cols=['Dept'])
new_df = encoder.fit_transform(df['Dept'])
new_hdf = pd.concat([df, new_df], axis=1)
new_hdf

‘HelmertEncoder’ is the class that encodes the categorical ‘Dept’ column into numerical values

8. CatBoost Encoding:

A common method for category encoding is target encoding: a categorical feature is replaced with the average target value for that category in the training data, blended with the target probability over the entire dataset. However, because the target itself is used to construct the feature, this introduces target leakage; such models frequently overfit and perform poorly on unseen data.

The CatBoost encoder attempts to solve the target leakage problem by combining target encoding with an ordering principle, similar in spirit to the validation of time series data. The target statistic for the current row is calculated only from the rows (observations) that precede it, so its value depends on the observed history.

TargetCount: the sum of the target values for the current categorical value over the preceding rows (up to, but not including, the current one).

Prior: a constant equal to (sum of target values over the entire dataset) / (total number of observations, i.e. rows, in the dataset).

FeatureCount: the number of preceding rows whose categorical value is the same as the current one.
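
Putting these together, a standard formulation of the encoded value for a row (with the smoothing parameter a left at its default of 1, as in the code below) is:

encoded value = (TargetCount + Prior) / (FeatureCount + 1)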

Dataframe

After CatBoost Encoding:

Code:

import category_encoders as ce

target = df[['target']]
train = df.drop('target', axis=1)

# Define the CatBoost encoder (the constructor's default parameters are shown explicitly)
cbe_encoder = ce.cat_boost.CatBoostEncoder(verbose=0,
    cols=None, drop_invariant=False, return_df=True,
    handle_unknown='value', handle_missing='value',
    random_state=None, sigma=None, a=1)

# Fit the encoder and transform the features
cbe_encoder.fit(train, target)
train_cbe = cbe_encoder.transform(train)

9. James-Stein Encoding:

The James-Stein estimator returns the following weighted average for each feature value:

The average target value for that observed feature value.
The average target value over the whole dataset (regardless of the feature value).

The James-Stein encoder shrinks each category average toward the global average. It is a target-based encoder. However, the James-Stein estimator has one practical drawback: it was defined only for normal distributions.

Disadvantage:

It is only defined for a normal distribution (which is rarely the case in practice).
To work around this, we can either use a beta distribution or convert binary targets with a log-odds ratio, as is done in the WOE encoder (the log-odds conversion is used by default because it is straightforward).
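
A minimal sketch with category_encoders is shown below; the ‘cat’/‘target’ frame is a hypothetical example (the same shape as the one used in the Weight of Evidence section above), not data from the original post.

import category_encoders as ce
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'c', 'c'],
                   'target': [1, 0, 0, 1, 0, 0, 1, 1, 0]})

# JamesSteinEncoder shrinks each category's mean target toward the global target mean
js = ce.JamesSteinEncoder(cols=['cat'])
encoded_df = js.fit_transform(df['cat'], df['target'])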

10. M Estimator Encoding:

The M-Estimate Encoder is a more straightforward variant of the Target Encoder.
It has just one hyperparameter, m, which represents the regularization power.
A larger value of m produces stronger shrinkage.
The recommended range of values for m is 1 to 100.
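
A minimal sketch with category_encoders (the ‘cat’/‘target’ frame is the same hypothetical example used above; the formula in the comment is the usual m-estimate smoothing, stated here as an assumption about the implementation):

import category_encoders as ce
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'a', 'a', 'b', 'c', 'c'],
                   'target': [1, 0, 0, 1, 0, 0, 1, 1, 0]})

# m controls the regularization power: larger m shrinks each category mean
# more strongly toward the global target mean, roughly
# encoding = (sum of targets in category + m * global mean) / (category count + m)
me = ce.MEstimateEncoder(cols=['cat'], m=2.0)
encoded_df = me.fit_transform(df['cat'], df['target'])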

11. Sum Encoder

Sum Encoder compares the mean of the dependent variable (the target) for a given level of a categorical column to the overall mean of the target. Both Sum Encoding and One Hot Encoding (OHE) are frequently used in linear regression (LR) type models. The interpretation of the LR coefficients differs between the two: in the Sum Encoder model the intercept represents the overall mean (across all conditions) and the coefficients are directly interpretable as main effects, whereas in the OHE model the intercept represents the mean of the baseline condition and the coefficients represent simple effects (the difference between a particular condition and the baseline).
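
A minimal sketch with category_encoders, reusing the Dept column from the Leave One Out example above (Sum Encoding is a contrast coding, so no target is required to fit it):

import category_encoders as ce
import pandas as pd

# Reuse the Dept / Yearly Salary frame from the Leave One Out example
df = pd.DataFrame([['1', 120], ['2', 120], ['3', 140], ['2', 100], ['3', 70],
                   ['1', 100], ['2', 60], ['3', 110], ['1', 100], ['3', 70]],
                  columns=['Dept', 'Yearly Salary'])

# Sum (deviation) coding produces contrast columns whose regression coefficients
# compare each level's mean against the overall mean
se = ce.SumEncoder(cols=['Dept'])
new_df = pd.concat([df, se.fit_transform(df['Dept'])], axis=1)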
