A summary of encoding methods for categorical features

Darong Liu
7 min read · Aug 6, 2022


Encoding Methods for Categorical Features

In machine learning modeling, many traditional models, such as logistic regression, support vector machines, and K-nearest neighbors, cannot use categorical features directly for training. In recent years, however, newer models such as LightGBM and CatBoost have started to support modeling categorical features directly. Processing categorical features is therefore an essential step before machine learning modeling.

This article systematically reviews nine encoding methods for categorical features.

Background

When preprocessing data, we encounter categorical variables, which must be encoded before they can be fed into a model. Depending on the classification criterion, categorical variables can be grouped as follows:

  1. By ordering: ordered versus unordered (nominal) categorical features.
  2. By cardinality: high-cardinality versus low-cardinality categorical features.

Which encoding methods are suitable depends on the kind of categorical feature and on the task. This article focuses on common, easy-to-use categorical encoding methods, in the hope that it is helpful.

Methods

Label Encoder

Label encoding simply assigns a different integer label to each category. It is a hard mapping, and its advantage is that it is simple and direct. A common rule of thumb is that it suits ordered categorical features; for classification tasks with few categories, LightGBM can still perform well as long as the column is declared through categorical_feature. It is not recommended for high-cardinality features, however, and the natural numbers it produces are not linearly separable for regression tasks.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
x = ['male', 'female', 'male']
x_trans = le.fit_transform(x)
>>> x_trans
array([1, 0, 1], dtype=int64)

Hash Encoder

Hash encoding maps each category to a binary code through a hash function on top of label encoding. The advantage is that a hash encoder does not need to maintain a category dictionary, and it can still encode categories that appear later but were never seen in the training set. However, because the information is spread bit by bit across the hashed columns, it can be harder for the model to learn from.

# !pip install category_encoders
import category_encoders as ce
import pandas as pd

x = pd.DataFrame({'gender': [2, 1, 1]})
ce_encoder = ce.HashingEncoder(cols=['gender']).fit(x)
x_trans = ce_encoder.transform(x)
>>> x_trans
   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7
0      0      0      0      0      1      0      0      0
1      0      0      0      1      0      0      0      0
2      0      0      0      1      0      0      0      0

One-hot Encoder

One-hot encoding solves the problem that label encoding is not linearly separable in regression tasks. It uses an N-bit status register to encode N states: each category is represented with 0s and a single 1, and the resulting variables are called dummy variables. Like label encoding, it does not handle high-cardinality features well: the higher the cardinality, the more sparse columns it produces, which costs memory and training time.

x = pd.DataFrame({'gender': ['male', 'female', 'male']})
x_dummies = pd.get_dummies(x['gender'])
>>> x_dummies
   female  male
0       0     1
1       1     0
2       0     1

Count Encoder

Count encoding is also called frequency encoding: each category is replaced by the number of samples that take that category value. It directly reflects how frequent the category is in the dataset, but the drawback is that the meaning of the category itself is ignored. For example, two categories may occur equally often yet matter differently to the business, and the encoding cannot tell them apart.

import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'cat_feat': ['A', 'A', 'B', 'A', 'B', 'A']})
count_encoder = ce.CountEncoder(cols=['cat_feat']).fit(df)
df_trans = count_encoder.transform(df)
>>> df_trans
   cat_feat
0         4
1         4
2         2
3         4
4         2
5         4

Bin Encoder

Histogram encoding (bin encoding) is a kind of target encoding suited to classification tasks. For each value of the categorical feature, it counts the proportion of samples falling into each label class and uses those proportions as the encoding. Histogram encoding shows clearly how much each category contributes to each prediction label [1]. The disadvantage is that, because the labels are used, an inconsistent distribution of the categorical feature between the training and test sets easily leads to over-fitting. In addition, the number of encoded columns equals the number of label classes, so a label with many classes can add a space and time burden to training.

import pandas as pd

class hist_encoder:
    def __init__(self, df, encode_feat_name, label_name):
        self.df = df.copy()
        self.encode_feat_name = encode_feat_name
        self.label_name = label_name

    def fit(self):
        # Count samples per (category value, label class) pair.
        self.df['numerator'] = 1
        numerator_df = self.df.groupby([self.encode_feat_name, self.label_name])['numerator'].count().reset_index()

        # Count samples per category value.
        self.df['denumerator'] = 1
        denumerator_df = self.df.groupby(self.encode_feat_name)['denumerator'].count().reset_index()

        # The encoding is the within-category proportion of each label class.
        encoder_df = pd.merge(numerator_df, denumerator_df, on=self.encode_feat_name)
        encoder_df['encode'] = encoder_df['numerator'] / encoder_df['denumerator']
        self.encoder_df = encoder_df[[self.encode_feat_name, self.label_name, 'encode']]

    def transform(self, test_df):
        test_trans_df = test_df.copy()
        for label_cat in test_trans_df[self.label_name].unique():
            hist_feat = []
            for cat_feat_val in test_trans_df[self.encode_feat_name].values:
                try:
                    encode_val = self.encoder_df[
                        (self.encoder_df[self.label_name] == label_cat) &
                        (self.encoder_df[self.encode_feat_name] == cat_feat_val)
                    ]['encode'].item()
                    hist_feat.append(encode_val)
                except ValueError:
                    # This (category value, label class) pair never appeared during fit.
                    hist_feat.append(0)
            encode_fname = self.encode_feat_name + '_en{}'.format(str(label_cat))
            test_trans_df[encode_fname] = hist_feat
        return test_trans_df

df = pd.DataFrame({'cat_feat': ['A', 'A', 'B', 'A', 'B', 'A'],
                   'label': [0, 1, 0, 2, 1, 2]})
encode_feat_name = 'cat_feat'
label_name = 'label'
he = hist_encoder(df, encode_feat_name, label_name)
he.fit()
df_trans = he.transform(df)
>>> df
  cat_feat  label
0        A      0
1        A      1
2        B      0
3        A      2
4        B      1
5        A      2
>>> df_trans
  cat_feat  label  cat_feat_en0  cat_feat_en1  cat_feat_en2
0        A      0          0.25          0.25           0.5
1        A      1          0.25          0.25           0.5
2        B      0          0.50          0.50           0.0
3        A      2          0.25          0.25           0.5
4        B      1          0.50          0.50           0.0
5        A      2          0.25          0.25           0.5

WOE Encoder

WOE (weight of evidence) encoding applies to binary classification tasks; it measures the predictive power of an independent variable with respect to the dependent variable. Because it evolved from the world of credit scoring, it is usually described as a measure of how well a feature separates good customers from bad customers: "bad customers" are those who default on their loans, and "good customers" are those who repay them.

WOE has several problems:

  1. The denominator may be 0.
  2. It ignores the size of each category: a category with many samples can end up with the same WOE as a category with only a few samples.
  3. It only applies to binary classification problems.
  4. The WOE computed on the training set may differ from that on the test set (a common issue for statistics-based encodings).
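A minimal sketch of WOE encoding follows, on a made-up toy DataFrame (the column names cat_feat and label are only for illustration). The manual computation uses the classic credit-scoring definition WOE = ln(% of goods in the category / % of bads in the category); the WOEEncoder in category_encoders uses a slightly different, regularized formulation, which also sidesteps the zero-denominator problem:

import category_encoders as ce
import numpy as np
import pandas as pd

# Toy data: label 1 = "bad customer" (defaults), label 0 = "good customer" (repays).
df = pd.DataFrame({'cat_feat': ['A', 'A', 'B', 'A', 'B', 'A'],
                   'label':    [1, 0, 1, 0, 0, 0]})

# Manual WOE per category (classic definition; fine here because no denominator is 0).
goods = df[df['label'] == 0].groupby('cat_feat').size() / (df['label'] == 0).sum()
bads = df[df['label'] == 1].groupby('cat_feat').size() / (df['label'] == 1).sum()
woe_manual = np.log(goods / bads)

# Library version: regularization protects rare categories and zero denominators.
woe_encoder = ce.WOEEncoder(cols=['cat_feat']).fit(df[['cat_feat']], df['label'])
df_trans = woe_encoder.transform(df[['cat_feat']])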

Target Encoder

Target encoding is a supervised method, suitable for both classification and regression tasks, and it can handle high-cardinality, unordered categorical features.

The advantage of target encoding is that it blends the prior probability and the posterior probability of the target into the code, but because these probabilities are computed directly from the labels, it can lead to over-fitting.
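As a rough illustration of how the prior and posterior are blended, here is a minimal sketch of one common smoothing scheme; the toy data and the smoothing weight m are made up, and the exact formula inside category_encoders' TargetEncoder differs slightly:

import pandas as pd

df = pd.DataFrame({'cat_feat': ['A', 'A', 'B', 'A', 'B', 'A'],
                   'label':    [0, 1, 0, 1, 1, 1]})

m = 10                              # smoothing weight (hypothetical value)
prior = df['label'].mean()          # prior probability of the target
stats = df.groupby('cat_feat')['label'].agg(['count', 'mean'])

# Blend the per-category (posterior) mean with the global (prior) mean:
# the fewer samples a category has, the more it shrinks toward the prior.
encoding = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)

The category_encoders library provides a ready-made TargetEncoder: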

from category_encoders import TargetEncoder
import pandas as pd
df = pd.DataFrame({'cat_feat':['A', 'A', 'B', 'A', 'B', 'A'],
'label':[0, 1, 0, 1, 1, 1]})
enc = TargetEncoder(cols=['cat_feat']).fit(df, df['label'])
df_trans = enc.transform(df)
>>> df_trans
   cat_feat  label
0  0.746048      0
1  0.746048      1
2  0.544824      0
3  0.746048      1
4  0.544824      1
5  0.746048      1

Mean Encoder

Mean encoding is an improved version of target encoding. Its two changes are as follows:

  1. The weighting formula: in practice there is no essential difference; you can adjust the weighting parameters in the function yourself.
  2. Because target encoding uses the labels, mean encoding adds a K-fold scheme to ease the over-fitting the encoding can cause. With 5 folds, folds 1 to 4 are used to fit the encoder and fold 5 is transformed, then the roles rotate, so the categorical feature is encoded five times (see the sketch after this list). The downside is that it is time-consuming.
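A minimal sketch of the K-fold idea, assuming made-up data and column names (cat_feat, label); the fold count and the fallback to the global mean are my own choices for illustration:

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({'cat_feat': ['A', 'A', 'B', 'A', 'B', 'A', 'B', 'A'],
                   'label':    [0, 1, 0, 1, 1, 1, 0, 0]})

global_mean = df['label'].mean()
df['cat_feat_mean_enc'] = global_mean   # fallback for categories unseen in a fold

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, trans_idx in kf.split(df):
    # Fit the category -> mean(label) mapping on the other folds only ...
    fold_means = df.iloc[fit_idx].groupby('cat_feat')['label'].mean()
    # ... and apply it to the held-out fold, so a row never sees its own label.
    df.loc[df.index[trans_idx], 'cat_feat_mean_enc'] = (
        df.iloc[trans_idx]['cat_feat'].map(fold_means).fillna(global_mean).values
    )

A smoothing weight between the per-category mean and the global prior (point 1 above) can be added inside the loop in the same way as for target encoding.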

Model Encoder

At present, among GBDT models, only LightGBM and CatBoost come with their own categorical encoding. LightGBM's built-in handling is based on gradient statistics (GS).

According to the official documentation, the GS-based treatment is about 8 times faster than one-hot encoding. The documentation also suggests that when a categorical variable has high cardinality, it usually performs better to convert the feature to a numerical type, even by simply ignoring the category meaning or by embedding it in a low-dimensional numerical space. In my own practice, I usually let the model encode unordered categorical variables, while ordered categorical variables can be label encoded directly according to their order.

Although LightGBM's GS-based handling of categorical features looks powerful, it has two problems:

  1. Long computation time: gradient statistics must be computed for every category value in every boosting round.
  2. Large memory consumption: for each split, it stores which samples of a given categorical feature fall into which leaf node.

Therefore, CatBoost uses ordered target statistics (Ordered TS), which keeps the space and speed advantages of target statistics while using the ordered mode to alleviate the prediction shift problem.
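A minimal sketch of letting the model encode the category itself, using LightGBM's Python API on a made-up toy frame (column names and parameters are illustrative only):

import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({'cat_feat': ['A', 'A', 'B', 'A', 'B', 'A'],
                   'num_feat': [1.0, 2.0, 0.5, 1.5, 0.7, 2.2],
                   'label':    [0, 1, 0, 1, 1, 1]})

# LightGBM handles the column natively when it has the pandas 'category' dtype
# and/or is listed in categorical_feature.
df['cat_feat'] = df['cat_feat'].astype('category')
train_set = lgb.Dataset(df[['cat_feat', 'num_feat']], label=df['label'],
                        categorical_feature=['cat_feat'])
params = {'objective': 'binary', 'min_data_in_leaf': 1, 'verbose': -1}
model = lgb.train(params, train_set, num_boost_round=10)

CatBoost offers the same convenience through the cat_features argument of its estimators.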

Summary

With regard to categorical features, a few practical take-aways:

  1. Statistics-based encodings are often unsuitable for small samples, because the statistics are not significant.
  2. When the training and test distributions differ, statistics-based encodings often suffer from prediction shift, so K-fold (cross-validation style) encoding is generally considered.
  3. Encodings that produce many columns are not suitable for high-cardinality features, because they bring sparsity and extra training cost.
  4. There is no perfect encoding method; in practice, label encoding, mean encoding, WOE encoding and model encoding are the ones most commonly used.

Translated from this blog post: https://mp.weixin.qq.com/s/3B_6M1gwlVM7IhOpW4GTew

Authorized by 宅码

