Encoding Categorical Features with High Cardinality

Katerina Fomkina
4 min read · Aug 25, 2022


Most ML models accept only numerical values as input and cannot work with categorical features directly, so we need to transform categorical features into numbers before feeding them to the model.

The performance of ML algorithms depends on how categorical features are encoded. There are many types of encoders that can help us, but not all of them can handle features with high cardinality.

Binning

Having too many categories can lead to problems when training and testing the model, e.g. the curse of dimensionality, or the model having no idea how to handle an unseen category. One way to solve these issues is to engineer new features with fewer categories, for example by combining several categories into one.

It’s all about binning. We pick the top categories in the feature (either by grouping similar values through some analysis or by sorting on frequency) and place the others in an “Others” category, or simply drop them as insignificant. After this reduction of cardinality we can use OHE, provided the number of bins is not too big, because otherwise we end up with a highly sparse dataset.
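As a minimal sketch, here is one way to do frequency-based binning with pandas and then one-hot encode the result; the column name, toy values and top-k threshold are hypothetical.

```python
import pandas as pd

# Hypothetical example: a 'city' column with many rare values
df = pd.DataFrame({"city": ["London", "Paris", "Berlin", "Paris", "London",
                            "Oslo", "Lima", "Paris", "London", "Quito"]})

# Keep only the top-k most frequent categories and bin the rest into "Others"
top_k = 3
top_categories = df["city"].value_counts().nlargest(top_k).index
df["city_binned"] = df["city"].where(df["city"].isin(top_categories), "Others")

# After binning, one-hot encoding produces at most top_k + 1 columns
encoded = pd.get_dummies(df["city_binned"], prefix="city")
print(encoded.head())
```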

➕ Binning categorical data reduces cardinality and the number of columns created during encoding, so there is less chance of overfitting.

➖ We can lose information or create wrong bins of data, which can lead to poor model performance.

Hash Encoding

The idea of this encoder is to convert data into a vector of features using a hash function, which takes a string and returns a number between 0 and n-1.

Specifically, feature hashing maps each category in a feature to an integer within a predetermined range. Even if we have over 1000 distinct categories in a feature and we set 8 as the final feature vector size, the output feature set will still have only 8 features.
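A minimal sketch of this idea, assuming the category_encoders library; the feature name is hypothetical and the vector size of 8 mirrors the example above.

```python
import pandas as pd
import category_encoders as ce

# Hypothetical high-cardinality feature with 1000 distinct categories
df = pd.DataFrame({"user_id": [f"user_{i}" for i in range(1000)]})

# Hash every category into a fixed-size vector of 8 components,
# no matter how many distinct categories exist
encoder = ce.HashingEncoder(cols=["user_id"], n_components=8)
hashed = encoder.fit_transform(df)
print(hashed.shape)  # (1000, 8)
```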

➕ The hash encoder does not maintain a dictionary of observed categories.

➕ The encoder does not grow in size and accepts new values during data scoring by design.

➖ Collisions: the encoder can hash different keys to the same integer value. Thus, this method involves a trade-off between the share of categories mapped to the same integer value (the % of collisions) and the final feature vector size (n_components).

➖ The hash encoder is slow compared to other encoders.

Target Encoding

In this method each category is encoded according to the effect it has on the target variable y. The encoder takes into account that a categorical variable may contain rare categories and, in addition, it takes an empirical Bayes approach to shrink their estimates.

➕ This method is really easy and powerful.

➕ Encoder picks up values that can explain the target.

➖ Using the mean as a predictor for the entire distribution is imperfect, because we train models on a portion of the data, and the mean of that portion is not necessarily the mean of the entire population.

➖ Huge disadvantage: target leakage. The model will learn from a variable that itself contains the target, and this can lead to overfitting. There are various ways to handle this: increase regularization, add random noise to the categories in the training dataset, use double validation, or use additive smoothing.
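A minimal sketch of smoothed target encoding, assuming the category_encoders library; the toy data and the smoothing value are illustrative, and the smoothing parameter shrinks rare categories toward the global target mean.

```python
import pandas as pd
import category_encoders as ce

# Toy data: a categorical feature and a binary target (illustrative only)
df = pd.DataFrame({
    "city": ["London", "Paris", "Paris", "Berlin", "London", "Berlin", "Paris"],
    "target": [1, 0, 1, 0, 1, 1, 0],
})

# Each category is replaced by a smoothed mean of the target;
# `smoothing` controls how strongly rare categories shrink toward the global mean
encoder = ce.TargetEncoder(cols=["city"], smoothing=10)
encoded = encoder.fit_transform(df[["city"]], df["target"])
df["city_encoded"] = encoded["city"]
print(df)
```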

CatBoost

This technique uses various statistics on combinations of categorical features and on combinations of categorical and numerical features. The encoder replaces a categorical feature with the average target value corresponding to that category, combined with the target probability over the entire dataset.

➕ CatBoost avoids the problem of leakage using a principle similar to time series validation: the target statistic for the current row is calculated only from the observations before it. It also avoids high variance for earlier training examples by using different permutations.

➕ The encoder prevents overfitting by repeating the process of target encoding multiple times on shuffled versions of the dataset and averaging the results.
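A minimal sketch, assuming the category_encoders implementation of this scheme (CatBoostEncoder); note that, unlike the CatBoost library itself, this standalone encoder relies on the current row order as its single permutation, and the toy data and column names are purely illustrative.

```python
import pandas as pd
import category_encoders as ce

# Toy data: a categorical feature and a binary target (illustrative only)
df = pd.DataFrame({
    "city": ["London", "Paris", "Paris", "Berlin", "London", "Berlin", "Paris"],
    "target": [1, 0, 1, 0, 1, 1, 0],
})

# Ordered target statistics: each row's encoding uses only the target
# values of the rows that come before it, which limits target leakage
encoder = ce.CatBoostEncoder(cols=["city"])
encoded = encoder.fit_transform(df[["city"]], df["target"])
df["city_encoded"] = encoded["city"]
print(df)
```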

Categorical Embeddings

This technique comes from deep learning and NLP: it translates large sparse vectors into a lower-dimensional space that preserves semantic relationships.

An embedding is a matrix in which each column is the vector that corresponds to an item in the vocabulary. To get the dense vector for a single vocabulary item, we retrieve the column corresponding to that item. The general idea of the technique is that distances between the dense vectors carry meaning.

In theory we can use PCA, word2vec and other embedding techniques to reduce a high-dimensional space to a low-dimensional one. Whether or not to use embeddings depends on the dataset and the goals, and the only way to know for sure is to try it.
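A minimal sketch of a learned categorical embedding, assuming PyTorch; the category count and embedding size are hypothetical, and in practice the embedding layer would be trained as part of a larger model.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 1000 distinct categories embedded into 8 dimensions
num_categories = 1000
embedding_dim = 8

# Each category index maps to a learnable dense vector; the vectors are
# trained jointly with the rest of the model and end up encoding similarity
embedding = nn.Embedding(num_embeddings=num_categories, embedding_dim=embedding_dim)

# Encode a small batch of category indices
category_ids = torch.tensor([3, 17, 999])
dense_vectors = embedding(category_ids)
print(dense_vectors.shape)  # torch.Size([3, 8])
```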

How to encode categorical features with high cardinality depends on whether the problem is a regression or a classification one. The answer also depends on the type of variables (ordinal or nominal) and on the goals. Data scientists still need to experiment to find the best solution for their particular case.

☼☽✨
