Handling High Cardinality, Out-of-Vocabulary and Cold Start in Categorical Data

Extending Data Representational Machine Learning Design Patterns

Mehul Gupta
Data Science in your pocket

--

Photo by Krisztian Tabori on Unsplash

So, we discussed different methods to scale numerical features in my last post. Extending Data Representational ML Design Patterns, we will now discuss how to handle three major problems we face when representing categorical data in a numerical format →

High Cardinality → If the number of categories in a particular column is very high (say 1000), methods like One-Hot Encoding will blow up, adding 1000 new features. This is something we wish to avoid because of the “Curse of Dimensionality”.

Out Of Vocabulary → Assume you converted categories into numerical features using One-Hot Encoding. After deployment, one more category is introduced. If you know how One-Hot Encoding works, you know OHE will simply break down, since it has no column for a category it never saw during training.

Cold Start → Assume OHE somehow manages to incorporate the newly introduced category; the model you trained will still be seeing that category for the first time during inference and hence produce poor results. How do we tackle this cold start problem for the new category?

Wait a minute, why aren’t we considering Label Encoding in place of OHE?

So Label Encoding basically assigns values from 1 → n if we have ’n’ classes, where each class is assigned a single integer value. Now, the catch with Label Encoding is that it, unknowingly, converts Nominal data into Ordinal data.

For example: If we have Dogs, Cats & Horses as 3 categories, label encoding will assign, say, dogs:1, cats:2, horses:3. Now, given these integer values, dogs & cats (2–1=1) appear closer than dogs & horses (3–1=2), but in reality no such relationship exists. Hence, Label Encoding should be avoided if possible!
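A quick sketch of what this looks like in practice, assuming scikit-learn’s LabelEncoder and a toy animal column (note that scikit-learn assigns 0 to n-1 rather than 1 to n):

```python
from sklearn.preprocessing import LabelEncoder

animals = ["dogs", "cats", "horses", "cats", "dogs"]

le = LabelEncoder()
encoded = le.fit_transform(animals)  # one arbitrary integer per class

print(dict(zip(le.classes_, le.transform(le.classes_))))
# {'cats': 0, 'dogs': 1, 'horses': 2}
# |dogs - cats| = 1 while |horses - cats| = 2:
# a fake ordering/distance that a model can mistakenly learn from
```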

Hence, both Label Encoding and One-Hot Encoding are eliminated for now.

So, coming back to the 3 major problem statements, how do we handle them? We will be discussing two major design patterns to tackle these problems.

Hashing

The approach is very easy

Apply some hashing algorithm on the categorical data. A hashing algorithm usually returns an integer value given a string as input. This algorithm can be MD5, SHA-256, a fingerprint algorithm, etc.

Set up the desired number of buckets. Suppose you have 1000 categories; you set up 100 buckets.

Divide the hash of each string by the desired number of buckets and take the remainder (modulus) of that division as the final representation of the string.

So assume you applied some hashing algorithm “ABC” on the string “New Delhi” and got an output of 1234. Divide it by the desired number of buckets, i.e. 100, and the remainder = 34. This becomes our final numerical representation for the string “New Delhi”.
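Here is a minimal sketch of this idea, using Python’s hashlib (MD5 purely for illustration) and a hypothetical bucket count of 100:

```python
import hashlib

NUM_BUCKETS = 100  # hyperparameter: trade-off between dimensionality and collisions

def hash_bucket(category: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a category string to a stable bucket id in [0, num_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(hash_bucket("New Delhi"))  # always the same bucket for "New Delhi"
print(hash_bucket("Atlantis"))   # unseen categories still get a valid bucket (no OOV crash)
```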

Easy Peasy. Let’s see the pros and cons of this approach

Pros

The issues around high cardinality are resolved, as we have converted 1000 categories into 100 numerical representations. Now you can go on with OHE if you wish. Both OOV and Cold Start are also resolved as

  • For every string, seen or unseen, you will be able to assign a numerical representation, and as that representation cannot fall outside the chosen bucket range, OHE is also able to incorporate it
  • As we are reducing 1000 categories to 100 buckets, some of the categories will get the same numerical representation; hence, initially, the newly added category can be considered similar to the other categories that landed in the same bucket after hashing. I know this is not the ideal situation, but it is better than having no idea at all (Cold Start).

Cons

A major con is that we will have some/many collisions depending on the number of buckets we choose. In the above example, if we choose 100 buckets for 1000 categorical values, we can expect about 10 categories to be assigned the same value. So choosing 1) the bucket size and 2) the hashing algorithm carefully is very important. Treat the bucket size as a hyperparameter.
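To see how the bucket size drives collisions, here is a small sketch reusing the hash_bucket function from above on synthetic category names (the names and bucket sizes are illustrative):

```python
from collections import Counter

categories = [f"city_{i}" for i in range(1000)]  # 1000 synthetic categories

for num_buckets in (50, 100, 500):
    buckets = Counter(hash_bucket(c, num_buckets) for c in categories)
    print(f"{num_buckets:>3} buckets -> "
          f"avg {len(categories) / num_buckets:.1f} categories per bucket, "
          f"fullest bucket holds {max(buckets.values())}")
```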

Embeddings

I think you must have used embeddings at some point if you have worked on NLP problems. The idea here is the same as developing embeddings for text in NLP problems using Word2Vec, BERT or something else. Some major advantages we get from embeddings are

  • As the embedding size is constant, high cardinality is no longer a problem. Whatever string you have, it eventually gets encoded into a fixed-size representation (a minimal sketch follows this list).
  • At times, we may have some relationship between categories. Assume we have an ‘animal’ column with categories like ‘Cheetah’, ‘Elephant’, ‘Leopard’, ‘Dogs’. We know cheetah & leopard should get representations that are closer to each other than to elephant or dogs, as they belong to the cat family, and a learned embedding can capture this.
  • Also, OOV and Cold Start get easily resolved if we use some pretrained model for generating embeddings.
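As a rough sketch (assuming TensorFlow/Keras, a hypothetical vocabulary of 1000 city names, and an embedding size of 8; the names and sizes are illustrative, not from the book), a trainable embedding layer compresses each category into a small dense vector:

```python
import tensorflow as tf

NUM_CATEGORIES = 1000  # assumed cardinality of the column
EMBEDDING_DIM = 8      # fixed-size representation, independent of cardinality

# StringLookup maps raw strings to integer ids; unseen strings fall into an OOV id
lookup = tf.keras.layers.StringLookup(
    vocabulary=[f"city_{i}" for i in range(NUM_CATEGORIES)]
)
# Embedding learns one 8-dimensional vector per id during model training
embed = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(), output_dim=EMBEDDING_DIM
)

ids = lookup(tf.constant(["city_3", "city_42", "never_seen_before"]))
vectors = embed(ids)
print(vectors.shape)  # (3, 8): every category, known or unseen, gets a fixed-size vector
```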

I have already mentioned a few text vectorization algorithms alongside a detailed explanation of BERT, hence skipping this for now.

With this, we saw how we can overcome a few common issues we may face while converting categorical data into a numerical representation using ML Design Patterns. I will be covering some other Design Patterns in my next posts.

A big thanks to the book before wrapping up
