Dealing with categorical features with high cardinality: Feature Hashing

Kumar Vishwesh
Published in Flutter Community · 4 min read · Jun 9, 2020

Many machine learning algorithms are not able to use non-numeric data. Usually, these features are represented by strings, and we need some way of transforming them into numbers before using scikit-learn’s algorithms. The different ways of doing this are called encodings.

Categorical data can pose a serious problem if it has high cardinality, i.e., too many unique values.

The central part of the hashing encoder is the hash function, which maps the value of a category into a number. For example, a hash function (call it “H1”) might map “a” to 1, “b” to 2, “c” to 3, “d” to 4, and so on.

Then, sum the values of all the characters in the label together.

hash("abc") = 1+ 2+ 3 = 6 # H1 hash function

Feature hashing maps each category in a categorical feature to an integer within a pre-determined range. Even if we have over 1000 distinct categories in a feature and we set b=10 as the final feature vector size (a pre-determined range), the output feature set will still have only 10 features as compared to 1000 binary features if we used a one-hot encoding scheme.

Hash functions also come with a not-so-nice side effect: they can hash different keys to the same integer value (this is known as a ‘collision’). Collisions are certain to happen in this case, since we represented 1000 distinct categories with only 10 columns.

You have to trade off the number of categories mapped to the same integer value (the collision rate) against the final feature vector size b (the pre-determined range). Here, b is nothing but n_components.
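To see the trade-off concretely, here is a sketch using Python’s hashlib (the category names are made up): 1000 distinct categories are forced into b=10 columns, so on average about 100 categories collide into each column.

import hashlib

def bucket(category, b):
    # map a category string to a column index in [0, b-1]
    return int(hashlib.md5(category.encode()).hexdigest(), 16) % b

categories = ['cat_%d' % i for i in range(1000)]  # 1000 distinct categories
b = 10  # n_components: the pre-determined range
indices = [bucket(c, b) for c in categories]
print(len(set(indices)))  # at most 10 distinct columns are ever used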

The idea of “feature hashing”:

Convert data into a vector of features. Because this is done using hashing, the method is called “feature hashing” or “the hashing trick”.

How it works:

A simple example using text as data.

Let’s say our text is: “Everything is well planned”

We would like to represent this as a vector. First, we need to fix the length of the vector, i.e., the number of dimensions we are going to use. Let’s say we would like to use 5 dimensions (this is the n_components parameter, discussed below, where we show how to use it in Python).

Once we fix the number of dimensions, we need a hash function (the hash_method parameter, discussed below) that takes a string and returns a number between 0 and n-1, in our case between 0 and 4. Any good hash function can be used: just compute h(string) mod n to get a number between 0 and n-1.

I’ll invent the results for each word in our text:

Please note that when dividing by n, the remainder can only range from 0 to n-1. So if we divide any number by 5, the remainder can only be 0, 1, 2, 3, or 4.

h(everything) mod 5 = 0
h(is) mod 5 = 1
h(well) mod 5 = 1
h(planned) mod 5 = 3

Once we have these indices, we can construct our 5-dimensional vector by counting how many words land in each slot:

(1, 2, 0, 1, 0)

Index 0 gets one hit (“everything”), index 1 gets two (“is” and “well”), index 3 gets one (“planned”), and indices 2 and 4 stay at 0.
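Here is a sketch of the whole procedure with hashlib. Since the h(word) values above were invented for illustration, a real hash function will scatter the words into different slots, but the mechanics are identical:

import hashlib

def h_mod(word, n):
    # any good hash function works; take it mod n for an index in [0, n-1]
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % n

n = 5  # the number of dimensions (n_components)
vector = [0] * n
for word in "everything is well planned".split():
    vector[h_mod(word, n)] += 1  # count how many words land in each slot

print(vector)  # five slots whose counts sum to 4, one per word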

Another example, a pictorial representation:

n_components=8 and hash function = MurmurHash3

The index value here is nothing but a remainder. Each index value becomes a feature for the given dataset. You will see in the last section (Example: Python Code) of this article that we have 8 new columns generated out of a categorical column: col_0 to col_7.

The indices 0 to 7 (shown in the above image) can be represented as columns col_0 to col_7, each containing 1 or 0, i.e., 8 columns of binary values.

In the above image, “Sentence” is a categorical column.

Replace “Sentence” with col_0 to col_7 (indices 0 to 7), each containing either 1 or 0, as in the sketch below.
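A sketch of that replacement for a single categorical column, using md5 for simplicity instead of the MurmurHash3 shown in the image (the sentences are made up):

import hashlib
import pandas as pd

def one_hot_via_hash(value, n_components=8):
    # put a 1 in the hashed column and 0 everywhere else
    idx = int(hashlib.md5(str(value).encode()).hexdigest(), 16) % n_components
    row = [0] * n_components
    row[idx] = 1
    return row

df = pd.DataFrame({'Sentence': ['the quick brown fox', 'data science is fun']})
hashed = pd.DataFrame([one_hot_via_hash(v) for v in df['Sentence']],
                      columns=['col_%d' % i for i in range(8)])
print(hashed)  # 8 binary columns; exactly one 1 per row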

Run the code snippet provided in the last section and interpret the results, i.e., the transformation/encoding; reread the article if things are not crystal clear.

A Python library comes to the rescue:

A set of scikit-learn-style transformers for encoding categorical variables into numeric representations using different techniques. Ref: https://pypi.org/project/category-encoders/

From this library: http://contrib.scikit-learn.org/category_encoders , we have the following class (a reference Python code example appears in the last section of this article):

class category_encoders.hashing.HashingEncoder(max_process=0, max_sample=0, verbose=0, n_components=8, cols=None, drop_invariant=False, return_df=True, hash_method='md5')

It has two important arguments:

hash_method: str

Which hashing method to use. Any method from hashlib works.

You can use a hashing algorithm (hash function) of your choice; the default is “md5”.

n_components: int

How many bits to use to represent the feature. By default, 8 bits are used. For high-cardinality features, consider using up to 32 bits.
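For example, to hash into 16 columns with SHA-256 instead of the md5 default (the toy column and values are made up):

import pandas as pd
from category_encoders.hashing import HashingEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue']})
he = HashingEncoder(cols=['color'], n_components=16, hash_method='sha256')
print(he.fit_transform(df).shape)  # (3, 16): sixteen hashed columns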

The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
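A quick sketch of that behaviour (the city names are made up): fit on a few categories, then score a row containing a category the encoder has never seen.

import pandas as pd
from category_encoders.hashing import HashingEncoder

train = pd.DataFrame({'city': ['london', 'paris', 'berlin']})
he = HashingEncoder(cols=['city'], n_components=4).fit(train)

new = pd.DataFrame({'city': ['tokyo']})  # unseen during fit
print(he.transform(new))  # still encodes; hashing keeps no dictionary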

There is a trade-off, though: if you want interpretability, i.e., the contribution of each of your category levels, you won’t be able to give a good answer, since several levels can end up sharing the same column.

Example: Python Code

from category_encoders.hashing import HashingEncoder
import pandas as pd
from sklearn.datasets import load_boston

# load the Boston housing dataset as a DataFrame
bunch = load_boston()
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
y = bunch.target

# hash-encode the two discrete columns, CHAS and RAD
he = HashingEncoder(cols=['CHAS', 'RAD']).fit(X, y)
data = he.transform(X)
print(data.info())
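In the printed output, the original CHAS and RAD columns are gone, replaced by the eight hashed columns col_0 to col_7 (n_components defaults to 8).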
