Types of Categorical Data Encoding Schemes

Lakshay Arora
Published in Analytics Vidhya
5 min read · Aug 22, 2019

We know that most Machine Learning libraries accept only numerical data, so it is essential to convert the categorical variables present in our dataset into numbers. We cannot simply drop them, as they often hide a lot of interesting information, and it is crucial to learn the methods for dealing with such variables.

Types of Encoding Schemes

Ordinal Encoding or Label Encoding

It is used to transform non-numerical (nominal or ordinal categorical) labels into numerical labels. The numerical labels are always between 1 and the number of classes.

The integer labels chosen for the categories carry no relationship between them, so categories that are related or close to each other lose that information after encoding. The first unique value in your column becomes 1, the second becomes 2, the third becomes 3, and so on.

import pandas as pd
import category_encoders as ce
data = pd.DataFrame({
'city' : ['delhi', 'hyderabad', 'delhi', 'delhi', 'gurgaon', 'hyderabad']
})
# create an object of the OrdinalEncoding
ce_ordinal = ce.OrdinalEncoder(cols=['city'])
# fit and transform and you will get the encoded data
ce_ordinal.fit_transform(data)
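Under the hood, this first-seen mapping is simple to reproduce. A minimal plain-Python sketch (not the library's actual implementation):

```python
def ordinal_encode(values):
    # assign 1, 2, 3, ... to categories in order of first appearance
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return [mapping[v] for v in values]

cities = ['delhi', 'hyderabad', 'delhi', 'delhi', 'gurgaon', 'hyderabad']
print(ordinal_encode(cities))  # [1, 2, 1, 1, 3, 2]
```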

One Hot Encoding

Here, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors (columns) depends on the number of distinct categories in our dataset. For high-cardinality features, this method produces a lot of columns, which slows down the learning of the model significantly.

data = pd.DataFrame({
'gender' : ['M', 'F', 'M', 'F', 'F']
})
# create an object of the OneHotEncoder
ce_OHE = ce.OneHotEncoder(cols=['gender'])
# fit and transform and you will get the encoded data
ce_OHE.fit_transform(data)
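As a sketch of what this produces, here is one-hot encoding in plain Python, with one column per distinct category; ordering columns by first appearance is an assumption here, and the library may order them differently:

```python
def one_hot(values):
    # collect distinct categories in order of first appearance
    cats = []
    for v in values:
        if v not in cats:
            cats.append(v)
    # one row per value, one column per category: 1 marks a match
    return [[1 if v == c else 0 for c in cats] for v in values]

print(one_hot(['M', 'F', 'M', 'F', 'F']))
# [[1, 0], [0, 1], [1, 0], [0, 1], [0, 1]]
```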

Binary Encoding

First, the categories are encoded as ordinal, then those integers are converted into binary code, and finally the digits of the binary code are split into separate columns.

This is useful when you have a large number of categories and one-hot encoding would increase the dimensionality, which in turn increases model complexity. Binary encoding is therefore a good choice for encoding categorical variables into a smaller number of dimensions.

# make some data
data = pd.DataFrame({
'class' : ['a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k', 'h', 'i', 's', 'p', 'z']
})
# create object of BinaryEncoder
ce_binary = ce.BinaryEncoder(cols = ['class'])
# fit and transform and you will get the encoded data
ce_binary.fit_transform(data)
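The dimensionality savings come from the fact that the number of binary columns grows only logarithmically with the number of categories. A rough sketch of the column counts, assuming the ordinal codes run from 1 to n (the exact number of columns BinaryEncoder emits may differ slightly):

```python
import math

def binary_cols(n_categories):
    # bits needed to represent ordinal codes 1..n in base 2
    return max(1, math.ceil(math.log2(n_categories + 1)))

# the sample data above has 12 distinct classes:
# one-hot would need 12 columns, binary needs only 4
for n in (12, 100, 1000):
    print(n, 'categories -> one-hot:', n, 'cols, binary:', binary_cols(n), 'cols')
```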

BaseN Encoding

In binary encoding, we convert the integers into binary, i.e. base 2. BaseN allows us to convert the integers using any value of the base. So, if you have something like city_name in your dataset, which could take thousands of distinct values, it is advisable to use BaseN, as it reduces the dimensionality even further than binary encoding.

You can set the parameter base. Here, I have encoded using base values 4 and 3 on a sample dataset.

# make some data
data = pd.DataFrame({
'class' : ['a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k', 'h', 'i', 's', 'p', 'z']
})
# create an object of the BaseNEncoder
ce_baseN4 = ce.BaseNEncoder(cols=['class'],base=4)
# fit and transform and you will get the encoded data
ce_baseN4.fit_transform(data)
# create an object of the BaseNEncoder
ce_baseN3 = ce.BaseNEncoder(cols=['class'],base=3)
# fit and transform and you will get the encoded data
ce_baseN3.fit_transform(data)
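The digit expansion behind BaseN can be sketched in a few lines; `to_base_digits` is a hypothetical helper for illustration, not part of category_encoders:

```python
def to_base_digits(n, base, width):
    # digits of n in the given base, most significant first,
    # zero-padded on the left to the given width
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    digits += [0] * (width - len(digits))
    return digits[::-1]

# ordinal code 17 in base 4: 17 = 1*16 + 0*4 + 1
print(to_base_digits(17, 4, 3))  # [1, 0, 1]
# ordinal code 5 in base 3: 5 = 1*3 + 2
print(to_base_digits(5, 3, 2))   # [1, 2]
```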

Hashing

Hashing is the process of transforming a string of characters into a (usually shorter) fixed-length value using an algorithm that represents the original string.

It uses the md5 algorithm to convert each category into a fixed number of columns, which we define using the parameter n_components. If you set the parameter to 5, then it doesn't matter to the algorithm whether your column has 7 or 700 distinct categories: each value is hashed into one of 5 buckets, finally giving us 5 columns representing our categorical values.

Let’s try this out on a sample data:

data = pd.DataFrame({
'color' : ['Yellow', 'Black', 'Green', 'Blue', 'Blue', 'Green', 'Black', 'Blue']
})
# create an object of the HashingEncoder
ce_HE = ce.HashingEncoder(cols=['color'],n_components=5)
# fit and transform and you will get the encoded data
ce_HE.fit_transform(data)
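A simplified illustration of the idea, using Python's hashlib directly: hash the string, take the result modulo n_components, and one-hot the resulting bucket. This is a sketch of the hashing trick, not HashingEncoder's exact internals, so its output will not match the library's:

```python
import hashlib

def hash_column(value, n_components=5):
    # md5-hash the category string and bucket it mod n_components,
    # then one-hot the bucket index
    bucket = int(hashlib.md5(value.encode()).hexdigest(), 16) % n_components
    return [1 if i == bucket else 0 for i in range(n_components)]

for color in ['Yellow', 'Black', 'Green', 'Blue']:
    print(color, hash_column(color))
```

Note that unrelated categories can land in the same bucket (a hash collision); with n_components much smaller than the cardinality, some information loss is unavoidable.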

Target Encoding

Here, features are replaced with a blend of the posterior probability of the target given the particular categorical value and the prior probability of the target over all the training data. The encodings are not re-computed for the test data: we save the target encodings obtained from the training dataset and use the same encodings to encode the features in the test dataset.

data = pd.DataFrame({
'color' : ['Blue', 'Black', 'Black','Blue', 'Blue'],
'outcome' : [1, 2, 1, 1, 2]
})
# column to perform encoding
X = data['color']
Y = data['outcome']
# create an object of the TargetEncoder
ce_TE = ce.TargetEncoder(cols=['color'])
# fit and transform and you will get the encoded data
ce_TE.fit(X,Y)
ce_TE.transform(X)
test_data = pd.DataFrame({
'color' : ['Blue', 'Black', 'Black'],
})
ce_TE.transform(test_data)
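The blend can be verified by hand. Below is a simplified smoothed mean with a constant smoothing weight; this is an assumption for illustration, as category_encoders' TargetEncoder uses a sigmoid-shaped weighting controlled by its min_samples_leaf and smoothing parameters, so its numbers will differ:

```python
def target_encode(categories, targets, smoothing=1.0):
    # blend each category's mean target (posterior) with the
    # global mean (prior), weighted by the category's count
    prior = sum(targets) / len(targets)
    stats = {}
    for c, t in zip(categories, targets):
        n, s = stats.get(c, (0, 0.0))
        stats[c] = (n + 1, s + t)
    return {c: (s + smoothing * prior) / (n + smoothing)
            for c, (n, s) in stats.items()}

colors = ['Blue', 'Black', 'Black', 'Blue', 'Blue']
outcome = [1, 2, 1, 1, 2]
# prior = 1.4; Blue: (4 + 1.4) / 4 = 1.35; Black: (3 + 1.4) / 3 = 1.4667
print(target_encode(colors, outcome))
```

A rare category (small count) is pulled strongly toward the prior, which is exactly what protects target encoding from overfitting on categories with few examples.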

Leave One Out

This is very similar to target encoding, but it excludes the current row's target when calculating the mean target for a level, which reduces the effect of outliers.

data = pd.DataFrame({
'color' : ['Blue', 'Black', 'Black','Blue', 'Blue'],
'outcome' : [2, 1, 1, 1, 2]
})
# column to perform encoding
X = data['color']
Y = data['outcome']
# create an object of the TargetEncoder
ce_TE = ce.LeaveOneOutEncoder(cols=['color'])
# fit and transform and you will get the encoded data
ce_TE.fit_transform(X,Y)
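The leave-one-out mean is easy to check by hand; a minimal sketch, assuming a category seen only once falls back to the global mean (the library's handling of singletons and its noise options may differ):

```python
def leave_one_out(categories, targets):
    # for each row, average the targets of the OTHER rows
    # sharing its category
    prior = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return [(sums[c] - t) / (counts[c] - 1) if counts[c] > 1 else prior
            for c, t in zip(categories, targets)]

colors = ['Blue', 'Black', 'Black', 'Blue', 'Blue']
outcome = [2, 1, 1, 1, 2]
# Blue targets are 2, 1, 2: row 0 -> (1 + 2) / 2 = 1.5, row 3 -> 2.0, row 4 -> 1.5
print(leave_one_out(colors, outcome))  # [1.5, 1.0, 1.0, 2.0, 1.5]
```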

Conclusion

In this article, we covered various techniques for dealing with categorical variables in a dataset. I hope you found this article useful. You can always reach out to me through the comment section below!
