Categorical Features in Machine Learning

Nikhil · Sep 19, 2021

Categorical variables are usually represented as strings or 'categories' and take on a finite number of values. For example, if you are trying to predict income, the dataset will typically have features such as education, age, gender, and city.

These features might take on values such as the ones in the sample dataset sketched below.
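For illustration, here is a small made-up stand-in for such a dataset. The column names and values are invented for this walkthrough, and the same df is reused in the sketches that follow.

```python
import pandas as pd

# Toy income-prediction dataset (illustrative values only)
df = pd.DataFrame({
    "education": ["High School", "Bachelors", "Masters", "PhD", "Masters"],
    "age":       [41, 25, 32, 29, 37],
    "gender":    ["Female", "Male", "Female", "Male", "Female"],
    "city":      ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
})
print(df)
```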

Can we not use the data as it is?

Not really! Most machine learning models accept only numerical inputs, so preprocessing the categorical variables becomes a necessary step. We need to convert these categorical variables to numbers in a way that lets the model understand them and extract valuable information.

Categorical variables can be divided into two types:

  • Nominal: No inherent order (Ex: Gender, City)
  • Ordinal: There is some inherent order between values (Ex: Education)

Let's walk through the different techniques for dealing with such features.

Label Encoding
Label encoding, also termed ordinal encoding, is used when the underlying feature is ordinal, meaning the order of the values is important. The encoding should therefore reflect that sequence.
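Here is a minimal sketch using scikit-learn's OrdinalEncoder on the toy df above; the education ranking passed in is an assumption made for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# The explicit category list fixes the order, so the codes reflect the ranking:
# High School -> 0, Bachelors -> 1, Masters -> 2, PhD -> 3
ranking = [["High School", "Bachelors", "Masters", "PhD"]]
encoder = OrdinalEncoder(categories=ranking)

df["education_code"] = encoder.fit_transform(df[["education"]]).ravel()
print(df[["education", "education_code"]])
```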

One Hot Encoding
This encoding technique can be used when the feature is nominal (has no order). For each level of the categorical feature, we create a new variable (a dummy variable): each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 the presence of that category.
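A minimal sketch with pandas, applied to the nominal city column of the toy df:

```python
import pandas as pd

# One 0/1 column per distinct city; no order is implied
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)
print(one_hot)
```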

Dummy Encoding
This encoding technique is similar to one-hot encoding: it transforms the categorical variable into a set of binary variables (also known as dummy variables). The difference is size. One-hot encoding uses N binary variables for N categories; dummy encoding is a small improvement that uses only N-1 variables, since the dropped category is implied by a row of all zeros.
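The same pandas call sketches this too; drop_first is what turns one-hot into dummy encoding:

```python
import pandas as pd

# drop_first=True keeps N-1 columns; the dropped city is the all-zeros baseline
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)
print(dummies)
```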

We wish it were that easy! Both techniques, one-hot encoding and dummy encoding, are effective; however, they have two major drawbacks.

  1. Features such as city can have a huge number of unique values, which forces a similar number of dummy variables. For instance, with 30 different cities we would need 30 new variables.
  2. And if the dataset has multiple categorical features, we end up with a large collection of binary features, one set per categorical feature and its many categories.

These two encoding schemes introduce sparsity into the dataset, i.e., many columns filled mostly with 0s and only a few 1s, which is, trust me, not a good thing. The sketch below makes this concrete.
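A hedged sketch of the problem, using a hypothetical high-cardinality city column rather than the toy df:

```python
import numpy as np
import pandas as pd

# Hypothetical high-cardinality feature: 1,000 rows drawn from 30 cities
cities = pd.Series(np.random.choice([f"city_{i}" for i in range(30)], size=1000))
wide = pd.get_dummies(cities, dtype=int)

print(wide.shape)                 # ~(1000, 30): one column per distinct city
print((wide == 0).mean().mean())  # the overwhelming majority of cells are 0
```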

Hash Encoder
The hash encoder represents categorical features in a fixed number of new dimensions. The user chooses the number of dimensions after transformation (the n_components argument in the category_encoders implementation). In simple words, a feature with 5 categories can be represented using N new features, and a feature with 100 categories can likewise be transformed into the same N new features.

Since hashing squeezes the data into fewer dimensions, it may lead to loss of information. The other issue faced by the hashing encoder is collisions: because a large number of categories are mapped into fewer dimensions, multiple values can end up represented by the same hash value.
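A minimal sketch, assuming the category_encoders library (pip install category_encoders) and the toy df from above:

```python
import category_encoders as ce

# 8 hashed output columns, no matter how many distinct cities exist
hash_enc = ce.HashingEncoder(cols=["city"], n_components=8)
hashed = hash_enc.fit_transform(df)
print(hashed.head())
```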

Binary Encoding
Binary encoding is a combination of hash encoding and one-hot encoding. In this scheme, the categorical feature is first converted to integers using an ordinal encoder. The integers are then written in binary, and each binary digit is split out into its own column.
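Again assuming category_encoders and the toy df, a sketch looks like this:

```python
import category_encoders as ce

# Ordinal codes -> binary digits -> one column per bit
bin_enc = ce.BinaryEncoder(cols=["city"])
binary = bin_enc.fit_transform(df)
print(binary.head())
```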

Base N Encoding
Base?? The base, or radix, is the number of digits (or digit-letter combinations) used to represent numbers. For binary encoding the base is 2, which means the ordinal code of each category is converted into its binary form. If you want to change the base of the encoding scheme, you can use the Base N encoder. When there are many categories and binary encoding cannot keep the dimensionality down, we can use a larger base such as 4 or 8.

As an example, take base 5, also known as the quinary system. It works just like the binary encoding example, only with base-5 digits: data that binary encoding represented with 4 new features can be encoded with only 3 new variables, as sketched below.
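A hedged sketch with category_encoders' BaseNEncoder on the toy df (with only a handful of cities the column savings are small, but the mechanics are the same):

```python
import category_encoders as ce

# Same idea as binary encoding, but the digits are base 5 instead of base 2
base5_enc = ce.BaseNEncoder(cols=["city"], base=5)
base5 = base5_enc.fit_transform(df)
print(base5.head())
```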

Hence the Base N encoding technique further reduces the number of features required to represent the data efficiently, improving memory usage. The default base for the Base N encoder is 2, which makes it equivalent to binary encoding.

To summarize,

Encoding categorical data is an unavoidable part of feature engineering. What matters most is choosing the right encoding scheme for the dataset we are working with and the model we are going to use.
