Decoding the Encoding of Categorical Variables

Sanya Jain
3 min read · Oct 16, 2020


While pre-processing a data set, data scientists have to make a few decisions about its categorical features. Chief among them: whether to use label encoding or one-hot encoding.

Label encoding converts categorical variables into a numeric, machine-readable form. For example, a sex feature can be encoded as 0 for male and 1 for female.
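As a minimal sketch of this idea (the column name and values here are hypothetical, chosen to match the example above), label encoding can be done with an explicit pandas mapping:

```python
import pandas as pd

# Hypothetical data: a "sex" column with two categories
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Label encoding via an explicit mapping: male -> 0, female -> 1
df["sex_encoded"] = df["sex"].map({"male": 0, "female": 1})

print(df["sex_encoded"].tolist())  # [0, 1, 1, 0]
```

scikit-learn's `LabelEncoder` does the same thing automatically, but assigns codes in alphabetical order rather than letting you choose the mapping.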

  • Label encoding is used when the data is ordinal, i.e. there is a ranked relationship between the categories. For example, categories like average, good and outstanding can be labeled as 0, 1 and 2 respectively. The model will then assign more weight/importance to good than to average, and similarly more weight/importance to outstanding than to good.
  • Another good reason to use label encoding is when a feature has so many categories that one-hot encoding it would add a large number of columns to the data set and consume a lot of memory.

It follows that label encoding can mislead the algorithm if the categories do not have any order or relationship between them: the model would treat the arbitrary numeric codes as if they carried meaning.

So to avoid this issue we use One-Hot encoding.

One-hot encoding creates as many columns as there are categories in the feature and puts a 0 or 1 in each column. Let us take an example:
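The original example table is not reproduced here, but it can be rebuilt with `pd.get_dummies` on a hypothetical `region` column with the four categories discussed below:

```python
import pandas as pd

# Hypothetical data: a "region" feature with four categories
df = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# One-hot encoding: one column per category, 1 where the row matches
dummies = pd.get_dummies(df["region"], prefix="region").astype(int)
print(dummies)
```

The first row has a 1 in `region_southwest` and 0 everywhere else; each subsequent row has a 1 only in its own region's column.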

Interpretation: the first row represents the southwest region, so the southwest column holds a 1 and all the other columns hold a 0.

But One-hot encoding comes with its own limitation called the dummy variable trap.

One-hot encoding can lead to multi-collinearity. Multi-collinearity occurs when two or more independent features or predictors are highly correlated with each other. In order to avoid multi-collinearity we must drop one of those features.
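With one-hot encoding the collinearity is exact, not just high: the dummy columns in every row sum to 1, so any one column equals 1 minus the sum of the others. A quick check (using the same hypothetical `region` data as above):

```python
import pandas as pd

df = pd.DataFrame(
    {"region": ["southwest", "southeast", "northwest", "northeast", "southwest"]}
)
dummies = pd.get_dummies(df["region"]).astype(int)

# Every row's dummies sum to exactly 1, so each column is fully
# determined by the others: perfect multi-collinearity.
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```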

Let us understand multi-collinearity with a real-life example.


A band has two singers with almost the same voice, pitch and speed. We can remove one of them, because having both singers does not bring any significant change to the band's performance. By removing one singer we cut costs without having any effect on the performance.

To solve the problem of multi-collinearity, we should drop one category from the one-hot encoded feature set.

For the region feature, this means dropping the first category, northeast, from the encoded data.

This can be done using the following code:
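The original code snippet is not reproduced here; a sketch of it, assuming the same hypothetical `region` column as above, uses `pd.get_dummies` with `drop_first=True`:

```python
import pandas as pd

df = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# drop_first=True drops the first (alphabetical) level, here "northeast",
# leaving k-1 dummy columns for k categories
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(list(dummies.columns))
# ['region_northwest', 'region_southeast', 'region_southwest']
```

A row of all zeros now implicitly means "northeast", so no information is lost.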

Remember, drop_first=True always drops the first (alphabetically ordered) category of each encoded feature.
