Machine learning: one-hot encoding vs integer encoding

Stéphanie Crêteur
Published in Geek Culture · Dec 16, 2022

Which one is more efficient than the other in a model-building context?

First things first: why did I write this article? While working on a machine learning project, I was confronted with this problem and didn’t know which encoding to choose. So I had to take several minutes of my time to Google the reasons why one form might be preferred over the other. I found some useful information and decided to write this article to share what I learned. Hopefully, it will help others who are facing the same question.


But before starting, what actually are one-hot encoding and integer encoding? Both are ways to encode your data numerically, and they take place during the preprocessing phase. For the computer to understand the input you provide, it must be transformed into numerical values, and this is where one-hot encoding and integer encoding come in. To illustrate this, I’ll use the example that led me to ask myself that very question. Using data from Rotten Tomatoes, I wanted to build a model that predicted whether a film would be classified as ‘Rotten’, ‘Fresh’, or ‘Certified-Fresh’. You can find the complete project with the data on StrataScratch. In my dataframe, I had several features that I had to transform into numerical values, for example the content rating (‘PG’, ‘R’, etc.) and the audience status (‘Spilled’ or ‘Upright’).

Here is my dataframe before using any form of encoding.

There I had the choice between one-hot encoding and integer encoding. In the latter, each category is assigned a unique integer, and those integers represent the categories in the dataset. At the time, that’s the method I used; to be honest, the only reason I favoured it was simply that I didn’t want to have too many columns in my df. To do so, I used LabelEncoder() from sklearn.preprocessing and ended up with this result:

The last two columns are my new categorical data represented as integers. The advantage of this method is that it is pretty straightforward to implement. The code was simply:

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
movies_cl['audience'] = LE.fit_transform(movies_cl['audience_status'])

It also has the advantage of preserving an ordinal relationship; however, that is precisely one of the reasons you shouldn’t be using it here. Integer encoding can introduce a bias into the data by implying a natural ordering between the categories. This can lead to poor performance or unexpected results in machine learning models that use the encoded data. For example, the model may make predictions that are halfway between the encoded categories, which may not accurately reflect the original data. Furthermore, it implies that some categories are more similar to each other than others. For example, if we encode the categories “red”, “green”, and “blue” as 1, 2, and 3 respectively, the encoded values will imply that “red” is closer to “green” than it is to “blue”. This can lead to incorrect assumptions and conclusions.
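To see this implied ordering concretely, here is a minimal sketch using the colour example rather than my actual Rotten Tomatoes data. Note that LabelEncoder assigns the integers alphabetically, so the ordering you end up with is essentially arbitrary:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

le = LabelEncoder()
encoded = le.fit_transform(colors)

# Integers are assigned in alphabetical order: blue=0, green=1, red=2
print(le.classes_.tolist())  # ['blue', 'green', 'red']
print(encoded.tolist())      # [2, 1, 0, 1]
```

A distance-based model would now treat “green” (1) as closer to “blue” (0) than “red” (2) is, even though no such relationship exists in the original data.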

In this case, where there is no ordinal relationship, it would be better to use One-hot encoding. This will create a new column for each category, and each column will have a value of 0 or 1 to indicate the presence or absence of that category in a given sample.

One advantage of one-hot encoding is that it allows the model to learn more easily and effectively. This is because each input value is represented as a binary vector, where only a single element of the vector is set to 1 and the rest are set to 0. This makes it easier for the model to learn the relationship between the input values and the target output because the model can easily distinguish between the different input values based on their unique binary representation.

To do so easily, you can use pandas’ get_dummies(), which will give you this result:

Result from get_dummies()
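As a minimal sketch of what get_dummies() does, here is a toy frame standing in for my actual dataframe (the column name mirrors the audience status feature mentioned above):

```python
import pandas as pd

# Toy data in place of the real Rotten Tomatoes dataframe
df = pd.DataFrame({"audience_status": ["Spilled", "Upright", "Spilled"]})

# One new column per category, with 1/0 (or True/False in recent
# pandas versions) marking presence or absence of that category
dummies = pd.get_dummies(df["audience_status"])
print(dummies.columns.tolist())          # ['Spilled', 'Upright']
print(dummies.astype(int).values.tolist())  # [[1, 0], [0, 1], [1, 0]]
```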

You can also use OneHotEncoder() from sklearn.preprocessing, which is generally preferred in machine learning pipelines, because the encoder is fitted on the training data and can then be applied consistently to new data. You can then put these features into your original dataframe and get rid of the old columns.
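Here is a sketch of the OneHotEncoder workflow, again on toy data rather than the real dataframe. The handle_unknown="ignore" option also illustrates the point about unseen categories discussed below:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"audience_status": ["Spilled", "Upright", "Spilled"]})

# handle_unknown="ignore" maps unseen categories to an all-zero row
# instead of raising an error at prediction time
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(train[["audience_status"]]).toarray()

print(enc.categories_[0].tolist())  # ['Spilled', 'Upright']
print(encoded.tolist())             # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# A category never seen during fitting becomes [0.0, 0.0]
new = enc.transform(pd.DataFrame({"audience_status": ["Unknown"]})).toarray()
print(new.tolist())                 # [[0.0, 0.0]]
```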

So, we can see that it is not really more complicated to implement than integer encoding, and one-hot encoding brings some additional advantages.

  • First, one-hot encoding is often considered to be more expressive than integer encoding, because it can more accurately represent the data and its relationships. Indeed, it can better represent the presence or absence of a category. For example, with the columns ordered as “Red”, “Green”, “Blue”, if a sample belongs to the “Red” and “Green” categories, the one-hot encoded representation of that sample would be [1, 1, 0] (with a 1 in the first and second columns and a 0 in the third column). This accurately reflects the fact that the sample belongs to the “Red” and “Green” categories, but not to the “Blue” category. In contrast, the integer encoded representation of the same sample would be [1, 2], which does not provide any information about the presence or absence of the “Blue” category.
  • The model can also more easily handle new input values that it hasn’t seen before. For example, if the encoder has been fitted on a set of input values that includes “red”, “green”, and “blue”, and then receives a new input value “yellow”, it can still represent this new value as a binary vector (for instance, an all-zero vector with sklearn’s handle_unknown="ignore") and let the model make a prediction based on its training, instead of failing.
  • Finally, one-hot encoding can remain efficient in terms of memory and computational cost, because the binary vectors are mostly zeros and can be stored as sparse matrices, which libraries such as scikit-learn return by default. Note, though, that a dense one-hot vector is longer than a single integer code, so the savings depend on using a sparse representation. This can be especially important for large datasets or complex models.
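To illustrate the sparsity point, here is a small sketch showing that OneHotEncoder returns a scipy sparse matrix by default, storing only the non-zero entries (one per sample) instead of the full dense grid:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# 100 samples drawn from 26 single-letter categories
rng = np.random.default_rng(0)
labels = rng.choice(list("abcdefghijklmnopqrstuvwxyz"), size=100).reshape(-1, 1)

enc = OneHotEncoder()
X = enc.fit_transform(labels)  # sparse matrix by default

print(sparse.issparse(X))  # True
print(X.nnz)               # 100 stored values, one per sample
```

Only 100 values are stored, no matter how many category columns the encoding creates.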

However, you have to note that one-hot encoding might lead to high-dimensional data and to multicollinearity (the so-called dummy variable trap), which might lower the model’s accuracy.
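One common mitigation for the multicollinearity issue is to drop one of the dummy columns, since its value is fully determined by the others. With pandas this is the drop_first parameter (sketched here on the same toy data as before):

```python
import pandas as pd

df = pd.DataFrame({"audience_status": ["Spilled", "Upright", "Spilled"]})

# drop_first=True removes one redundant column: a sample that is not
# 'Upright' must be 'Spilled', so the dropped column adds no information
dummies = pd.get_dummies(df["audience_status"], drop_first=True)
print(dummies.columns.tolist())  # ['Upright']
```

OneHotEncoder offers the equivalent drop="first" option.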

But why did I ask myself so many questions about these two methods? Simply because I saw that the StrataScratch solution used one-hot encoding for the content rating, while I had chosen integer encoding, and I couldn’t understand why it did so there when it used integer encoding for the other categorical values. To be honest, I am not 100% sure that one-hot encoding was absolutely necessary in this case, as my results were as good as StrataScratch’s (probably because content rating is not a very important feature in the construction of this model, and one could even find some sense of ordinality in it).

I hope that this article helped you better understand the difference between the two methods and the advantages and disadvantages of each. I mainly hope that you will choose one version over the other because you truly understand what it implies for your model and not, as I did at first, because you don’t want to have too many columns in your dataframe… I’m at least happy that the question piqued my interest enough to want to look into it.
