What Is the Dummy Variable Trap and How to Avoid It?

Be careful when encoding categorical variables!

Rukshan Pramoditha
Data Science 365


You’ll often find categorical variables in your datasets, and you need to encode them as numbers before using them in most machine learning algorithms.

There are many encoding techniques. One-hot encoding and dummy encoding are among the most popular. You can read this post to learn more about those two techniques. I highly recommend reading it because it covers the prerequisite knowledge needed to understand today’s content.

Dummy variable trap

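The dummy variable trap occurs when we use one-hot encoding to encode categorical variables. One-hot encoding creates k dummy (binary) variables for each categorical variable, where k is the number of unique categories in that variable. Those dummy variables are perfectly multicollinear: in every row exactly one of them is 1, so they always sum to 1 and any one of them can be computed exactly from the others. In a model with an intercept, such as linear regression, that sum is identical to the intercept column.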
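To make the multicollinearity concrete, here is a minimal NumPy sketch. The six category codes are made up for illustration and are not from any real dataset. It checks that the dummy columns sum to 1 in every row and that adding an intercept column gives a matrix with fewer independent columns than total columns.

```python
import numpy as np

# Hypothetical category codes for six observations (k = 3 categories).
codes = np.array([0, 1, 2, 0, 2, 1])

# One-hot encode: one binary column per category.
one_hot = np.eye(3, dtype=int)[codes]

# Every row contains exactly one 1, so the dummy columns always sum to 1.
row_sums = one_hot.sum(axis=1)

# Add an intercept column of ones, as a linear regression model would.
X = np.column_stack([np.ones(len(codes), dtype=int), one_hot])

# X has 4 columns, but the intercept equals the sum of the three dummies,
# so one column carries no new information.
rank = np.linalg.matrix_rank(X)
print("row sums:", row_sums)
print("columns:", X.shape[1], "rank:", rank)
```

A rank lower than the number of columns is what prevents ordinary least squares from finding a unique set of coefficients.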

For example, we have a categorical variable Color with three categories called “Red”, “Green” and “Blue” and we encode that variable using one-hot encoding as follows.
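A small pandas sketch of this encoding might look like the following. Only the three category names come from the example. The five sample rows and the `Color_` column prefix are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows; only the category names come from the example.
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green", "Red"]})

# One-hot encoding: one binary column per category (k = 3 columns).
one_hot = pd.get_dummies(df["Color"], prefix="Color", dtype=int)
print(one_hot)

# In every row exactly one of the three columns is 1.
print(one_hot.sum(axis=1).tolist())
```

pandas orders the new columns alphabetically by category, so the result has `Color_Blue`, `Color_Green` and `Color_Red`.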

In this example, the dummy variable trap occurs because one of the three dummy variables is redundant: its value is fully determined by the other two. We can drop the redundant category and…
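One way to drop a category in pandas is the `drop_first` parameter of `get_dummies`. This is a sketch on made-up sample rows, not necessarily the approach the rest of the post takes.

```python
import pandas as pd

# Hypothetical sample rows; only the category names come from the example.
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green", "Red"]})

# drop_first=True keeps k - 1 = 2 dummy columns. The first category in
# sorted order ("Blue") is dropped and becomes the baseline.
dummies = pd.get_dummies(df["Color"], prefix="Color", drop_first=True, dtype=int)
print(dummies)

# A "Blue" row is now encoded as zeros in both remaining columns.
blue_row = dummies[df["Color"] == "Blue"].iloc[0].tolist()
print(blue_row)
```

No information is lost: a row with zeros in both `Color_Green` and `Color_Red` can only be "Blue".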

