Linear Regression — Dummy Variable Trap

Iftikhar Liaquat
Published in Analytics Vidhya
4 min read · May 10, 2020

Prerequisites

Overview

This post covers the following topics:

  • What is a Dummy Variable Trap?
  • Explanation
  • Dummy Variable Trap Test

What is a Dummy Variable Trap?

In linear regression, we use the dummy variable technique to build a model that can infer the relationship between categorical features and the outcome.

A “Dummy Variable” or “Indicator Variable” is an artificial variable created to represent an attribute with two or more distinct categories/levels.

The dummy variable trap is a scenario in which the independent variables become multicollinear after the addition of dummy variables.

Multicollinearity is a phenomenon in which two or more independent variables are highly correlated. In simple words, it means the value of one variable can be predicted from the values of the other variable(s). In the dummy variable trap, that prediction is exact, not just approximate.

Explanation

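To better understand the scenario, I’m going to explain it with an example. Let wages be a function of gender.

For linear regression, the equation takes the form:

$$\text{wage} = \beta_0 + \beta_1 \cdot \text{gender} + \varepsilon$$

where β0 is the constant (intercept), β1 is the coefficient on gender, and ε is the error term.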

Let’s say in our data, the possible gender values are:

  • Male
  • Female

First, we’ll add the following two dummy variables (two columns) to the dataset:

  • Dummy variable for Male will be DM
  • Dummy variable for Female will be DF
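As a minimal sketch of this step, the snippet below builds the two dummy columns with pandas. The sample rows and the wage values are hypothetical and chosen only for illustration; the column names DM and DF follow the naming used above.

```python
import pandas as pd

# Hypothetical data: the wage values are arbitrary and only for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "wage": [52, 48, 61, 45, 57],
})

# One 0/1 indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df["gender"], dtype=int)
dummies = dummies.rename(columns={"Male": "DM", "Female": "DF"})

# Attach the dummy columns to the original data.
data = pd.concat([df, dummies], axis=1)
print(data)
```

Every row has exactly one of DM and DF set to 1, which is the property that causes trouble once a constant term is also in the model.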

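Now, the equation of linear regression will be:

$$\text{wage} = \beta_0 + \beta_1 \cdot \text{DM} + \beta_2 \cdot \text{DF} + \varepsilon$$

Here β0 is the constant, and β1 and β2 are the coefficients on the two dummy variables.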

I’m going to create a table with arbitrary values to show you the dummy variable trap. For simplicity, I’m assuming that all constants are equal to 1.

In the table shown below, if the gender is male the value of column “Male” will be 1 and column “Female” will be 0. Also, if the gender is female the value of column “Male” will be 0 and column “Female” will be 1.

Let’s introduce a new column that holds the sum of the Male and Female columns and make it part of the dataset, so the problem is easy to see. After this, the dataset will become:

Here you can see that the “Constant” and “Calculated Col” columns are exactly the same. In other words, the Male and Female columns always add up to the constant column, so the independent variables are perfectly multicollinear. This breaks the linear regression assumption that no independent variable is an exact linear combination of the others, and this is what we call the dummy variable trap. By adding all the dummy variables to the data, we have made it impossible to estimate a unique coefficient for each variable.
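The comparison between the “Constant” and “Calculated Col” columns can be sketched in pandas as follows. The rows are hypothetical stand-ins for the arbitrary values mentioned above, not the article's original table.

```python
import pandas as pd

# Hypothetical rows, chosen only to illustrate the pattern.
table = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Constant": [1, 1, 1, 1],
    "Male": [1, 0, 0, 1],
    "Female": [0, 1, 1, 0],
})

# The extra column is the row-wise sum of the two dummy columns.
table["Calculated Col"] = table["Male"] + table["Female"]
print(table)

# True when the summed dummies reproduce the constant column in every row.
identical = bool((table["Constant"] == table["Calculated Col"]).all())
print(identical)
```

Because each row belongs to exactly one category, the sum is 1 in every row regardless of which values are used, so the check holds for any such dataset.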

To avoid the dummy variable trap, we should always add one fewer dummy variable (n-1) than the total number of categories present in the categorical data (n), because the nth dummy variable is redundant: it carries no new information.

Let’s eliminate one dummy variable from our equation. The new equation will be:

$$\text{wage} = \beta_0 + \beta_1 \cdot \text{DM} + \varepsilon$$

Here, if the value of “DM” is 1 then it means Male, and if its value is 0 then it means Female.

We don’t have to add a calculated column this time, as we have only one dummy column, and we can see that its values differ from the Constant column.

So guys, we have successfully avoided the dummy variable trap.
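One way to get n-1 dummy columns in practice is the `drop_first` option of `pd.get_dummies`. The sketch below assumes a hypothetical four-row sample; note that pandas drops the first category in sorted order, which here is Female, so the remaining column matches DM.

```python
import pandas as pd

# Hypothetical sample; only the gender column matters here.
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# drop_first=True drops the first category in sorted order ("Female"),
# so only the "Male" indicator is kept and Female becomes the baseline.
reduced = pd.get_dummies(df["gender"], drop_first=True, dtype=int)
reduced = reduced.rename(columns={"Male": "DM"})
reduced.insert(0, "Constant", 1)
print(reduced)

# The single dummy column no longer reproduces the constant column.
same_as_constant = bool((reduced["DM"] == reduced["Constant"]).all())
print(same_as_constant)
```

Which category is dropped only changes the baseline that the remaining coefficient is measured against; the fitted predictions stay the same.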

Dummy Variable Trap Test

To check whether a dataset has the dummy variable trap, we multiply the transpose of the independent variable matrix (X’) by the independent variable matrix (X), where X includes the constant column, and then we calculate the determinant of the result. If

  • The determinant is 0 then we are facing the dummy variable trap scenario.
  • The determinant is not 0 then we are not facing the dummy variable trap scenario.
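The test can be sketched with NumPy as below. The matrices are hypothetical examples with columns Constant, DM and DF, and the helper name `gram_determinant` is introduced here for illustration only. Since floating-point determinants are rarely exactly 0, the sketch compares against 0 with `np.isclose` instead of `==`.

```python
import numpy as np

# Hypothetical design matrix with all dummies; columns: Constant, DM, DF.
X_trap = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
], dtype=float)

# Same data with one dummy removed; columns: Constant, DM.
X_ok = X_trap[:, :2]

def gram_determinant(X):
    """Return det(X'X) for an independent variable matrix X."""
    return float(np.linalg.det(X.T @ X))

det_trap = gram_determinant(X_trap)
det_ok = gram_determinant(X_ok)

# A determinant of (numerically) zero signals the dummy variable trap.
print("all dummies in trap:", bool(np.isclose(det_trap, 0.0)))
print("n-1 dummies in trap:", bool(np.isclose(det_ok, 0.0)))
```

On larger datasets the determinant can be very large or very small for reasons unrelated to the trap, so comparing `np.linalg.matrix_rank(X)` with the number of columns is a more robust variant of the same check.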

Conclusion

To avoid the dummy variable trap, we should always add one fewer dummy variable (n-1) than the total number of categories present in the categorical data (n).

Please let me know if you have any questions or need any clarification.
