Dummy Variable Trap explained with Time Series Data

Sasidhar Sirivella
Published in Analytics Vidhya
3 min read · May 11, 2020


“Knowing where the trap is — that’s the first step in evading it.”

Many datasets that we come across contain a mix of continuous and categorical variables. A few algorithms, such as decision trees, can work with categorical data directly, but most other ML algorithms cannot handle label data, so we need to transform all the variables into numeric form.

Example: a City variable with the values ‘Philadelphia’, ‘New York’, ‘Washington’, ‘Delaware’.

But, how do we transform and include categorical data (non-numeric data) into our regression models?

By creating binary variables called “dummy variables”. Now, how do we create dummy variables? By using One Hot Encoding. This prevents our model from assuming a natural ordering between the categories.

Consider the City example mentioned earlier. Since there are 4 categories, we need 4 binary variables, as shown below:

    City          Philadelphia  New York  Washington  Delaware
    Philadelphia       1           0          0          0
    New York           0           1          0          0
    Washington         0           0          1          0
    Delaware           0           0          0          1

These binary columns are the dummy variables: variables containing the value 1 or 0 to represent the presence or absence of each categorical value.
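As a minimal sketch of this encoding step, here is how the City column could be one-hot encoded with pandas (the column name and city values follow the example above; the use of `pd.get_dummies` is an illustrative choice, not taken from the article):

```python
import pandas as pd

# Toy DataFrame with the City column from the example above
df = pd.DataFrame({"City": ["Philadelphia", "New York", "Washington", "Delaware"]})

# One Hot Encoding: one binary (0/1) column per category
dummies = pd.get_dummies(df["City"], dtype=int)
print(dummies)
```

Each row has exactly one 1, marking which category that observation belongs to, and 0 everywhere else.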

Now, let’s see what’s the TRAP in here.

If a categorical variable can take n values, a common mistake is to define n dummy variables. We need to resist this…
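The standard way out of the trap is to keep only n-1 dummy variables, since the nth column is fully determined by the others (if all n-1 dummies are 0, the observation must belong to the dropped category). A sketch of this fix, again assuming pandas and the City example (the `drop_first` approach is one common remedy, not necessarily the one the truncated article goes on to describe):

```python
import pandas as pd

df = pd.DataFrame({"City": ["Philadelphia", "New York", "Washington", "Delaware"]})

# drop_first=True keeps n-1 = 3 dummy columns instead of 4,
# removing the perfect multicollinearity that causes the trap
dummies = pd.get_dummies(df["City"], drop_first=True, dtype=int)
print(dummies.columns.tolist())  # 3 columns; the first category becomes the baseline
```

The dropped category acts as the baseline: a row of all zeros now unambiguously means that category, so the regression's intercept absorbs it.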
