Published in Analytics Vidhya

# Interaction effects in machine learning

As an analyst you want to get the most out of your regression model. While you are often limited (or spoilt) by the number of features in your dataset, there are creative ways to power up your model.

One of them is to look at interaction effects between features. An interaction term can be created as a combination of two or more existing features.
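As a minimal sketch of what "a combination of two features" can mean in practice, the most common choice is simply the product of two columns. The rows and column names below (`area`, `kitchen_good`, `area_x_kitchen`) are made up for illustration and do not come from the dataset used in this article.

```python
# Hypothetical rows: kitchen_good is a 0/1 dummy (1 = good, 0 = fair).
homes = [
    {"area": 1500, "kitchen_good": 1},
    {"area": 1500, "kitchen_good": 0},
    {"area": 2400, "kitchen_good": 1},
]

def add_interaction(rows, a, b, name):
    """Return copies of the rows with an extra column equal to row[a] * row[b]."""
    return [{**row, name: row[a] * row[b]} for row in rows]

homes_with_term = add_interaction(homes, "area", "kitchen_good", "area_x_kitchen")
for row in homes_with_term:
    print(row)
```

With a 0/1 dummy, the product equals the area for homes where the dummy is 1 and zero everywhere else, which is what lets a model learn a separate slope for that group.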

Let’s understand with an example.

Here is a regression plot of house prices against one feature, area. No surprises here: there is a positive relationship between area and home prices.

There is another feature ‘KitchenQuality’ that you think should affect house prices. After all, it is logical to assume that homes with good quality kitchens are priced higher than the rest.

Here is how the relationship looks with KitchenQuality.

No surprises again: homes with ‘good’ quality kitchens are generally priced higher than the ‘fair’ ones. However, notice that the slope is much steeper for homes with good quality kitchens, i.e. prices rise faster with area in these homes than in the ‘fair’ quality ones. The slope will become important later.

You fit a model using Area and KitchenQuality. Here are the results.

The model has an adjusted R-squared of 0.647 and both features are significant at the 95% confidence level.

A simplified model equation would look like this.

1. If KitchenQuality = Good (i.e. 1), Price = 4.462e+04 + 10.0173*Area + 1.209e+05*KitchenQuality or Price = (4.462e+04 + 1.209e+05) + 10.0173*Area
2. If KitchenQuality = Fair (i.e. 0), Price = 4.462e+04 + 10.0173*Area

Note: KitchenQuality is a dummy variable encoded as 1 for ‘Good’ and 0 for ‘Fair’.
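The two equations above can be written as a single function. This is only a sketch that plugs in the coefficients reported above; the function name and the example areas are illustrative assumptions.

```python
# Coefficients as reported above for the model without an interaction term.
INTERCEPT = 4.462e+04
COEF_AREA = 10.0173
COEF_KITCHEN = 1.209e+05

def predict_additive(area, kitchen_quality):
    """Predicted price; kitchen_quality is 1 for 'Good' and 0 for 'Fair'."""
    return INTERCEPT + COEF_AREA * area + COEF_KITCHEN * kitchen_quality

# Illustrative areas, not taken from the dataset.
for area in (1000, 2000, 3000):
    gap = predict_additive(area, 1) - predict_additive(area, 0)
    print(area, round(gap, 2))
```

The gap between a ‘Good’ and a ‘Fair’ home is the KitchenQuality coefficient at every area, because this model only lets the kitchen dummy shift the intercept.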

Let’s plot the two regression lines.

Notice that both equations have the same slope, 10.0173; they differ only in their intercepts. As expected, the lines are parallel.

Now, let’s fit the model with one additional feature which is the interaction term between Area and KitchenQuality i.e. Area * KitchenQuality.
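Here is a rough sketch of what fitting such a model involves, using NumPy least squares on synthetic data. The data is noise-free and the "true" coefficients are invented for the example, so this shows the mechanics only and does not reproduce the article's results or tooling.

```python
import numpy as np

# Synthetic, noise-free data; the "true" coefficients are made up for illustration.
rng = np.random.default_rng(0)
area = rng.uniform(800, 3500, size=200)
kitchen = rng.integers(0, 2, size=200)  # 1 = 'Good', 0 = 'Fair'
true_coef = np.array([50_000.0, 20_000.0, 5.0, 10.0])

# Design matrix columns: intercept, KitchenQuality, Area, Area * KitchenQuality.
X = np.column_stack([np.ones_like(area), kitchen, area, area * kitchen])
price = X @ true_coef

# Ordinary least squares fit; the interaction is just one more column.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(np.round(coef, 4))
```

Because the interaction term is an ordinary column in the design matrix, any linear regression routine can estimate its coefficient alongside the others.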

All features are significant at the 95% confidence level and the adjusted R-squared has gone up from 0.647 to 0.686. That means the new model explains more of the variation in price than the previous one.

A simplified model equation would look like this.

1. If KitchenQuality = Good (i.e. 1), Price = 9.444e+04 + 2.006e+04*KitchenQuality + 4.4779*Area + 10.7729*Area*KitchenQuality or Price = (9.444e+04 + 2.006e+04) + 15.2508*Area
2. If KitchenQuality = Fair (i.e. 0), Price = 9.444e+04 + 4.4779*Area
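The arithmetic behind these two equations can be checked with a short sketch. It uses the coefficients reported above; the function names and the example areas are illustrative assumptions.

```python
# Coefficients as reported above for the model with the interaction term.
INTERCEPT = 9.444e+04
COEF_KITCHEN = 2.006e+04
COEF_AREA = 4.4779
COEF_INTERACTION = 10.7729

def predict_interaction(area, kitchen_quality):
    """Predicted price; kitchen_quality is 1 for 'Good' and 0 for 'Fair'."""
    return (INTERCEPT
            + COEF_KITCHEN * kitchen_quality
            + COEF_AREA * area
            + COEF_INTERACTION * area * kitchen_quality)

def slope(kitchen_quality):
    """Change in predicted price per unit of area at a given kitchen level."""
    return COEF_AREA + COEF_INTERACTION * kitchen_quality

def gap(area):
    """Predicted price difference between a 'Good' and a 'Fair' home."""
    return predict_interaction(area, 1) - predict_interaction(area, 0)

print(slope(0), slope(1))
print(gap(1000), gap(3000))  # illustrative areas, not from the dataset
```

The slope for ‘Good’ homes is the Area coefficient plus the interaction coefficient, so the price gap between the two kitchen levels now widens as area increases instead of staying constant.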

The regression lines look like this.

Notice that both the slope and the intercept are different now. This model captures the relationship better than the earlier one, which was constrained to the same slope for each level of KitchenQuality.

# To conclude

1. Interaction effects can explain additional variation in the dependent variable over and above the individual features.
2. Not all interaction effects are significant. You will need to test them by evaluating the model with different interaction terms.
3. To avoid overfitting, do not mechanically add too many interaction terms.
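To get a feel for how quickly the number of candidate terms grows, here is a small sketch that lists every pairwise interaction for a handful of features. Only Area and KitchenQuality come from this article; the other feature names are hypothetical.

```python
from itertools import combinations

# Hypothetical feature list; only the first two names appear in the article.
features = ["Area", "KitchenQuality", "YearBuilt", "GarageCars", "LotSize"]

# Every unordered pair of features is one candidate interaction term.
candidate_terms = [f"{a}:{b}" for a, b in combinations(features, 2)]
print(len(candidate_terms), candidate_terms)
```

The count grows as n(n-1)/2 with the number of features, which is one reason to pick candidates based on domain reasoning and test each one rather than adding them all.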

Do you use interaction terms in your models?
