Published in Analytics Vidhya

Interaction effects in machine learning

As an analyst you want to get the most out of your regression model. While you are often limited (or spoilt) by the number of features in your dataset, there are creative ways to power up your model.

One of them is to look at interaction effects between features. An interaction term is created by combining two or more existing features, most commonly by multiplying them.
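As a minimal sketch of what "combining features" means in practice, the snippet below builds an interaction column as the product of two existing columns. The rows, the column names, and the helper `add_interaction` are made up for illustration; they are not taken from the article's dataset.

```python
# Hypothetical rows: area in square feet, kitchen quality as a 0/1 dummy.
homes = [
    {"area": 1500, "kitchen_good": 1},
    {"area": 1500, "kitchen_good": 0},
    {"area": 2400, "kitchen_good": 1},
]


def add_interaction(rows, a, b, name):
    """Return copies of the rows with a new column equal to row[a] * row[b]."""
    return [{**row, name: row[a] * row[b]} for row in rows]


homes_with_interaction = add_interaction(
    homes, "area", "kitchen_good", "area_x_kitchen"
)
for row in homes_with_interaction:
    print(row)
```

With a 0/1 dummy, the product equals the area when the dummy is 1 and zero otherwise, which is what lets a model learn a separate slope for that group.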

Let’s understand with an example.

Here is a regression plot of house prices against a single feature, area. No surprises here: there is a positive relationship between area and home prices.

There is another feature ‘KitchenQuality’ that you think should affect house prices. After all, it is logical to assume that homes with good quality kitchens are priced higher than the rest.

Here is how the relationship looks with KitchenQuality.

No surprises again: homes with ‘good’ quality kitchens are generally priced higher than the ‘fair’ ones. However, notice that the slope is much steeper for homes with good quality kitchens, i.e. prices rise much faster with area in these homes than in the ‘fair’ quality ones. The slope will become important later.

You fit a model using Area and KitchenQuality. Here are the results.

The model has an adjusted R-squared of 0.647, and both features are significant at the 95% confidence level.

A simplified model equation would look like this.

  1. If KitchenQuality = Good (i.e. 1), Price = 4.462e+04 + 10.0173*Area + 1.209e+05*KitchenQuality, or Price = (4.462e+04 + 1.209e+05) + 10.0173*Area
  2. If KitchenQuality = Fair (i.e. 0), Price = 4.462e+04 + 10.0173*Area

Note: KitchenQuality is a dummy variable encoded as 1 for ‘Good’ and 0 for ‘Fair’.
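The two cases above can be sketched as a single function. The coefficients are the ones reported in the article; the function name `predict_price` and the sample areas are assumptions added for illustration.

```python
# Coefficients as reported for the model without an interaction term.
INTERCEPT = 4.462e04
COEF_AREA = 10.0173
COEF_KITCHEN = 1.209e05


def predict_price(area, kitchen_good):
    """Price = intercept + b_area * Area + b_kitchen * KitchenQuality.

    kitchen_good is the dummy variable: 1 for 'Good', 0 for 'Fair'.
    """
    return INTERCEPT + COEF_AREA * area + COEF_KITCHEN * kitchen_good


# Illustrative areas, not taken from the dataset.
for area in (1000, 2000, 3000):
    gap = predict_price(area, 1) - predict_price(area, 0)
    print(f"area={area}: Good minus Fair = {gap:,.2f}")
```

Whatever the area, the difference between a Good and a Fair kitchen is the KitchenQuality coefficient. The dummy shifts the intercept only and leaves the slope unchanged.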

Let’s plot the two regression lines.

Notice that both equations have the same slope, 10.0173; they differ only in their intercepts. As expected, the lines are parallel.

Now, let’s fit the model with one additional feature: the interaction term between Area and KitchenQuality, i.e. Area * KitchenQuality.

All features are significant at the 95% confidence level, and the adjusted R-squared has gone up from 0.647 to 0.686. That means the new model explains more of the variation in price than the previous one.
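To make the comparison concrete, here is a rough sketch that fits both models by ordinary least squares with NumPy and compares their adjusted R-squared. The data is synthetic and generated with a built-in interaction, so the numbers will not match the article's. The helper `adjusted_r2` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption): the price depends on area, kitchen quality,
# and their product, plus random noise.
rng = np.random.default_rng(0)
n = 1000
area = rng.uniform(800, 3500, size=n)
kitchen_good = rng.integers(0, 2, size=n)  # dummy variable: 0 or 1
noise = rng.normal(0, 15_000, size=n)
price = 90_000 + 20_000 * kitchen_good + 5 * area + 10 * area * kitchen_good + noise


def adjusted_r2(features, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    X = np.column_stack([np.ones(len(y))] + features)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n_obs, n_predictors = len(y), len(features)
    # Penalise the fit for the number of predictors used.
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


adj_main = adjusted_r2([area, kitchen_good], price)
adj_inter = adjusted_r2([area, kitchen_good, area * kitchen_good], price)
print(f"without interaction: {adj_main:.3f}")
print(f"with interaction:    {adj_inter:.3f}")
```

Adjusted R-squared is used here rather than plain R-squared because plain R-squared never decreases when a feature is added, whereas the adjusted version rises only if the new term improves the fit enough to justify the extra parameter.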

A simplified model equation would look like this.

  1. If KitchenQuality = Good (i.e. 1), Price = 9.444e+04 + 2.006e+04*KitchenQuality + 4.4779*Area + 10.7729*Area*KitchenQuality, or Price = (9.444e+04 + 2.006e+04) + 15.2508*Area
  2. If KitchenQuality = Fair (i.e. 0), Price = 9.444e+04 + 4.4779*Area
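The same equations can be written as code to show where the two slopes come from. The coefficients are the ones reported above; the function names and sample areas are assumptions added for illustration.

```python
# Coefficients as reported for the model with the interaction term.
INTERCEPT = 9.444e04
COEF_KITCHEN = 2.006e04
COEF_AREA = 4.4779
COEF_INTERACTION = 10.7729


def predict_price_with_interaction(area, kitchen_good):
    """Price with main effects plus the Area * KitchenQuality term."""
    return (
        INTERCEPT
        + COEF_KITCHEN * kitchen_good
        + COEF_AREA * area
        + COEF_INTERACTION * area * kitchen_good
    )


def slope_per_unit_area(kitchen_good):
    """Change in predicted price for one extra unit of area."""
    return COEF_AREA + COEF_INTERACTION * kitchen_good


print("slope, Fair:", slope_per_unit_area(0))
print("slope, Good:", round(slope_per_unit_area(1), 4))

# Illustrative areas, not taken from the dataset.
for area in (1000, 2000, 3000):
    gap = predict_price_with_interaction(area, 1) - predict_price_with_interaction(area, 0)
    print(f"area={area}: Good minus Fair = {gap:,.2f}")
```

The gap between Good and Fair is no longer constant: it grows by the interaction coefficient for every extra unit of area, which is why the two regression lines are no longer parallel.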

The regression lines look like this.

Notice that both the slope and the intercept are different now. This model captures the relationship better than the earlier one, which was constrained to use the same slope for each level of KitchenQuality.

To conclude:

  1. Interaction effects can explain additional variation in the dependent variable over and above the individual features.
  2. Not all interaction effects are significant. You will need to test them by evaluating the model with different interaction terms.
  3. To avoid overfitting, do not mechanically add too many interaction terms.
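To see why adding every possible interaction is risky, the sketch below counts the pairwise interaction terms available for a given number of features. The feature list is partly hypothetical: `Bedrooms` and `Age` are not in the article and are included only to make the count visible.

```python
from itertools import combinations
from math import comb

# Area and KitchenQuality come from the article; the other two are hypothetical.
features = ["Area", "KitchenQuality", "Bedrooms", "Age"]

# Every unordered pair of features is a candidate two-way interaction.
candidate_pairs = list(combinations(features, 2))
for left, right in candidate_pairs:
    print(f"{left} * {right}")

# The number of candidate pairs grows quadratically with the feature count.
for n_features in (5, 10, 20, 50):
    print(n_features, "features ->", comb(n_features, 2), "pairwise interactions")
```

Four features already give six candidate pairs, and the count keeps growing as p(p-1)/2. It is usually more sensible to test a handful of interactions that make sense for the problem than to add them all.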

Credits: Codecademy

Do you use interaction terms in your models?

Harish Daryani