Sports data, regression analysis, and interaction effects

Christer Thrane (christer.thrane@inn.no)

Dec 15, 2023

Regression analysis is the workhorse of quantitative (social) science, and sports data are instructive for showing how regression works, as I have demonstrated in several previous posts. A recurring topic in regression-based research is the phenomenon of interaction effects, and such effects are what I’m going to show and explain today, as always in non-technical language. That said, this post presumes that you have a rudimentary grasp of plain-vanilla linear regression. (If not, see my previous sports blog here.) I use my football (soccer) data for illustration. These data have been analyzed elsewhere, but for present purposes I drop the goalkeepers from the sample. That leaves 217 players for analysis.

Age and total number of matches played in career

Figure 1 shows the association between age and total number of matches played in career by means of a scatterplot including a linear regression line. As expected, older players have on average played more matches than younger ones. The slope of the line is almost 13, suggesting that one more year of age “entails” 13 more matches on average. The R-squared is 52%, meaning that age accounts for roughly half of the variation in matches played. So far so good.
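For readers who want to see where a slope and an R-squared come from, here is a minimal Python sketch. It does not use my football data: the 217 "players" are simulated, and the slope of 13 and the noise level are assumptions chosen only to resemble Figure 1. The function name simple_ols is likewise just an illustrative choice.

```python
import random


def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single x-variable, R-squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Simulated data, NOT the real sample: 217 hypothetical outfield players.
rng = random.Random(1)
age = [rng.uniform(18, 36) for _ in range(217)]
# Assumed process: about 13 extra matches per year of age, plus random noise.
# The simulation is stylized, so a few simulated values may be negative.
matches = [13 * (a - 17) + rng.gauss(0, 60) for a in age]

intercept, slope, r_squared = simple_ols(age, matches)
print(f"slope = {slope:.2f}, R-squared = {r_squared:.2f}")
```

The printed slope should land near the assumed value of 13, but not exactly on it, because the noise is random.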

The plain-vanilla regression model in Figure 1 makes many assumptions, one of which is of vital importance in our context: it assumes that the slope of the regression line is of roughly the same magnitude for all subgroups in the data. Whether this holds is always an empirical question, and it should be examined accordingly (within reason, of course).

When subgroups in the data have different regression slopes

Two subgroups in the present data are the national team players and the players not on a national team. In Figure 2, I have “re-run” the analysis in Figure 1 using separate colors for the two subgroups and for their corresponding regression lines. The figure pretty much speaks for itself: there is a stronger association (as in a larger regression slope) for the national team players. In other words, the effect of age on total number of matches depends on player type. If we now replace age with x, number of matches with y, and player type with z, we have the formal textbook definition of an interaction effect: the effect of x on y depends on (the value of) z.
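The idea of subgroup-specific slopes can be sketched in a few lines of Python. Again, the data are simulated and are not my football data: the share of national team players (40%) and the two "true" slopes (10 and 18 matches per year) are assumptions made up for illustration, as is the helper name slope_of.

```python
import random


def slope_of(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Simulated data, NOT the real sample. All numbers are assumptions.
rng = random.Random(2)
players = []
for _ in range(217):
    age = rng.uniform(18, 36)
    national = 1 if rng.random() < 0.4 else 0   # z: 1 = national team player
    true_slope = 18 if national else 10         # the effect of x depends on z
    matches = true_slope * (age - 17) + rng.gauss(0, 40)
    players.append((age, national, matches))

# Fit one regression line per subgroup, as in a two-color scatterplot.
slopes = {}
for group in (0, 1):
    ages = [a for a, z, m in players if z == group]
    ys = [m for a, z, m in players if z == group]
    slopes[group] = slope_of(ages, ys)

print(f"slope, not on national team: {slopes[0]:.2f}")
print(f"slope, national team:        {slopes[1]:.2f}")
print(f"difference in slopes:        {slopes[1] - slopes[0]:.2f}")
```

The difference between the two slopes is what a regression model with an interaction variable estimates in one go.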

The technicalities

How do we more formally know that we are in the presence of a real interaction effect? Two things are of note in this regard: (1) The regression coefficient for the so-called interaction variable (i.e., the product of age and player type) must be clearly larger or smaller than 0 as well as statistically significant. (2) The R-squared should be noticeably larger than it is for the (plain-vanilla) regression model not containing the interaction variable. Stata-exhibit 1 highlights the particulars.
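The two checks can be sketched in Python as well. This is not a reproduction of the Stata exhibit: the data are simulated, the "true" interaction of 8 matches per year is an assumption, and the small helpers invert and ols are hand-rolled for illustration rather than taken from any statistics library. I also assume that the comparison model contains age and player type but not their product.

```python
import math
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(rows, y):
    """Least squares; returns (coefficients, standard errors, R-squared)."""
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    inv = invert(xtx)
    coef = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, y)]
    sse = sum(e * e for e in resid)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = sse / (n - k)                       # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return coef, se, 1 - sse / sst


# Simulated data, NOT the real sample. All numbers are assumptions.
rng = random.Random(3)
age, national, matches = [], [], []
for _ in range(217):
    a = rng.uniform(18, 36)
    z = 1 if rng.random() < 0.4 else 0
    age.append(a)
    national.append(z)
    matches.append((10 + 8 * z) * (a - 17) + rng.gauss(0, 40))

# Model without interaction: matches = b0 + b1*age + b2*national
plain_rows = [[1.0, a, z] for a, z in zip(age, national)]
# Model with interaction: adds b3 * (age * national)
full_rows = [[1.0, a, z, a * z] for a, z in zip(age, national)]

_, _, r2_plain = ols(plain_rows, matches)
coef_full, se_full, r2_full = ols(full_rows, matches)
t_interaction = coef_full[3] / se_full[3]

print(f"interaction coefficient = {coef_full[3]:.2f}, t = {t_interaction:.2f}")
print(f"R-squared without interaction = {r2_plain:.3f}")
print(f"R-squared with interaction    = {r2_full:.3f}")
```

As a rough rule of thumb for a sample of this size, a t-value beyond about 2 in absolute value corresponds to significance at the 5% level. Note also that R-squared can never decrease when a variable is added to a model, so the question is whether the increase is large enough to matter.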

Takeaways and implications

Regression models should, more often than they do, entertain the possibility that the effect of the main x-variable on y is of different magnitudes for subgroups in the data. This blog post has explained the key idea in question. Finally, it is often claimed that a picture says more than 1,000 words. When it comes to interaction effects, this adage is spot on.

About me: I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you can find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
