Linear and non-linear (e.g., polynomial) regression in sports

Although linear regression works well in situations where the underlying association between x and y deviates only slightly from a linear one, there are several situations in which we should consider a non-linear approach. In this regard, polynomial regression, a certain type of non-linear regression, comes in handy. In this blog post, I’ll show you the steps from a scatterplot to a polynomial regression in my usual non-technical manner. That said, if you are rusty on linear regression, you might consider my primer on this topic from the field of sports (see here).

The data set used in the analysis stems from a recent paper of mine in the scientific journal Managing Sport and Leisure, which you find in open-access form here. The sample in question contains 310 football (i.e., soccer) players in the top-tier Norwegian football league in the 2022 season (excluding the goalkeepers).

Two variables are of interest. The x-variable (or independent variable) is the average performance of the 310 players over the 2022 season, as measured by a ‘player metrics’ variable recorded by the Opta company. The y-variable (or dependent variable or response variable) is the number of times these players were picked by expert journalists to join the Team of the Round in the 2022 season. Descriptive statistics for these two variables appear in Stata-exhibit 1.

Stata-exhibit 1. Descriptive statistics for the two variables.

We note that the average performance variable (opta_perf) ranges from 5.91 to 8.15, with a mean of 6.15. Similarly, the Team of the Round variable (tor_22) ranges from 0 to 10 picks, with an average of 0.97. (The latter variable is thus very skewed, a point I will not get into here.)
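For readers who want to reproduce this kind of exhibit in Stata, a minimal sketch could look like the one below. It assumes the data are already loaded and that the variables carry the names opta_perf and tor_22 mentioned above.

    * Sketch: descriptive statistics for the performance and Team of the Round variables
    summarize opta_perf tor_22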

The scatterplot

The scatterplot (where I have added some ‘jitter’ for aesthetic reasons) between the two variables is shown in Figure 1. We note the obvious positive correlation: better performances entail more picks for the Team of the Round. The correlation coefficient, r, which is not shown in the figure, is 0.65.

Figure 1. Scatterplot.
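For the Stata-minded reader, a sketch of a plot like Figure 1, plus the correlation coefficient, might look as follows. The amount of jitter is my own guess, not the exact value used for the figure.

    * Sketch: jittered scatterplot of Team of the Round picks against performance
    twoway (scatter tor_22 opta_perf, jitter(5)), ///
        xtitle("Average performance (Opta)") ytitle("Team of the Round picks")

    * The correlation coefficient, r, mentioned in the text
    correlate tor_22 opta_perf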

The scatterplot and the linear regression line

In Figure 2, the linear regression line is added to the scatterplot (again using ‘jitter’). We note the upward-sloping line, with a regression coefficient (slope) of 2.56 and an R-squared of 0.42, or 42 percent. Nothing new of note happens here, and we thus draw the same conclusion as we did for Figure 1: better performances pay off, on average, in the sense that they entail more picks for the Team of the Round. That said, the fit between the data points and the regression line is far from perfect. More specifically, we might be looking for a regression line whose slope increases for larger values of the performance variable. We thus seek non-linearity!

Figure 2. Linear regression model.
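A hedged Stata sketch of this step, estimating the linear model and overlaying its fitted line (lfit) on the jittered scatter, could be:

    * Sketch: linear regression of Team of the Round picks on performance
    regress tor_22 opta_perf

    * Scatterplot with the linear fit overlaid
    twoway (scatter tor_22 opta_perf, jitter(5)) (lfit tor_22 opta_perf)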

The scatterplot and the polynomial (i.e., non-linear) regression line

In technical terms, the simplest polynomial regression involving x and y adds one new variable to the regression: the square of x. (This special case is often called quadratic regression.) The general idea is to fit a U-shaped or inverse U-shaped regression line to the data points. The results of such a polynomial regression for our football data are visualized in Figure 3.

Figure 3. Polynomial (or quadratic) regression model.
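In Stata, one way to sketch this model is to use factor-variable notation, which creates the squared term on the fly, and qfit for the plotted curve:

    * Sketch: quadratic (polynomial) regression; c.opta_perf##c.opta_perf enters
    * both opta_perf and its square into the model
    regress tor_22 c.opta_perf##c.opta_perf

    * Scatterplot with the quadratic fit (qfit) overlaid
    twoway (scatter tor_22 opta_perf, jitter(5)) (qfit tor_22 opta_perf)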

To put it roughly, Figure 3 suggests that it is average performances better than 6.75 that "really" start to pay off in terms of being picked for the Team of the Round. The fit between the data points and the U-shaped regression line also looks somewhat better than in the linear model in Figure 2. But how do we judge this formally? There are three criteria: (1) the significance level of the squared term, (2) the R-squared of the polynomial regression compared with that of the linear regression, and (3) theoretical reasoning or common sense. In our case, the squared term for the performance variable should be different from zero and statistically significant. The coefficient in question is 2.40, with a p-value < 0.0001. The R-squared for the polynomial regression model is 0.53, which is clearly larger than the analogous 0.42 for the linear model. Regarding theory or common sense, I have little to offer in the present context since I’m no football expert!
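Criteria (1) and (2) can be checked directly in Stata. The sketch below stores both models, tests the squared term, and lists the two R-squared values side by side; the stored names ‘linear’ and ‘quadratic’ are simply labels I have made up.

    * Sketch: fit and store both models for comparison
    quietly regress tor_22 opta_perf
    estimates store linear
    quietly regress tor_22 c.opta_perf##c.opta_perf
    estimates store quadratic

    * Criterion (1): is the squared term statistically different from zero?
    test c.opta_perf#c.opta_perf

    * Criterion (2): coefficients, p-values, and R-squared for both models
    estimates table linear quadratic, stats(r2) b(%9.3f) p(%9.4f)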

Takeaways and implications

Linear regression has a lot going for it. Yet sometimes we must extend the plain-vanilla regression model to accommodate non-linear associations in our data (and thus in real life!). The polynomial regression model explicated above is the first step in this endeavor.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
