Soccer as a clean laboratory for doing — and visualizing — statistical analysis: Volume 1

In this blog post, I hope to show that data on sports have much to offer in terms of yielding intuitive statistical results. The data we will analyze refer to a sample of male players in the top-tier Norwegian soccer league. These data have been extensively analyzed elsewhere, but for the present purposes we discard the goalkeepers from the sample. That leaves 217 players (or observations) for analysis.

Preliminaries

For starters, we consider the annual income of the soccer players as our y-variable or dependent variable (or outcome variable). Such income might, to simplify somewhat, be thought of as a function or result of three main factors:

(1) Player performance variables (e.g., number of goals scored, number of assists, number of matches played for the national team or simply whether the player has played for it at all, and player “metrics” variables)

(2) Player characteristics (e.g., age, origin, height, number of matches played in career)

(3) Club characteristics (e.g., the quality/rank of the club to which the player is affiliated)

The annual income variable is presented in Figure 1. We note the non-normal and right-skewed distribution and the average annual income of roughly 86,000 Euro (i.e., the dashed vertical line). Below we look at how some of the three factors above relate to annual income.

Figure 1 Histogram for annual player income; the dashed vertical line is the average income.
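
A right-skewed distribution like this one shows up in the numbers as a mean that sits above the median, because a handful of very high earners pull the mean upward. The minimal Python sketch below illustrates this with ten made-up incomes (in thousand euro); they are not the data behind Figure 1, and the `skewness` helper is written here only for illustration. It also shows that taking the logarithm of income reduces the skew.

```python
import math
from statistics import mean, median

# Made-up annual incomes in thousand euro; NOT the data behind Figure 1.
incomes = [30, 40, 45, 50, 55, 60, 70, 90, 150, 400]


def skewness(values):
    """Population skewness: the average cubed z-score (0 for a symmetric sample)."""
    values = list(values)
    m = mean(values)
    sd = math.sqrt(mean((v - m) ** 2 for v in values))
    return mean(((v - m) / sd) ** 3 for v in values)


mean_income = mean(incomes)      # pulled upward by the few very high earners
median_income = median(incomes)  # the "typical" player, robust to the long tail

skew_raw = skewness(incomes)
skew_log = skewness(math.log(v) for v in incomes)

print(f"mean = {mean_income}, median = {median_income}")
print(f"skewness: raw = {skew_raw:.2f}, logged = {skew_log:.2f}")
```

With these illustrative numbers the mean lies well above the median, and the logged incomes are still right-skewed but noticeably less so.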

Annual income and performance

Having played for one's national team is unquestionably an important performance indicator in top-tier soccer. Figures 2 and 3 both show that players with national team experience (n = 72) generally have higher annual incomes than players without such experience (n = 145). Figure 3, a so-called violin plot, also shows the much larger spread around the median for the players with national team experience.

Figure 2 Histogram for annual player income, by national team experience or not.
Figure 3 Distribution and median for annual player income, by national team experience or not.

The association between income and national team experience might differ depending on whether the player is of Norwegian or foreign origin. Figure 4, a so-called box and whiskers plot, dives into this. Notice how, among the Norwegian players, those on the national team clearly make more money per year than those not on the national team. That is, the median (i.e., the horizontal line within the box) is located higher on the income (or y) axis. Yet this we already knew. In contrast, among the foreign players, the vertical distance between the two medians is much shorter. In other words, being on the national team pays off more among the Norwegian players. This phenomenon, where an effect is larger in one subgroup of the data than in another, is called an interaction effect in statistical lingo.

Figure 4 Box and whiskers plot for annual player income, by national team experience or not and by player origin.
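
One way to put a number on such an interaction is to compare the median gap (national team versus not) within each origin group and then take the difference between the two gaps. The sketch below does this in Python with made-up incomes in thousand euro; the group labels and values are hypothetical and are not the data behind Figure 4.

```python
from statistics import median

# Made-up incomes in thousand euro; NOT the data behind Figure 4.
incomes = {
    ("norwegian", "national_team"): [90, 120, 150, 200, 260],
    ("norwegian", "no_national_team"): [40, 55, 60, 75, 90],
    ("foreign", "national_team"): [70, 85, 100, 120, 160],
    ("foreign", "no_national_team"): [60, 75, 90, 100, 130],
}


def median_gap(origin):
    """Median income of national team players minus that of the other players."""
    with_experience = median(incomes[(origin, "national_team")])
    without_experience = median(incomes[(origin, "no_national_team")])
    return with_experience - without_experience


gap_norwegian = median_gap("norwegian")
gap_foreign = median_gap("foreign")

# The interaction is the difference between the two gaps.
interaction = gap_norwegian - gap_foreign

print(f"gap among Norwegian players: {gap_norwegian}")
print(f"gap among foreign players:   {gap_foreign}")
print(f"difference between the gaps: {interaction}")
```

If the two gaps were equal, the difference would be zero and there would be no interaction; the larger the difference, the more the payoff of national team experience depends on origin.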

Annual income and club characteristics

We might surmise that better-performing clubs tend to pay their players more on average than worse-performing clubs. After all, one of the main reasons why a club performs well is that it has a very competent squad at its disposal, and competent players in turn tend to command higher wages, all else being equal. Figure 5 looks at this, where the “Fitted values” represent the regression line. We note the downward-sloping regression line, which suggests that worse-performing clubs (i.e., larger numbers = “higher” rank = worse performance) pay their players less on average. That said, we would be very hard pressed to describe this association as strong for these data.

Figure 5 Scatterplot of annual player income and club rank, with linear regression line.
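
A regression line of this kind can have a clearly negative slope and still describe a weak association. The Python sketch below fits a simple linear regression by hand on five made-up observations (club rank and income in thousand euro); the numbers are hypothetical and are not the data behind Figure 5.

```python
# Made-up data; NOT the data behind Figure 5.
club_rank = [1, 2, 3, 4, 5]      # 1 = best-placed club
income = [120, 60, 110, 70, 90]  # thousand euro

n = len(club_rank)
mean_x = sum(club_rank) / n
mean_y = sum(income) / n

# Sums of squares and cross-products around the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(club_rank, income))
s_xx = sum((x - mean_x) ** 2 for x in club_rank)
s_yy = sum((y - mean_y) ** 2 for y in income)

slope = s_xy / s_xx                    # change in income per one-step worse rank
intercept = mean_y - slope * mean_x    # fitted income at rank 0
r_squared = s_xy ** 2 / (s_xx * s_yy)  # share of income variance explained

print(f"fitted line: income = {intercept:.1f} + ({slope:.1f}) * rank")
print(f"R-squared = {r_squared:.3f}")
```

In this toy example the slope is negative, yet rank explains only about a tenth of the variation in income, which is the kind of pattern the paragraph above describes.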

Annual income and player characteristics

Human capital theory posits that income increases with age, but only up to a certain point, after which it starts to decline. This inverse U-bend has been noted in, literally, thousands of prior studies worldwide. Figure 6 sheds light on this and suggests that the non-linear regression line (in red) has a somewhat better fit than the linear one (in green). That said, in the present case the inverse U-bend model is not a much better description of the relationship than the linear model.

Figure 6 Scatterplot of annual player income and match experience, with linear and non-linear regression lines.
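
Comparing a linear and an inverse U-shaped (quadratic) fit comes down to fitting both and comparing how much of the variation each explains. The Python sketch below does this with a small least-squares routine on made-up data. The variable name `age`, the values, and the helpers `fit_polynomial` and `r_squared` are all assumptions for illustration, and the curvature is deliberately exaggerated compared with Figure 6. The same code would work with match experience on the x-axis.

```python
def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations (X'X)b = X'y."""
    k = degree + 1
    # Augmented matrix [X'X | X'y]; entry (i, j) of X'X is sum(x ** (i + j)).
    a = [[sum(xi ** (i + j) for xi in x) for j in range(k)]
         + [sum(yi * xi ** i for xi, yi in zip(x, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]  # b0, b1, ..., b_degree


def r_squared(x, y, coefs):
    """Share of the variance in y explained by the fitted polynomial."""
    mean_y = sum(y) / len(y)
    fitted = [sum(b * xi ** p for p, b in enumerate(coefs)) for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data; NOT the data behind Figure 6.
age = [19, 22, 25, 28, 31, 34]
income = [40, 70, 95, 105, 100, 85]  # thousand euro

linear = fit_polynomial(age, income, degree=1)
quadratic = fit_polynomial(age, income, degree=2)

r2_linear = r_squared(age, income, linear)
r2_quadratic = r_squared(age, income, quadratic)

# A negative coefficient on age squared means an inverse U;
# the curve peaks where its derivative b1 + 2 * b2 * age is zero.
peak_age = -quadratic[1] / (2 * quadratic[2])

print(f"R-squared: linear = {r2_linear:.3f}, quadratic = {r2_quadratic:.3f}")
print(f"coefficient on age squared = {quadratic[2]:.3f}, peak at age {peak_age:.1f}")
```

Because the linear model is nested in the quadratic one, the quadratic R-squared can never be lower; the question in practice is whether the gain is large enough to matter, which in the data above it is not.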

That’s it for today, folks. Stay tuned for more upcoming analyses of sports data. For example, there is much to suggest that the income variable should be logged … By the way, I do all my statistical computing in Stata. But I hope to learn R someday … I also hope to learn to upload my datasets to GitHub … And to provide fancier graphs … I hope you enjoyed or learned something from my second blog post. My first, on the normality assumption in regression modelling, you can find here.

About me: I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you can find them here and here. You can reach me at christer.thrane@inn.no
