Data science and sports: three vital statistical techniques for showing variable associations

The main point of most sports, if not the only point, is to rank individual athletes or teams in the sporting event in question. Outcome variables in sports thus tend to be quantitative in nature; we are talking about varying amounts of something easily measured with numbers: points, assists, goals, hits, wins, meters, hours, minutes, seconds, tenths of a second, hundredths of a second and so on. Yet there is one important exception to this broad rule: the dummy variable. That is, on many occasions a sports contest boils down to a win or a loss — as in success or failure.

Whether the context is data science or research, three statistical techniques cover most of what you might need to know about how other variables — i.e., independent variables or x-variables — explain statistical variation in sports outcome variables. In this post, I am going to demonstrate these in my usual non-technical spirit. (I apologize to those more mathematically inclined!) The data set used for the analyses stems from a recent paper of mine in the journal Managing Sport and Leisure, which you find here in open-access form. The data in question pertain to 310 football (i.e., soccer) players in the top-tier Norwegian football league in the 2022 season (excluding the goalkeepers).

1. The comparison of averages (i.e., means)

We’ll start by examining the outcome variable number of goals scored in career. The minimum for this variable is no goals, the maximum is 191 goals, and the average or mean is 23 goals. (The variable is skewed, but let’s not get into that.) An important task in many data science or research contexts is to examine whether this overall mean of 23 goals varies according to subgroups in the data. In textbooks on statistics or research methods, the often-employed procedure in question is referred to as ANOVA (i.e., analysis of variance). Yet what this technique really — I mean really! — boils down to in practice is the comparison of subgroup means. Now, let’s consider the x-variable player position in this regard.
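For readers who like to see the nuts and bolts, here is a minimal sketch of such a comparison of subgroup means (and the ANOVA F-test it rests on) in Python/pandas. The analyses behind this post were done in Stata, so the file and column names below (eliteserien_2022.csv, career_goals, position) are hypothetical stand-ins for the actual data.

```python
import pandas as pd
from scipy import stats

# Hypothetical file and column names standing in for the actual data
players = pd.read_csv("eliteserien_2022.csv")

# Overall mean of career goals, followed by the subgroup means by position
print(players["career_goals"].mean())
print(players.groupby("position")["career_goals"].mean().round(1))

# The classic one-way ANOVA F-test behind the comparison of means
groups = [g["career_goals"].to_numpy() for _, g in players.groupby("position")]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```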

Figure 1 presents the mean values for the three player positions in a bar plot. We unsurprisingly note that defenders score fewer goals on average than midfielders: 11 versus 18. Yet attackers or forwards, again profoundly unsurprisingly, score the most with 40 goals on average.

Figure 1
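A bar plot along the lines of Figure 1 can be produced directly from those subgroup means. Again just a sketch, continuing with the hypothetical players data frame from above.

```python
import matplotlib.pyplot as plt

# Bar plot of mean career goals per player position (cf. Figure 1)
mean_goals = players.groupby("position")["career_goals"].mean()
mean_goals.plot(kind="bar", xlabel="Player position",
                ylabel="Mean goals scored in career", rot=0)
plt.tight_layout()
plt.show()
```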

We might also want to compare the three corresponding medians, and the boxplot in Figure 2 takes care of this. The main story is the same as in Figure 1; the three medians — the horizontal lines inside the boxes — follow the same rank order as the three means: attackers > midfielders > defenders. The outer limits of the three vertical lines — the so-called whiskers — capture practically all of the three scoring distributions. Finally, the individual dots represent outliers, i.e., players scoring many more goals than the rest of the players.

Figure 2
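A box plot in the spirit of Figure 2 (boxes, medians, whiskers and outlier dots) is equally straightforward; the sketch below keeps the same hypothetical column names.

```python
import matplotlib.pyplot as plt

# Box plot of career goals by player position (cf. Figure 2)
players.boxplot(column="career_goals", by="position")
plt.suptitle("")  # drop the automatic super-title pandas adds
plt.title("")
plt.xlabel("Player position")
plt.ylabel("Goals scored in career")
plt.show()
```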

The information in Figure 2 might alternatively be presented in a violin plot, as done in Figure 3. A violin plot is arguably a better presentation of the distribution of a variable than a box plot.

Figure 3
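Seaborn’s violinplot gives a Figure 3-style picture of the full distributions; once more, the column names are hypothetical.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Violin plot of career goals by player position (cf. Figure 3)
sns.violinplot(data=players, x="position", y="career_goals")
plt.xlabel("Player position")
plt.ylabel("Goals scored in career")
plt.show()
```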

We can also do subgroup comparisons for two x-variables simultaneously. The bar plot in Figure 4 is an example, where player position is broken down by the national origin of the players. We see that Norwegian and foreign players are rather similar regarding goal scoring among the attackers (in grey). In contrast, Norwegian players tend to score more than foreign players among the defenders and midfielders (in red and blue).

Figure 4
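The two-way breakdown in Figure 4 amounts to computing means over two grouping variables at once; in the sketch below, the origin column (Norwegian versus foreign) is another hypothetical name.

```python
import matplotlib.pyplot as plt

# Mean career goals by player position and national origin (cf. Figure 4)
two_way = (players.groupby(["position", "origin"])["career_goals"]
                  .mean()
                  .unstack())
print(two_way.round(1))

# Grouped bar plot: one cluster of bars per position, one bar per origin
two_way.plot(kind="bar", ylabel="Mean goals scored in career", rot=0)
plt.tight_layout()
plt.show()
```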

Of course, we can also make box plots or violin plots for the variables depicted in Figure 4, but let’s save that for another day. Now, it’s time to look at some other y-variables in sports.

2. The comparison of proportions

Sometimes you win, sometimes you lose; sometimes you get the main prize, sometimes you don’t. That is, pardon the pun, the name of the game. Sports variables such as these are dummy variables, and they have a special place in data science and research. When we want to associate an x-variable with a dummy y-variable, the go-to statistical technique has for 120 years or so been the cross-table. In the football data, one such “prize variable” refers to whether a player has been picked for the Team of the Round (ToR) or not. For the 2022 season, 46 percent of the players had been appointed once or more to this team, whereas 54 percent had not (obviously). This proportion is broken down by the quality of the club a player plays for in Stata-exhibit 1.

Stata-exhibit 1
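A cross-table with column percentages, i.e. the proportion making the ToR within each club-quality group, is one line of pandas. The team_of_round (0/1) and club_quality columns are hypothetical names; the exhibit itself comes from Stata.

```python
import pandas as pd

# Cross-table of Team of the Round (0/1) by club quality, shown as
# column percentages (cf. Stata-exhibit 1)
cross = pd.crosstab(players["team_of_round"], players["club_quality"],
                    normalize="columns") * 100
print(cross.round(0))
```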

We note that a player for a bottom-four club has a 27 percent probability of making the ToR (in red), whereas a player for a top-four club has a 67 percent probability of making this team (in blue). That is, the quality of the club matters a great deal for getting on the ToR. If we prefer, and some do, this cross-table may be presented in a graph — as in Figure 5.

Figure 5
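Because the y-variable is a 0/1 dummy, the subgroup means are the proportions themselves, so the graph version of the cross-table is just another bar plot (a sketch with the same hypothetical column names).

```python
import matplotlib.pyplot as plt

# Share of players making the Team of the Round per club-quality group
# (cf. Figure 5); the mean of a 0/1 dummy is the proportion of 1s
tor_share = players.groupby("club_quality")["team_of_round"].mean() * 100
tor_share.plot(kind="bar", xlabel="Club quality",
               ylabel="Percent making the ToR", rot=0)
plt.tight_layout()
plt.show()
```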

Yet we can also break down (the cross-table behind) Figure 5 by Norwegian and foreign players, as in Figure 6. But let’s leave it at that.

Figure 6

3. When both the sports outcome variable, y, and the x-variable are continuous

Once more our outcome variable is goals scored in career. The x-variable is the age of the players, also a continuous variable. Whenever we want to associate two continuous variables in data science or research, regression has been the obvious choice since Francis Galton in many ways invented this technique in the 1880s. (See my post here on his doings in that regard.) Figure 7 is a scatterplot with a regression line superimposed. (See my primer on linear regression in a sports context here.)

Figure 7
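A scatterplot with a superimposed regression line, as in Figure 7, can be sketched like this; the age column is another hypothetical name, and statsmodels does the fitting.

```python
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Bivariate regression of career goals on age (cf. Figure 7)
model = smf.ols("career_goals ~ age", data=players).fit()
print(model.params)  # intercept and slope of the regression line

ordered = players.sort_values("age")
plt.scatter(players["age"], players["career_goals"], alpha=0.5)
plt.plot(ordered["age"], model.predict(ordered), color="red")
plt.xlabel("Age")
plt.ylabel("Goals scored in career")
plt.show()
```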

Figure 7 unsurprisingly tells us that older players have scored more goals in their career than younger players. The reason is of course that older players have played more matches (getting more opportunities to score) than younger players — all else being equal. Such a regression analysis can also be extended to accommodate subgroups in the data. Figure 8 takes care of this for the three player positions. (We normally omit the individual data points from the plot when we also consider subgroups in regressions.)

Figure 8

Figure 8 shows that the association between age and goal scoring is strongest for the attackers, in the sense that they have the steepest regression line. This is hardly surprising. In the lingo, the present result is called an interaction effect (which you may read more about here).
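In regression terms, Figure 8 corresponds to letting the slope of age differ across positions, i.e. an age-by-position interaction. A sketch, still with the hypothetical column names:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

# Regression with an age x position interaction: one slope per position
interaction = smf.ols("career_goals ~ age * C(position)", data=players).fit()
print(interaction.params)

# One regression line per position, as in Figure 8 (points suppressed)
g = sns.lmplot(data=players, x="age", y="career_goals",
               hue="position", scatter=False, ci=None)
g.set_axis_labels("Age", "Goals scored in career")
plt.show()
```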

Extensions to multiple x-variables

In much (sports) data science and research, we want to scrutinize how several x-variables affect the sports outcome variable simultaneously. Multiple regression is then called for, and that might be the topic of a later post; stay tuned!

Caveat

I have not discussed the issue of statistical significance in this post. The reason is that I only wanted to describe the population data in question — I did not have any hypothetical superpopulation in mind. Indeed, it could be argued that significance testing does not (or should not) play a part in the statistical analysis of population data.

Takeaways and implications

Three statistical techniques may be used in data science or research contexts when the aim is to statistically explain variation in sports outcome variables: (1) comparison of means/ANOVA, (2) cross-tabulation, and (3) regression analysis. Of course, there are many more such techniques, but these will take you to your goal (!) most of the time.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
