Soccer as a clean laboratory for doing — and visualizing — statistical analysis: Volume 2: Predicting goals

In this second blog post of mine from the world of soccer (the first one is here), I examine goal scoring and some of its determinants. The data I analyze refer to a sample of male players in the top-tier Norwegian soccer league. These data have been extensively analyzed elsewhere, but for the present purposes I discard the goalkeepers from the sample. That leaves 217 players, or observations, for analysis. I will publish more posts based on these data later, with the modest aim of showing how sports data have much to offer when it comes to demonstrating the workings of several statistical techniques in an intuitive and non-technical manner.

Preliminaries

Scoring goals is the currency of note in the game of soccer. You might have read that goals scored in soccer matches can be thought of as some kind of random phenomenon following the so-called Poisson distribution. Well, you got that right, but in this blog post the units are not soccer matches. That is, we are looking at how goal-scoring is related to the background characteristics of soccer players (i.e., the players are the units), and that’s an entirely different ballgame. (Pardon the pun.) The histogram for our goal-scoring variable, the total number of goals scored during the season, appears in Figure 1. Note that not scoring a goal is the most typical outcome among the players, as shown by the vertical bar on the left-hand side. The mean is 2.24 goals, as marked by the vertical dashed line.

Figure 1 Histogram for total number of goals scored in season.

The x-variables (predictors) and the statistical machinery for prediction

Many factors or predictors (or x-variables) might affect the number of goals a soccer player scores in a season. We look at three such factors: (1) number of games played in the season (average = 20.5; range 2–30), (2) player position, and (3) player origin.

I use multiple regression to find out how these three x-variables simultaneously are associated with goal-scoring. But since I am interested in the predictions from this model, I will not go into the details of the inner workings of multiple regression. (I will return to that in an upcoming blog post using the very same data, though.) Note also that traditional multiple regression is not ideal for analyzing the number of goals variable. The reasons are spelled out in the parenthetical paragraph that follows, which you may skip if you want to get straight to the results.

(A methodological interlude: The goal-scoring variable, y, is a count variable, that is, the number of times something happened. In the social and behavioral sciences, which are the sciences I know a little about, such discrete variables are typically modelled by, yes, count-data regression models. There are two main types: the Poisson model (yeah, that one; hereafter PM) and the negative binomial model (NBM). The NBM is preferred when the variance of y is larger than its mean (called overdispersion), which typically is the case. But it does not stop there for our example, unfortunately. Since the goal-scoring variable has “excess zeroes” (zero goals is the most common outcome among the players), we should probably employ the zero-inflated negative binomial regression model. I kid you not! But that’s something for another blog post. A longish parenthesis indeed …)
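For readers who like to see the arithmetic behind the “excess zeroes” remark, here is a minimal Python sketch. It is not part of the original analysis, which was done in Stata. It uses only the two numbers reported above (a mean of 2.24 goals and 217 players) and asks what a plain Poisson distribution with that mean would imply. The function name poisson_pmf is just an illustrative helper. The post does not report the actual number of zero-goal players, so the sketch makes no comparison with the real count.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)


MEAN_GOALS = 2.24  # sample mean reported in the post
N_PLAYERS = 217    # outfield players in the sample

# Share and expected number of players with zero goals if goals were Poisson
p_zero = poisson_pmf(0, MEAN_GOALS)
expected_zero_scorers = N_PLAYERS * p_zero

# The most likely goal count under the same Poisson distribution
most_likely_count = max(range(0, 31), key=lambda k: poisson_pmf(k, MEAN_GOALS))

print(f"P(0 goals) under Poisson: {p_zero:.3f}")
print(f"Expected zero-goal players: {expected_zero_scorers:.1f}")
print(f"Most likely goal count under Poisson: {most_likely_count}")
```

Under these assumptions, a plain Poisson distribution would make two goals, not zero, the most likely outcome, with only about one player in ten ending the season goalless. A histogram in which zero is the most typical value is therefore a hint that a plain Poisson model may fit poorly, which is the motivation for the zero-inflated variants mentioned above.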

The results (predictions)

My first and plain-vanilla multiple regression model did not produce optimal results. That is, it gave too many negative predictions (i.e., negative numbers of goals), which are possible in the game of mathematics but not so much in the game of soccer. To overcome this, I estimated a model allowing for interaction effects, meaning that the effect of one x-variable is allowed to depend on the value of another. Let’s first focus on the x-variables number of games played in the season and player position, as in Figure 2. Note also that since we are considering the effect of three x-variables simultaneously, we can only depict the predictions in a stylized fashion, that is, without showing the actual data points.

Figure 2 Total number of goals scored by number of matches played and player position, based on a multiple regression model also including player origin and interaction variables.

In Figure 3, the variable player origin replaces the variable player position. We note how the multiple regression model predicts that foreign players (i.e., the top regression line) score more goals than Norwegian players.

Figure 3 Total number of goals scored by number of matches played and player origin, based on a multiple regression model also including player position and interaction variables.
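Both figures rest on a model that includes interaction variables. As a rough illustration of what such a term does, the Python sketch below (again, not the original Stata code) computes predictions from a linear model in which the effect of games played depends on a player group. Everything in it is hypothetical: the 0/1 indicator is_forward, the function names, and all coefficient values are made up for illustration and are not estimates from the data.

```python
def predict_goals(
    games: int,
    is_forward: int,
    b0: float = 0.0,             # hypothetical intercept
    b_games: float = 0.05,       # hypothetical effect of one extra game
    b_forward: float = 0.5,      # hypothetical shift for forwards
    b_interaction: float = 0.2,  # hypothetical extra per-game effect for forwards
) -> float:
    """Linear prediction with a games-by-position interaction term."""
    return (
        b0
        + b_games * games
        + b_forward * is_forward
        + b_interaction * games * is_forward
    )


def goals_per_extra_game(is_forward: int) -> float:
    """Slope of the prediction line: change in predicted goals per extra game."""
    return predict_goals(11, is_forward) - predict_goals(10, is_forward)


for group, flag in (("non-forward", 0), ("forward", 1)):
    print(
        f"{group}: {goals_per_extra_game(flag):.2f} goals per extra game, "
        f"{predict_goals(20, flag):.2f} predicted goals after 20 games"
    )
```

Without the interaction term, the two groups would get parallel prediction lines. With it, each group gets its own slope, so the lines are free to fan out as the number of games grows.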

Takeaways

By means of multiple regression analysis, the three variables number of games played in the season, player position, and player origin may be used to predict the number of goals a soccer player scores during a season. There are also interaction effects in play here (sorry, could not resist that one!).

That’s it for today, folks. Stay tuned for more analyses of sports and soccer data. By the way, I do all my statistical computing in Stata. But I hope to learn R someday … I also hope to learn to upload my datasets to GitHub … And to provide fancier graphs …

About me: I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you can find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
