What were the best predictors of Euro 2020?
Something I have always been fascinated by is the difference between international football and club football. As an avid football supporter, I have always sensed that the games feel different, but it is difficult to put a finger on exactly why.
Opta’s Jonny Whitmore covered this following the Euro 2020 tournament this summer in his article called 5 Key Differences Between International and Club Football. It’s a great piece that uses data to identify many of the differences football people acknowledge such as preparation time, the inability to make transfers, and player ages. One point I found particularly eye-catching was how he showed club teams have on average a more vertical playing style than international teams:
Another article which inspired me is a 2017 article by football mathematician David Sumpter titled “How important is it to have the ball?”. In his piece, he plots possession against goal difference to show that possession was in general not a good predictor for football matches, particularly for the Premier League, but also for Euro 2016. Here is an example plot from his post using data from the 15–16 and 16–17 Premier League seasons:
This made me wonder: are there any better predictors? Or is ball possession a particularly bad one? How does it compare to the almighty xG, for example?
And this is how I came up with my first publication idea: to rank the best “predictors” of Euro 2020.
I had data on Euro 2020 because I hand-recorded stats from Fotmob throughout the past summer. You can find the data here. It contains match statistics for all 51 matches of the tournament, such as possession, successful passes, xG, shots attempted, and more.
I then wrote a program that plotted each match statistic against match goal difference (as Sumpter did) and overlaid the best-fit line from a simple linear regression in red, shown below:
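For readers who want to reproduce the best-fit line, here is a minimal sketch of an ordinary least-squares fit using only the Python standard library. The function name `fit_line` and the sample numbers are my own illustrative assumptions, not the original program or real Euro 2020 values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-movement of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative values, NOT real Euro 2020 data.
# One entry per team per match: the stat, and that team's goal difference.
shots_on_target = [2, 5, 3, 7, 1, 4]
goal_difference = [-1, 1, 0, 2, -2, 0]

slope, intercept = fit_line(shots_on_target, goal_difference)
print(f"goal_difference = {slope:.2f} * shots_on_target + {intercept:.2f}")
```

A positive slope corresponds to an upward-sloping red line, a negative slope to a downward-sloping one. Drawing the scatter and the line is then a job for whatever plotting library you prefer.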
Just by looking, you can tell which variables are negatively or positively correlated with goal difference. The downward slope for Fouls Conceded, for example, means committing more fouls was associated with losing by more goals. The opposite is true for Opposition Half Passes, where making more of them was associated with winning by more goals.
This aligns well with domain knowledge. If you are committing more fouls, you are probably being caught out defensively more often than your opposition, who are likely getting into dangerous positions and drawing fouls. On the other hand, if you make more opposition half passes, you are playing on the front foot with the ball closer to your opponent’s net, where you are more likely to create chances and score goals.
This is a nice visual, but recall I wanted to rank which variables were most strongly related to winning, not just see which are positively or negatively related. Thankfully, there is a measure of how strongly two variables are linearly related, known as the Pearson correlation coefficient or Pearson’s r, that captures this.
This value ranges between -1 and 1, where -1 indicates a perfect negative relationship and 1 indicates a perfect positive relationship. For our purposes, this means the closer to 1 the r value for a particular variable is, the more strongly it is correlated with a higher goal difference (winning by more goals). The closer it is to -1, the more strongly it is correlated with a lower goal difference (losing by more goals).
However, be careful not to read r values close to 0 as an indication of drawing. Take the plot of Crosses, which is relatively flat and has an r value close to 0. A near-zero r value suggests the two variables are not linearly related.
In other words, the number of crosses teams made in Euro 2020 had essentially no relationship with winning or losing (at least linearly). It doesn’t mean more or fewer crosses were correlated with drawing. On a side note, this came as a slight surprise to me, because many goals come from crosses. Perhaps this has something to do with how Fotmob defines what a ‘Cross’ really is.
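To make the three cases concrete (r near 1, r near -1, and r near 0), here is a small sketch that computes Pearson's r from its definition. The function name `pearson_r` and the toy numbers are assumptions chosen for illustration. They are not tournament data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never changes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Toy numbers for illustration only
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # moves up in lockstep with x
falling = [-2, -4, -6, -8, -10]  # moves down in lockstep with x

u_x = [-2, -1, 0, 1, 2]
u_shape = [4, 1, 0, 1, 4]     # clearly related to u_x, but not linearly

print(pearson_r(x, rising))
print(pearson_r(x, falling))
print(pearson_r(u_x, u_shape))
```

The last case is the important one: the U-shaped data has an obvious relationship, yet its r is zero, because r only measures how well a straight line describes the data.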
Moving along, below you can find the complete table of the match statistics along with their r value and p-value. It is sorted in increasing order of r, so the stats most strongly associated with losing appear at the top left of the table and those most strongly associated with winning at the bottom right.
For those unfamiliar, a p-value is a measure of statistical significance. It is important to report this figure alongside a linear regression because it tells you how likely you would be to see a relationship at least this strong purely by chance if the two variables were in fact unrelated. A high p-value means the sample does not provide enough evidence of a relationship between the independent and dependent variables. So, to prune this list down, I removed variables with a p-value greater than 0.05 (a commonly used threshold), which results in the following list (unfortunately wiping out all the stats with a negative correlation):
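As a rough sketch of this filtering step, the snippet below approximates a two-sided p-value from r and the sample size using the Fisher z-transformation, with only the standard library. This is an approximation I am swapping in for illustration. Regression tools usually report a p-value based on the t-distribution, so the figures will differ slightly from the table. The names `approx_p_value` and `keep_significant` are hypothetical, and the Crosses value of 0.02 is a placeholder for "close to 0", not the recorded figure.

```python
import math
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-sided p-value for Pearson's r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("the approximation needs more than 3 data points")
    if abs(r) >= 1:
        return 0.0  # a perfect correlation; atanh would be undefined here
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def keep_significant(r_by_stat, n, alpha=0.05):
    """Drop every stat whose approximate p-value is greater than alpha."""
    return {
        name: r
        for name, r in r_by_stat.items()
        if approx_p_value(r, n) <= alpha
    }


n_points = 102  # data points in the Euro 2020 sample
r_by_stat = {
    "Pass success": 0.197,     # r value quoted in the article
    "Ball Possession": 0.306,  # r value quoted in the article
    "Crosses": 0.02,           # placeholder for "close to 0" (assumption)
}

for name, r in r_by_stat.items():
    print(f"{name}: r = {r}, approx p = {approx_p_value(r, n_points):.3f}")
print(keep_significant(r_by_stat, n_points))
```

With 102 data points, the cut-off sits just below r = 0.2 under this approximation, which is why a stat with an r as low as 0.197 can still pass the 0.05 threshold.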
Let’s take this one step further to the final output. Typically, statisticians categorize r-values into several “buckets”, each with different levels of the strength of relationship they reflect. For example, an r-value larger than 0.7 generally means the variables have a strong relationship. Below is a common framework for these buckets which I will adhere to:
Placing the stats into these buckets, we get the following:
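The bucketing itself is a simple threshold lookup, sketched below. Only the 0.7 cut-off for a strong relationship and the "None or very weak" and "Moderate" labels come from the text above. The 0.3 and 0.5 cut-offs and the "Weak" label are assumptions based on one commonly used scheme, and the function name `strength_bucket` is hypothetical.

```python
def strength_bucket(r):
    """Map an r value to a strength label using ASSUMED cut-offs."""
    size = abs(r)  # strength ignores the direction of the relationship
    if size > 0.7:
        return "Strong"             # cut-off stated in the article
    if size > 0.5:
        return "Moderate"           # assumed cut-off
    if size > 0.3:
        return "Weak"               # assumed cut-off and label
    return "None or very weak"


for r in (0.197, 0.306, 0.8):
    print(r, strength_bucket(r))
```

Taking the absolute value first means a strongly negative stat would land in the same bucket as an equally strong positive one, which matters less here since no negative correlations survived the p-value filter.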
At last, we have ranked the stats which were the least and most correlated with goal difference for Euro 2020. Admittedly, the sample size is small (102 data points, one for each of the two teams in each of the 51 matches), but that is the nature of analyzing a single international tournament for a project. Before reading on to my thoughts, take a minute yourself to see what aligns or misaligns with your understanding of the tournament.
My thoughts
In my opinion, the low ranking of the passing stats is surprising. Pass success, for instance, is not only in the ‘None or very weak’ category, but also has the lowest r-value of all the statistically significant variables (0.197). The total Number Passes and Accurate Passes stats are also in that category.
All of these passing stats are actually ranked lower than Ball Possession (0.306), which I found surprising because we often hear that how well you use the ball is more important than how much of it you have. So why does it seem to be the other way around in the data? Perhaps it is the sample size, but more likely these passing stats are simply not a great proxy for how well a team uses the ball.
My next article, Evaluating Passes, will be related to this matter.
Another neat aspect in my eyes was the progression of shooting statistics through the strength rankings. Total shots, the ‘broadest’ shooting stat, falls in the weakest category. Shots inside the box, a more specific stat relative to scoring goals, falls in the category above it. We then see Shots on Target, the shooting stat “closest” to scoring, in the Moderate category. I find this aligns quite well with domain knowledge.
Lastly, I noted that xG appears in a higher ranking than the Chances Created stat. This gives me some added confidence in Fotmob’s xG model, since xG is supposed to be a measure of the quality of those chances.
Conclusion
To reiterate, my goal with this project was to get started interrogating Euro 2020 data using statistics. While discovering a surprising result would have been great, it was not my primary goal. After all, we know there is no silver bullet for winning football matches, so it would be unreasonable to expect a single stat to appear as such. This was simply my first attempt at building up a picture of what was going on in football data that aligns with footballing knowledge.
As Sumpter puts it at the end of his article, “it isn’t the numbers themselves that allow you to understand football, but the way you use statistics to understand the numbers behind the game”.
Thanks for reading! Any advice or feedback is greatly appreciated.