DATA STORIES | SOCCER ANALYTICS | KNIME ANALYTICS PLATFORM
Was Spain’s win at Euro 2024 predictable?
An evaluation of the top teams of Euro 2024 using linear regression in KNIME, showing how this approach outperforms the FIFA Ranking and provides a more accurate method for rating soccer teams
Spain won its fourth men’s UEFA European Championship title by breaking English hearts with a 2–1 win over the Three Lions in the 2024 final. A late goal by substitute Mikel Oyarzabal clinched the victory for the Spanish at the Olympiastadion in Berlin on Sunday 14 July, 2024.
The title victory seemed well deserved, as Spain won practically all of its matches. But what were Spain’s chances of winning before the tournament?
The FIFA Ranking
If we look at the FIFA Ranking before the tournament, restricted to the European (UEFA) teams, we see that France was in first place, followed by Belgium and England. Spain was only sixth, behind the Netherlands.
This ranking is not bad at predicting the winners of the knockout-stage games of Euro 2024.
By always picking the team with the better ranking, you would get 11 of 15 games right, for a hit rate of 73%. But you would fail to pick Spain as the winner of Euro 2024.
The FIFA ranking is sponsored by Coca-Cola; as such, the FIFA/Coca-Cola World Ranking name is also used.
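The FIFA Ranking is based on the Elo rating system, which is also used in chess to rate players. After each game, points are added to or subtracted from a team’s rating according to the formula:
P = P_before + I × (W − W_e)
Here P_before is the team’s rating before the match, I is the importance of the match, W is the actual result (1 for a win, 0.5 for a draw, 0 for a loss) and W_e is the expected result.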
Points are awarded or deducted after each match based on the result, the relative strength of the two teams (using the Elo formula), and the match’s importance. (Friendlies have less importance than games played in tournaments or qualifying games.)
A weak team can defeat a strong team in a surprise result and thus greatly increase its rating: if W = 1 (win) while the expected result W_e is close to 0, the difference W − W_e, and with it the number of points added, will be large.
In short, the Elo rating system updates team ratings based on match outcomes relative to expectations.
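The update rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the scaling constant of 600 in the expected result is taken from the publicly documented FIFA formula rather than from this article, and the ratings and the importance value of 25 are hypothetical.

```python
def expected_result(rating_diff: float) -> float:
    """Expected result W_e for a team that leads its opponent by rating_diff points.

    Assumption: the FIFA variant of Elo with a scaling constant of 600.
    """
    return 1.0 / (10.0 ** (-rating_diff / 600.0) + 1.0)


def updated_points(p_before: float, p_opponent: float,
                   result: float, importance: float) -> float:
    """Apply P = P_before + I * (W - W_e) for one team after one match."""
    w_e = expected_result(p_before - p_opponent)
    return p_before + importance * (result - w_e)


# Hypothetical ratings: an underdog (1500) beats a favorite (1800).
underdog_after = updated_points(1500.0, 1800.0, result=1.0, importance=25.0)
favorite_after = updated_points(1800.0, 1500.0, result=0.0, importance=25.0)

print(f"underdog: 1500 -> {underdog_after:.2f}")
print(f"favorite: 1800 -> {favorite_after:.2f}")
```

Because the underdog’s expected result is well below 0.5, the win is worth far more to it than the same win would be to the favorite.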
The Linear Regression Rating
In sports analytics there is another proven way to rate teams: the linear regression method.
Team ratings are calculated by minimizing the difference between predicted and actual results.
The goal here is to find the optimal solution for all equations derived from the game results by minimizing the errors.
The FIFA Ranking, like the Elo rating, adjusts team ratings after each match based on the result relative to expectations. A linear regression rating instead fits all games at once: it models the relationship between a matrix representing the teams (with 1 for the home team, -1 for the away team, and 0 otherwise) and the score difference.
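Written out, the idea could look like this. The notation is my own shorthand and not taken from the original formulas: y_g is the home-minus-away score difference of game g, r_i is the rating of team i, and x_{g,i} is the matrix entry (1, -1 or 0) for team i in game g.

```latex
y_g = \sum_{i} x_{g,i}\, r_i + \varepsilon_g
    = r_{\mathrm{home}(g)} - r_{\mathrm{away}(g)} + \varepsilon_g,
\qquad
\hat{r} = \arg\min_{r} \sum_{g} \Bigl( y_g - \sum_{i} x_{g,i}\, r_i \Bigr)^{2}
```

Each game contributes one equation, and the ratings are the values that make the squared errors over all games as small as possible.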
Building the dataset
But before we can start to calculate our ratings we need to choose the right dataset. To do this, we need to know how the tournament runs and when it started.
There are two phases of the tournament. In the qualifying phase, which began on March 23, 2023, 53 teams played for the 23 available spots. Germany, as host of Euro 2024, qualified automatically and did not have to play any qualifying games.
The group phase of the tournament began on June 14, 2024 in Germany, and the tournament ended after the knockout phase with the final between Spain and England on July 14.
The goal is to calculate the rating of the teams from the games played between the start of qualifying and the start of the tournament.
If we look more closely at the qualifying, we see that the 53 teams were divided into ten groups. The games were played in a home-and-away, round-robin format.
20 teams, made up of the 10 group winners and 10 runners-up, qualified directly for Euro 2024 (teams on green background). The 3 remaining spots were decided in play-offs, for which teams were selected based on their performance in the UEFA Nations League 2022–2023 (teams on blue background). Finally, as host of Euro 2024, Germany did not have to play any qualifying matches.
We therefore have two issues:
- we don’t have any games to rate Germany
- every team only plays against teams from its own group
So how do we rate Germany, and how do we compare members of different groups? The answer is to include friendlies!
Now we not only have a link between Germany and teams from Group B, but we also have an indirect link between teams from Group B and Group F via the games against Germany.
So let’s extend our formula with the term f, which implements the weight for friendly games. Since friendlies have less importance, they will have less weight in the calculation of our ratings.
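One way to write the extended model is shown below. This is a sketch under an assumption: f enters as the coefficient of a 0/1 friendly indicator F_g, which matches how the friendly column is coded in this article, and h is the home advantage. The symbols h, F_g and r are my own shorthand.

```latex
y_g = h + f\, F_g + r_{\mathrm{home}(g)} - r_{\mathrm{away}(g)} + \varepsilon_g,
\qquad
F_g =
\begin{cases}
1 & \text{if game } g \text{ is a friendly} \\
0 & \text{otherwise}
\end{cases}
```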
Then we have to transform our game results table so that we can solve our linear system of equations:
- For the home advantage we add a column that is set to 1.
- For friendlies we add a column that is 1 for a friendly and 0 otherwise.
We then code the teams in the following way:
- 1 if the team plays at home
- -1 if the team plays away
- 0 if the team didn’t play
The resulting coefficients of the team columns in the X matrix provide the final ratings of the teams.
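The coding above can be tried out on a toy example in plain Python. This is a minimal sketch and not the KNIME workflow: the names `Game`, `design_row` and `solve_least_squares` are made up for this illustration, the six games are invented, and one team is used as a reference with rating 0 (my own choice, because the +1/-1 coding only determines ratings up to a constant).

```python
from typing import NamedTuple


class Game(NamedTuple):
    home: str
    away: str
    home_goals: int
    away_goals: int
    friendly: bool


def design_row(game: Game, rated_teams: list[str]) -> list[float]:
    """One matrix row: home advantage, friendly flag, then +1/-1/0 per team."""
    row = [1.0, 1.0 if game.friendly else 0.0]
    for team in rated_teams:
        if team == game.home:
            row.append(1.0)
        elif team == game.away:
            row.append(-1.0)
        else:
            row.append(0.0)
    return row


def solve_least_squares(rows: list[list[float]], targets: list[float]) -> list[float]:
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [x - factor * p for x, p in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Invented results, for illustration only.
games = [
    Game("Team A", "Team B", 2, 0, False),
    Game("Team B", "Team A", 1, 1, False),
    Game("Team B", "Team C", 2, 0, False),
    Game("Team C", "Team A", 0, 2, True),
    Game("Team A", "Team C", 3, 1, True),
    Game("Team C", "Team B", 0, 0, False),
]

teams = sorted({g.home for g in games} | {g.away for g in games})
reference, rated = teams[-1], teams[:-1]  # the reference team gets rating 0

rows = [design_row(g, rated) for g in games]
targets = [float(g.home_goals - g.away_goals) for g in games]
coefficients = solve_least_squares(rows, targets)

home_advantage, friendly_effect = coefficients[0], coefficients[1]
ratings = dict(zip(rated, coefficients[2:]))
ratings[reference] = 0.0

print("home advantage:", round(home_advantage, 3))
print("friendly effect:", round(friendly_effect, 3))
for team, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(team, round(rating, 3))
```

Only the differences between ratings matter here, so picking another reference team shifts all ratings by the same amount without changing the order.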
Let’s now wrap everything in KNIME!
We open the KNIME Workflow on my KNIME Hub:
- In KNIME, we first load the games. The data comes from a Kaggle dataset that collects all international soccer results from 1872 to the present day. It has one part with the results and one part with the shootouts, which we join together.
- Then we keep only the games played between the start of qualifying and the start of the tournament.
- We add a self-made table with the teams and their FIFA Confederation because we want to choose only games that have either the home-team or the away-team in the UEFA Confederation.
- Finally we build the matrix for the linear regression learner and calculate the ratings.
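For readers who prefer code to nodes, the filtering steps in this list could be approximated as follows. This is a sketch, not the KNIME workflow: the column names are my assumption about the layout of the results file, the sample rows and the team set are invented, and the shootout join is left out.

```python
import csv
import io
from datetime import date

QUALIFYING_START = date(2023, 3, 23)
TOURNAMENT_START = date(2024, 6, 14)

# Invented sample rows in the column layout assumed for the results file.
SAMPLE_CSV = """date,home_team,away_team,home_score,away_score,tournament
2022-11-20,Team A,Team B,1,0,Friendly
2023-03-23,Team A,Team B,2,1,UEFA Euro qualification
2023-09-09,Team C,Team X,0,0,Friendly
2023-10-12,Team X,Team Y,3,1,Friendly
2024-06-14,Team A,Team C,5,1,UEFA Euro
"""

# Stand-in for the self-made table of teams and confederations.
UEFA_TEAMS = {"Team A", "Team B", "Team C"}


def select_games(csv_text: str, uefa_teams: set[str]) -> list[dict]:
    """Keep games inside the date window that involve at least one UEFA team."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        played = date.fromisoformat(row["date"])
        in_window = QUALIFYING_START <= played < TOURNAMENT_START
        involves_uefa = (row["home_team"] in uefa_teams
                         or row["away_team"] in uefa_teams)
        if in_window and involves_uefa:
            selected.append({
                "home": row["home_team"],
                "away": row["away_team"],
                # Target of the regression: home score minus away score.
                "dfres": int(row["home_score"]) - int(row["away_score"]),
                "friendly": 1 if row["tournament"] == "Friendly" else 0,
            })
    return selected


games = select_games(SAMPLE_CSV, UEFA_TEAMS)
for game in games:
    print(game)
```

Of the five sample rows, only the two inside the window that involve a UEFA team survive the filter.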
In the Linear Regression node we set the score difference (dfres) as the target column. As input columns we use the column with the home-field advantage, the friendly column and the columns of all teams that have played against each other.
We can see that the predicted results correlate well with the actual results, with a correlation of 0.77.
The home-field (HF) advantage has a coefficient of 0.44, which means that if the teams are equally strong, the home team is expected to score 0.44 goals more than the away team. And in friendly games the away team is 0.177 goals better than usual.
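To make the two coefficients concrete, here is a small sketch of how a predicted score difference could be computed from them. The function name and the team ratings are hypothetical, and the negative sign of the friendly term is my reading of "the away team is 0.177 better" in a home-minus-away difference.

```python
HOME_ADVANTAGE = 0.44     # coefficient reported in the article
FRIENDLY_EFFECT = -0.177  # assumption: lowers the home-minus-away difference


def predicted_difference(home_rating: float, away_rating: float,
                         friendly: bool) -> float:
    """Expected home-minus-away goal difference under the fitted model."""
    friendly_term = FRIENDLY_EFFECT if friendly else 0.0
    return HOME_ADVANTAGE + friendly_term + home_rating - away_rating


# Two equally strong teams (identical, hypothetical ratings).
competitive = predicted_difference(1.0, 1.0, friendly=False)
friendly = predicted_difference(1.0, 1.0, friendly=True)

print(f"competitive game: {competitive:.3f}")
print(f"friendly game:    {friendly:.3f}")
```

For equally strong teams the prediction reduces to the home advantage alone, and in a friendly that advantage shrinks by the friendly term.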
With this model we get the following team ranking table: Spain is in first place, followed by France and Portugal. England is in fifth place and the Netherlands in seventh.
If we use these rankings to predict the winner of the KO stage again, this time we are correct in 13 out of 15 cases for a hit rate of 87%.
And most of all: we are now picking Spain as the winner of the Euro!
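The hit rate used for this comparison is simple to compute: pick the higher-rated team in every knockout game and count how often that pick was the actual winner. The sketch below uses invented ratings and results and hypothetical function names. It only illustrates the calculation and does not reproduce the Euro 2024 bracket.

```python
def pick(ratings: dict[str, float], team_a: str, team_b: str) -> str:
    """Return the team with the higher rating (team_a on ties)."""
    return team_a if ratings[team_a] >= ratings[team_b] else team_b


def hit_rate(ratings: dict[str, float],
             matches: list[tuple[str, str, str]]) -> float:
    """Share of matches whose actual winner is the higher-rated team."""
    hits = sum(
        1 for team_a, team_b, winner in matches
        if pick(ratings, team_a, team_b) == winner
    )
    return hits / len(matches)


# Invented ratings and knockout results, for illustration only.
ratings = {"Team A": 1.2, "Team B": 0.8, "Team C": 0.5, "Team D": 0.1}
matches = [
    ("Team A", "Team D", "Team A"),
    ("Team B", "Team C", "Team C"),  # upset: the lower-rated team wins
    ("Team A", "Team C", "Team A"),
]

print(f"toy hit rate: {hit_rate(ratings, matches):.0%}")
print(f"11 of 15: {11 / 15:.0%}, 13 of 15: {13 / 15:.0%}")
```

To use a ranking position instead of a rating, where a lower number is better, the positions can be negated before they are passed in.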
We’ve seen that our linear regression ratings have performed better than the FIFA Ranking. But how would we have done compared to the bookmakers’ odds?
Spain was not at the top of the bookmakers’ list either: it was in sixth place, as in the FIFA Ranking. England was the favorite, perhaps also because many bettors believed England would get its revenge after the defeat in the final of the last European Championship.
But the math knew better.
Nevertheless, this system is still not perfect. Certain surprises like Switzerland and Turkey were not correctly predicted. In addition, all games were weighted equally regardless of their recency.
There is certainly still room for improvement, which we will explore in more detail at a later date.
Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.
Material for this project:
- KNIME-workflow: KNIME-Hub