DATA STORIES | SOCCER ANALYTICS | KNIME ANALYTICS PLATFORM

Was Spain’s win at Euro 2024 predictable?

An evaluation of the top teams of Euro 2024 using Linear Regression in KNIME to see how this approach outperforms FIFA ranking and provides a more accurate method for rating soccer teams

Dennis Ganzaroli
Low Code for Data Science


Fig. 1: Data Scientist predicts the victory of Spain (image by author with Image Creator in Bing).

Spain won its fourth men’s UEFA European Championship title by breaking English hearts with a 2–1 win over the Three Lions in the 2024 final. A late goal by substitute Mikel Oyarzabal clinched the victory for the Spanish at the Olympiastadion in Berlin on Sunday 14 July, 2024.

The title victory seemed well deserved, as they won practically all of their matches. But what were Spain’s chances of winning before the tournament?

The FIFA Ranking

If we look at the FIFA Ranking before the tournament, restricted to the UEFA teams, we see that France was in first place, followed by Belgium and England. Spain was only sixth, behind the Netherlands.

Fig. 2: FIFA Ranking of UEFA Teams before the tournament (image by author from FIFA.com).

This ranking is not so bad if you use it to predict the winner of the Knockout Stage of Euro 2024.

By taking the team with the better ranking, you would pick 11 of 15 games correctly, for a hit rate of 73%. But you would fail to pick Spain as the winner of Euro 2024.
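The picking rule is simple enough to write down in a few lines. Here is a minimal Python sketch; the function name `hit_rate` and the toy teams A to D are made up for illustration and are not the real Euro 2024 bracket. Only the 11-of-15 figure comes from the text.

```python
def hit_rate(rank, matches):
    """Share of matches where the better-ranked team (lower rank number) won.

    rank:    dict mapping team name -> rank position (1 = best)
    matches: list of (team_a, team_b, winner) tuples
    """
    hits = 0
    for team_a, team_b, winner in matches:
        pick = team_a if rank[team_a] < rank[team_b] else team_b
        hits += pick == winner
    return hits / len(matches)


# Toy data with made-up teams, NOT the real Euro 2024 bracket.
toy_rank = {"A": 1, "B": 2, "C": 3, "D": 4}
toy_matches = [("A", "D", "A"), ("B", "C", "C"), ("A", "C", "A")]
toy_rate = hit_rate(toy_rank, toy_matches)  # 2 of 3 picks are correct

# The figure quoted in the text: 11 correct picks out of 15 knockout games.
fifa_rate = 11 / 15
print(f"{fifa_rate:.0%}")  # 73%
```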

Fig. 3: Picking Winners of KO-Stage with the FIFA Ranking (image by author from Wikipedia).

The FIFA ranking is sponsored by Coca-Cola; as such, the FIFA/Coca-Cola World Ranking name is also used.

The FIFA Ranking is based on the Elo rating system, which is also used in chess to rate players. After each game, points are added to or subtracted from a team’s rating according to the formula:

Fig. 4: Formula to calculate points for FIFA Ranking (image adapted from FIFA Ranking).

Points are awarded or deducted after each match based on the result, the relative strength of the two teams (using the Elo formula), and the match’s importance. (Friendlies have less importance than games played in tournaments or qualifying games.)

A weak team can greatly increase its rating by defeating a strong team in a surprise result: if W = 1 (win) and We is close to 0, then the difference W - We is large, and so is the number of points added.

The Elo rating system updates the teams’ ratings based on match outcomes relative to expectations.
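To make the update rule concrete, here is a minimal Python sketch of an Elo-style update in the spirit of the formula above. The function names, the example ratings and the importance value of 25 are made up for illustration. The scaling constant of 600 is what FIFA's published formula uses as far as I know, and special cases such as penalty shoot-outs are ignored.

```python
def expected_result(rating, opponent_rating, scale=600):
    """Expected result We (between 0 and 1) from the rating difference."""
    dr = rating - opponent_rating
    return 1 / (10 ** (-dr / scale) + 1)


def update_rating(rating, opponent_rating, result, importance):
    """New rating after one match: P = P_before + I * (W - We).

    result is W: 1 for a win, 0.5 for a draw, 0 for a loss.
    """
    we = expected_result(rating, opponent_rating)
    return rating + importance * (result - we)


# Hypothetical ratings: an underdog (1500) and a favourite (1800).
underdog_gain = update_rating(1500, 1800, 1, 25) - 1500
favourite_gain = update_rating(1800, 1500, 1, 25) - 1800

print(round(underdog_gain, 1))   # about 19 points for the upset
print(round(favourite_gain, 1))  # about 6 points for the expected win
```

The upset is worth roughly three times as many points as the expected win, which is exactly the "surprise" effect described above.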

The Linear Regression Rating

In sports analytics there is another proven way to rate teams: the linear regression method.

Team ratings are calculated by minimizing the difference between predicted and actual results.

Fig. 5: Formula to calculate Linear Regression Rating (image by author).

The goal here is to find the optimal solution for all equations derived from the game results by minimizing the errors.

The FIFA Ranking, like the Elo rating, adjusts team ratings based on match results relative to expectations. A linear regression rating, by contrast, predicts team performance by modeling the relationship between the score difference and a matrix representing the teams (with 1 for the home team, -1 for the away team, and 0 otherwise).

Building the dataset

But before we can start to calculate our ratings we need to choose the right dataset. To do this, we need to know how the tournament runs and when it started.

Fig. 6: Different stages of Euro 2024 (image by author).

The tournament has two phases. In the qualifying phase, which began on March 23, 2023, 53 teams played for the 23 available spots. As host of Euro 2024, Germany qualified automatically and did not have to play any qualifying games.

The group phase of the tournament began on June 14, 2024 in Germany, and the tournament ended after the knockout phase with the final between Spain and England on July 14.

The goal is to calculate the rating of the teams from the games played between the start of qualifying and the start of the tournament phase.

If we look deeper into the qualifying phase, we see that the 53 teams were divided into ten groups. The games were played in a home-and-away, round-robin format.

Fig. 7: Qualifying for Euro 2024 (image by Wikipedia).

20 teams, the 10 group winners and the 10 runners-up, qualified directly for Euro 2024 (teams on green background). 3 additional teams qualified through the play-offs, which were contested by teams selected on the basis of their performance in the UEFA Nations League 2022–2023 (teams on blue background). Finally, as host of Euro 2024, Germany did not have to play any qualifying matches.

We therefore have two issues:

  1. we don’t have any games to rate Germany
  2. every team only plays against teams from its own group

Fig. 8: Network of Qualifying games (image by author).

So how do we rate Germany, and how do we compare teams from different groups? The answer is to include friendlies!

Now we not only have a link between Germany and teams from Group B, but we also have an indirect link between teams from Group B and Group F via the games against Germany.
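Whether all teams are linked, directly or indirectly, can be checked by treating the games as a network and counting its connected components. The following Python sketch does this with a plain breadth-first search. The team names (B1, F1, Host and so on) are placeholders and not the real fixtures.

```python
from collections import defaultdict, deque


def connected_components(games):
    """Group teams that are linked, directly or indirectly, by games played."""
    graph = defaultdict(set)
    for home, away in games:
        graph[home].add(away)
        graph[away].add(home)

    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            team = queue.popleft()
            if team in component:
                continue
            component.add(team)
            queue.extend(graph[team] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical teams: two qualifying groups that never meet each other.
qualifiers = [("B1", "B2"), ("B2", "B3"), ("F1", "F2"), ("F2", "F3")]
# Hypothetical friendlies of a host team that played no qualifiers.
friendlies = [("Host", "B1"), ("Host", "F1")]

print(len(connected_components(qualifiers)))               # 2 separate islands
print(len(connected_components(qualifiers + friendlies)))  # 1 connected network
```

Only when the network forms a single component can the ratings of all teams be compared on one common scale.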

Fig. 9: Including friendly games (image by author).

So let’s adjust our formula with the term f, which implements the weight for the friendly games. Since friendlies have less importance, they will have less weight in the calculation of our ratings.
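One plausible way to write the adjusted model, consistent with the columns described in this article, is the following. The symbols are my own notation for illustration and may differ from the figure: Δ_g is the score difference (home minus away) of game g, H_g is the home advantage indicator, F_g is 1 for a friendly and 0 otherwise, r are the team ratings, and ε_g is the error we minimize.

```latex
\Delta_g = h \cdot H_g + f \cdot F_g + r_{\mathrm{home}(g)} - r_{\mathrm{away}(g)} + \varepsilon_g,
\qquad
\min_{h,\, f,\, r} \sum_g \varepsilon_g^{2}
```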

Fig. 10: Building the matrix for the Linear Regression model (image by author).

Then we have to transform our game results table so that we can solve our linear system of equations:

  • For the home advantage we add a column with a 1.
  • For friendlies we add a column with a 1 for a friendly and a 0 otherwise.

We then code the teams in the following way:

  • 1 if the team plays at home
  • -1 if the team plays away
  • 0 if the team didn’t play

The resulting coefficients of the x-matrix provide the final rating of the teams.
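The coding scheme above can be tried out on a handful of games. Here is a minimal sketch with NumPy, where `numpy.linalg.lstsq` stands in for the regression learner. The teams A, B and C and their scores are invented toy data, constructed without noise so that the fit is exact.

```python
import numpy as np

# Toy results with made-up teams: (home, away, home_goals, away_goals, is_friendly)
games = [
    ("A", "B", 2, 0, 0),
    ("B", "C", 3, 1, 0),
    ("C", "A", 0, 2, 1),
    ("B", "A", 1, 1, 0),
    ("A", "C", 2, 0, 1),
]
teams = sorted({t for g in games for t in g[:2]})
col = {team: i + 2 for i, team in enumerate(teams)}  # columns 0 and 1 are reserved

X = np.zeros((len(games), 2 + len(teams)))
y = np.zeros(len(games))
for row, (home, away, hg, ag, friendly) in enumerate(games):
    X[row, 0] = 1           # home advantage column
    X[row, 1] = friendly    # 1 for friendlies, 0 otherwise
    X[row, col[home]] = 1   # home team
    X[row, col[away]] = -1  # away team
    y[row] = hg - ag        # score difference (home minus away)

# Least-squares solution. Ratings are only defined up to a common shift;
# lstsq returns the minimum-norm solution, which here centres them on zero.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
home_adv, friendly_effect = coef[0], coef[1]
ratings = dict(zip(teams, coef[2:]))

for team, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {r:+.2f}")
```

Because every game row contains one +1 and one -1 in the team columns, only rating differences are identified, so the absolute level of the ratings carries no meaning, only the gaps between teams.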

Let’s now wrap everything in KNIME!

Fig. 11: The KNIME Workflow for calculating the ratings (image by author)

We open the KNIME Workflow on my KNIME Hub:

  1. In KNIME, we first load the games. The data comes from Kaggle and covers all international soccer results from 1872 to the present day. It has one part with the results and one part with the shootouts, which we join together.
  2. Then we keep only the games played between the start of qualifying and the start of the tournament.
  3. We add a self-made table with the teams and their FIFA Confederation because we want to choose only games that have either the home-team or the away-team in the UEFA Confederation.
  4. Finally we build the matrix for the linear regression learner and calculate the ratings.
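For readers who prefer code over nodes, the filtering steps can be sketched in pandas as follows. This is not the KNIME workflow itself: the small in-memory table stands in for the Kaggle file, the column names and the "Friendly" label are assumptions about that dataset, the teams T1 to T6 and the confederation table are invented, and the shootout join is left out.

```python
import pandas as pd

# Stand-in for the Kaggle results table; the column names are an assumption.
results = pd.DataFrame(
    {
        "date": ["2022-11-20", "2023-03-23", "2023-06-17", "2024-03-26", "2024-06-14"],
        "home_team": ["T1", "T2", "T3", "T5", "T1"],
        "away_team": ["T2", "T3", "T4", "T6", "T3"],
        "home_score": [1, 2, 0, 1, 3],
        "away_score": [0, 2, 1, 1, 1],
        "tournament": ["Friendly", "UEFA Euro qualification", "Friendly", "Friendly", "UEFA Euro"],
    }
)
results["date"] = pd.to_datetime(results["date"])

# Hypothetical hand-made lookup table: team -> confederation.
confed = pd.DataFrame(
    {
        "team": ["T1", "T2", "T3", "T4", "T5", "T6"],
        "confederation": ["UEFA", "UEFA", "UEFA", "CONMEBOL", "CAF", "AFC"],
    }
)

# Step 2: keep games from the start of qualifying up to the day before kick-off.
start, kickoff = pd.Timestamp("2023-03-23"), pd.Timestamp("2024-06-14")
window = results[(results["date"] >= start) & (results["date"] < kickoff)]

# Step 3: keep games where at least one side belongs to UEFA.
uefa = set(confed.loc[confed["confederation"] == "UEFA", "team"])
games = window[window["home_team"].isin(uefa) | window["away_team"].isin(uefa)].copy()

# Columns needed later for the regression.
games["dfres"] = games["home_score"] - games["away_score"]
games["friendly"] = (games["tournament"] == "Friendly").astype(int)
print(games[["home_team", "away_team", "dfres", "friendly"]])
```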

Fig. 12: Calculating the ratings with the Linear Regression Learner (image by author).

In the Linear Regression Learner node we set the score difference (dfres) as the target column. As input columns we use the home-field advantage, the friendly indicator and the columns of all teams that have played against each other.

We can see that the predicted results correlate very well with the actual results, with a correlation of 0.77.

The home-field (HF) advantage has a coefficient of 0.44, which means that if the two teams are equally strong, the home team is expected to be 0.44 goals better. And in friendly games the away team is 0.177 goals better than usual.
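Read this way, the fitted model turns two ratings into an expected score difference. A small Python sketch follows; the function name and the placeholder rating of 1.0 are made up, and the negative sign of the friendly coefficient is my reading of the sentence above rather than a value taken from the workflow.

```python
HOME_ADVANTAGE = 0.44     # coefficient quoted in the text
FRIENDLY_EFFECT = -0.177  # assumed sign: friendlies favour the away side


def predicted_margin(home_rating, away_rating, friendly=False):
    """Expected score difference (home minus away) under the fitted model."""
    margin = HOME_ADVANTAGE + home_rating - away_rating
    if friendly:
        margin += FRIENDLY_EFFECT
    return margin


# Two equally rated teams (the rating value 1.0 is a placeholder).
print(round(predicted_margin(1.0, 1.0), 3))                 # 0.44
print(round(predicted_margin(1.0, 1.0, friendly=True), 3))  # 0.263
```

A positive margin picks the home team, a negative one the away team, which is all we need to pick winners from the ratings.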

With this model we get the following team ranking table: Spain is in first place, followed by France and Portugal. England is in fifth place and the Netherlands in seventh.

If we use these rankings to predict the winner of the KO stage again, this time we are correct in 13 out of 15 cases for a hit rate of 87%.

And most of all: we are now picking Spain as the winner of the Euro!

Fig. 13: Picking winners of KO-Stage with new Rankings (image by author).

We’ve seen that our linear regression ratings have performed better than the FIFA Ranking. But how would we have done compared to the bookmakers’ odds?

Spain was not at the top of the bookmakers’ list either: it was in 6th place, as in the FIFA Ranking. England was the favorite, perhaps also because many bettors believed England would get its revenge after the defeat in the final of the last European Championship.
But the math knew better.

Fig. 14: Bookmakers Pre-tournament Odds for Euro 2024 (image by author).

Nevertheless, this system is still not perfect. Certain surprises like Switzerland and Turkey were not correctly predicted. In addition, all games were weighted equally regardless of their recency.

There is certainly still room for improvement, which we will explore in more detail at a later date.

Thanks for reading and may the Data Force be with you!
Please feel free to share your thoughts or reading tips in the comments.

Follow me on Medium, LinkedIn or Twitter and follow my Facebook Group “Data Science with Yodime”.


Dennis Ganzaroli
Low Code for Data Science

Data Scientist with over 20 years of experience. Degree in Psychology and Computer Science. KNIME COTM 2021 and Winner of KNIME Best blog post 2020.