Predicting FIFA 2022 World Cup with Machine Learning

Sergio Pessoa
Published in LatinXinAI · 11 min read · Nov 18, 2022

With the FIFA 2022 World Cup approaching, interest and discussion about which team will win the championship keep growing. So I decided to test my Data Science and Machine Learning skills by developing a model that analyzes historical data to simulate every game of the FIFA 2022 World Cup.

Data Source

In order to build a Machine Learning model, we need data about the teams. First, we need something that captures each team’s performance, which can be extracted from previous games. I also decided to use the FIFA rankings when building the features; they help quantify the quality of the opponent a team faced in a given game. Both datasets can be found on Kaggle: past games and FIFA rankings.

Dataset building

Which features can influence the winner of a football game? This question has a very open answer. From the selected players to the temperature at the stadium on the day, everything can affect the outcome. I chose to build a dataset using only past stats from each team involved in the game, prioritizing quantifiable stats that can be gathered in a simple way, such as goals scored, mean ranking of opponents faced, points won, and others that will be detailed below. These data come from joining the two datasets mentioned in the previous section.

Also, only performance during the 2022 World Cup cycle will be analyzed. The idea is to take into account only the variation in performance during the preparation for this World Cup.

import pandas as pd

df = pd.read_csv("games/results.csv")  # games between national teams
df["date"] = pd.to_datetime(df["date"])
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)  # games in the 2022 WC cycle
df_wc = df  # pre-WC outcomes

rank = pd.read_csv("fifa_ranking-2022-10-06.csv")  # FIFA rankings
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)  # rankings from the 2022 WC cycle
rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")  # fixing the names of some national teams
# resample the rankings to a daily frequency so every game date has a ranking
rank = rank.set_index(["rank_date"]).groupby(["country_full"], group_keys=False).resample("D").first().fillna(method="ffill").reset_index()
rank_wc = rank  # dataframe with rankings

# merging the rankings of the home and away teams into the games dataframe
df_wc_ranked = df_wc.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
df_wc_ranked = df_wc_ranked.merge(rank[["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)

Above is the code that creates the desired dataset.

View of the database

The dataset is ready, as you can see above (the view is cut off, and there are more columns with the away team’s information). It’s important to remember that most games between national teams are played at a neutral venue, but I’ll use the names “home team” and “away team” to simplify how I refer to the two teams involved.

Feature Development

Now, I need a set of candidate features. With them, I can run an analysis that indicates whether each feature has predictive power, and decide whether to keep, remove, or modify it. The candidate features I calculated were:

  • Average goals scored during the World Cup cycle and in the last 5 games.
  • Average goals suffered during the World Cup cycle and in the last 5 games.
  • Difference in FIFA ranking position between the two teams.
  • Average FIFA ranking of the opponents faced during the World Cup cycle and in the last 5 games.
  • Increase in FIFA ranking points between the first game of the cycle and now.
  • Increase in FIFA ranking points between 5 games ago and now.
  • Average game points won during the World Cup cycle and in the last 5 games.
  • Average game points won, weighted by the ranking position of the opponents faced, during the World Cup cycle and in the last 5 games.
  • Categorical variable indicating whether the game was a friendly or not.

The first two features quantify a team’s offensive and defensive power. The difference in FIFA ranking position quantifies the difference in strength between the two teams as measured by FIFA. The average ranking faced brings into the analysis the strength of the opponents a team has played against.

The increase in FIFA ranking points was calculated to measure how much a team’s quality improved over the World Cup cycle and over the last 5 games.

The average game points won purely quantifies a team’s performance, while the weighted version weights those game points by the ranking position of the opponent faced, in order to check whether a team’s strong performance is only due to low-level opponents.
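None of this feature-engineering code is shown here; below is a minimal sketch of how the first two features (average goals scored over the full cycle and over the last 5 games) could be computed, assuming the home_score and away_score columns from results.csv:

import pandas as pd

# a sketch, not the article's exact code: average goals scored by each team
# over the whole cycle and over its last 5 games, using only past games
home = df_wc_ranked[["date", "home_team", "home_score"]].rename(
    columns={"home_team": "team", "home_score": "goals"})
away = df_wc_ranked[["date", "away_team", "away_score"]].rename(
    columns={"away_team": "team", "away_score": "goals"})
team_games = pd.concat([home, away]).sort_values("date").reset_index(drop=True)

grouped = team_games.groupby("team")["goals"]
# cycle-to-date mean and rolling mean of the last 5 games, shifted so that
# each row only looks at games played *before* it
team_games["goals_mean"] = grouped.transform(lambda g: g.shift().expanding().mean())
team_games["goals_mean_l5"] = grouped.transform(lambda g: g.shift().rolling(5, min_periods=1).mean())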

To build features based on the FIFA rankings, there are two choices: use FIFA ranking points or use FIFA ranking position. I opted to use the ranking position in all features except the ranking-points increase. I didn’t create duplicate versions of each feature swapping ranking position for ranking points, because these two columns are strongly negatively correlated, as you can see below.

Correlation of the ranking columns

So it doesn’t make sense to create two versions of a feature such as the game points weighted by ranking, one using ranking points and the other using ranking positions, because they would yield essentially the same result.
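Checking that correlation is a one-liner; a quick sketch using the rank and total_points columns merged earlier:

# correlation between FIFA ranking position and FIFA ranking points
# (strongly negative: better-ranked teams have more points)
print(rank[["rank", "total_points"]].corr())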

Data Analysis

Before modeling, we need to define what will be predicted. Although the ideal would be to predict among win, draw, and loss, a 3-class classification problem is much harder to analyze and evaluate. So I decided to predict between two classes: home team wins, and home team draws or loses.
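The target itself isn’t built in the snippets above; a minimal sketch, assuming the home_score and away_score columns from results.csv:

# binary target: 1 if the "home" team won, 0 if it drew or lost
df_wc_ranked["target"] = (df_wc_ranked["home_score"] > df_wc_ranked["away_score"]).astype(int)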

To analyze the features in relation to the target, I’ll use violin plots and boxplots. The idea is to see how each feature’s distribution behaves for each class, and whether it separates the data well.
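The plotting code isn’t shown in the article; below is a minimal sketch for a single feature with seaborn, assuming rank_dif can be computed from the rank_home and rank_away columns produced by the merge suffixes:

import seaborn as sns
import matplotlib.pyplot as plt

# difference in FIFA ranking position between the two teams
df_wc_ranked["rank_dif"] = df_wc_ranked["rank_home"] - df_wc_ranked["rank_away"]

# how does the feature's distribution differ between the two target classes?
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.violinplot(data=df_wc_ranked, x="target", y="rank_dif", ax=ax1)
sns.boxplot(data=df_wc_ranked, x="target", y="rank_dif", ax=ax2)
plt.show()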

Violin plots of the first set of data

For the first set of features, the ones already created, only rank_dif, the difference between the two teams’ rankings, has a clear impact on the target classes. So I created more features based on differences, since they seem to be good predictors:

  • Difference in goals scored.
  • Difference in goals suffered.
  • Difference between the goals scored by a team and the goals suffered by its opponent.

And again I analyzed the violin plots.

Violin plot of the new features

The goal-difference and suffered-goal-difference features have a good impact. However, the features that map the difference between the goals scored by a team and the goals suffered by its opponent have no impact. So we now have:

  • Ranking difference.
  • Goal difference at the World Cup cycle and in the last 5 games.
  • Suffered goals difference at the World Cup cycle and in the last 5 games.

We can also take the difference in game points, the difference in the ranking positions faced, and the difference in points won weighted by the ranking faced. And, to weigh the level of the opposition, I thought of the following feature: the difference between goals scored weighted by the ranking faced, and the same for goals suffered.

The violin plots of these features are not very informative because of their small scale, so I also analyzed their boxplots.

Violin and Boxplot for the different features

The points difference, the goals difference weighted by ranking, the points difference weighted by ranking, and the difference in ranking faced are good features. However, their last-5-games versions and their full-cycle versions seem to be correlated. I’ll check this below.

Correlation of the features

Analyzing the correlations, it’s clear that for the difference in goals scored weighted by ranking it’s best to keep only one of the versions, and I chose the one that considers the full cycle. For the others, both versions can be used.

Then, we have, as features:

  • Difference between rankings (rank_dif)
  • Difference between the average goals scored during the World Cup cycle and in the last 5 games (goals_dif/goals_dif_l5)
  • Difference between the average goals suffered during the World Cup cycle and in the last 5 games (goals_suf_dif/goals_suf_dif_l5)
  • Difference between the average ranking position faced during the World Cup cycle and in the last 5 games (dif_rank_agst/dif_rank_agst_l5)
  • Difference between the goals scored weighted by the average ranking faced during the World Cup cycle (goals_per_ranking_dif)
  • Difference between the game points won weighted by the average ranking faced during the World Cup cycle and in the last 5 games (dif_points_rank/dif_points_rank_l5)
  • Categorical variable indicating whether the game is a friendly or not (is_friendly)

With that, we have a dataset with the features needed to apply the Machine Learning model.

Some of the features of the database

Model

My idea here is to build two models, a Random Forest and a Gradient Boosting classifier, and compare them to find the better one to use in the simulation. I decided to use tree-based models because they performed better on football problems when I reviewed the literature. Also, I don’t see the need for more complex models, given the size of the dataset.

I’ll run a hyperparameter search using scikit-learn’s GridSearchCV and use the best model in the simulation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# separating the target from the features
X = model_db.iloc[:, 3:]
y = model_db[["target"]]

# splitting the database into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

gb = GradientBoostingClassifier(random_state=5)
params = {"learning_rate": [0.01, 0.1, 0.5],
          "min_samples_split": [5, 10],
          "min_samples_leaf": [3, 5],
          "max_depth": [3, 5, 10],
          "max_features": ["sqrt"],
          "n_estimators": [100, 200]}

gb_cv = GridSearchCV(gb, params, cv=3, n_jobs=-1, verbose=False)
gb_cv.fit(X_train.values, np.ravel(y_train))

# getting the best model
gb = gb_cv.best_estimator_

I avoided testing too many parameter values because of the execution time, prioritizing values that reduce overfitting, such as a learning_rate that is not too low and an n_estimators that is not too high.

I did the same for the Random Forest:

params_rf = {"max_depth": [20],
             "min_samples_split": [5, 10],
             "max_leaf_nodes": [175, 200],
             "min_samples_leaf": [5, 10],
             "n_estimators": [250],
             "max_features": ["sqrt"]}

rf = RandomForestClassifier(random_state=1)
rf_cv = GridSearchCV(rf, params_rf, cv=3, n_jobs=-1, verbose=False)
rf_cv.fit(X_train.values, np.ravel(y_train))

I analyzed both models with a confusion matrix and a ROC curve, and the results were:
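The analyze helper isn’t defined in the snippets above; a minimal sketch of what it might do, assuming it reports a confusion matrix and a ROC curve on the held-out test split:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

def analyze(model):
    # confusion matrix and ROC curve on the held-out test set
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ConfusionMatrixDisplay.from_estimator(model, X_test.values, np.ravel(y_test), ax=ax1)
    RocCurveDisplay.from_estimator(model, X_test.values, np.ravel(y_test), ax=ax2)
    plt.show()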

analyze(gb)
Gradient Boosting results
analyze(rf)
Random Forest results

The Random Forest model has slightly better performance, but it seems to overfit. Looking at the AUC-ROC of the Gradient Boosting model, we see almost the same performance with a lower risk of overfitting, which is why it was chosen.

World Cup Simulation

Now we arrive at the most interesting part: seeing which team the model predicts will win the World Cup!

The first thing to do is obtain the list of teams playing in the World Cup, for which I used Pandas’ read_html method. The method extracts dataframes from a web page, and I pointed it at the Wikipedia page for the World Cup. With that, I can recreate the World Cup group table.
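A sketch of that scraping step; the exact table index that holds the group-stage draw is an assumption and needs to be inspected:

import pandas as pd

# every table on the Wikipedia page for the 2022 World Cup
tables = pd.read_html("https://en.wikipedia.org/wiki/2022_FIFA_World_Cup")
print(len(tables))  # check which tables contain the groups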

The table holds the games, each team’s points in its group, and a list storing the probabilities of that team winning each game. The probabilities will be used as a tiebreaker when two teams finish the group with the same number of points.
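A sketch of one way to store that table; the dictionary layout below is my assumption, not necessarily the article’s exact structure:

# one entry per group: the scheduled games plus, for each team, its
# accumulated points and the probabilities of the games it played
table = {
    "Group A": {
        "games": [("Qatar", "Ecuador"), ("Senegal", "Netherlands"),
                  ("Qatar", "Senegal"), ("Netherlands", "Ecuador"),
                  ("Ecuador", "Senegal"), ("Netherlands", "Qatar")],
        "teams": {t: {"points": 0, "probs": []}
                  for t in ["Qatar", "Ecuador", "Senegal", "Netherlands"]},
    },
    # ... groups B through H follow the same structure
}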

The first four groups in the created table and the first ten games of the World Cup

As I already explained, the model classifies between a home-team win and a home-team draw/loss. So how can we predict draws? I created an objective rule for this: knowing that all World Cup games are played at a neutral venue, each game is predicted in two ways:

  • Team A x Team B (Simulation 1)
  • Team B x Team A (Simulation 2)

If Team A or Team B wins both predictions, the win is assigned to that team. If one team wins the first prediction and the other wins the second, the game is assigned as a draw. In the playoff phase, the probabilities are calculated for both predictions, and the team with the highest average probability advances. Since the model assigns a “win” to the away team even when it draws, the draw probability is folded into the away team’s probability; with this kind of simulation, each team gets one prediction in which the draw counts in its favor.
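A sketch of that rule; find_features below is a hypothetical helper that builds the feature row for an ordered (home, away) pair, and the class order assumes target 1 = home-team win:

def predict_game(model, team_a, team_b):
    """Return (result, prob_a, prob_b), where result is "A", "B" or "draw"."""
    # simulation 1: team A as "home"; simulation 2: team B as "home"
    # find_features is a hypothetical helper, not shown in the article
    probs_1 = model.predict_proba([find_features(team_a, team_b)])[0]
    probs_2 = model.predict_proba([find_features(team_b, team_a)])[0]

    # class 1 = "home" team wins, class 0 = "home" team draws or loses,
    # so each team's draw probability lives in the run where it is "away"
    prob_a = (probs_1[1] + probs_2[0]) / 2
    prob_b = (probs_1[0] + probs_2[1]) / 2

    a_wins_both = probs_1[1] > probs_1[0] and probs_2[0] > probs_2[1]
    b_wins_both = probs_1[0] > probs_1[1] and probs_2[1] > probs_2[0]
    if a_wins_both:
        return "A", prob_a, prob_b
    if b_wins_both:
        return "B", prob_a, prob_b
    return "draw", prob_a, prob_b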

The data used in the simulation goes up to each team’s last game. In other words, for Brazil, the features are calculated up to the game against Tunisia, the last one Brazil played.

Now, we can run code that simulates the tournament game by game, calculating the points and seeing what happens in the group stage.
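The simulation loop itself isn’t shown; a minimal sketch of the group stage, reusing the hypothetical table structure and the predict_game sketch above:

# group stage: 3 points for a win, 1 for a draw, 0 for a loss;
# the win probabilities are stored for the tiebreaker
for group in table.values():
    for team_a, team_b in group["games"]:
        result, prob_a, prob_b = predict_game(gb, team_a, team_b)
        group["teams"][team_a]["probs"].append(prob_a)
        group["teams"][team_b]["probs"].append(prob_b)
        if result == "A":
            group["teams"][team_a]["points"] += 3
        elif result == "B":
            group["teams"][team_b]["points"] += 3
        else:
            group["teams"][team_a]["points"] += 1
            group["teams"][team_b]["points"] += 1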

The outcome was:

It’s interesting to see some of the results, such as the draws between Brazil and Switzerland and between Denmark and France. In general, the favorites passed the group stage.

In the playoffs, the idea is the same:
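The playoff code isn’t shown either; a minimal sketch of the advancing rule, reusing the predict_game sketch above:

def knockout_winner(model, team_a, team_b):
    # no draws in the playoffs: the team with the higher average
    # "win" probability (draw folded into its away run) advances
    _, prob_a, prob_b = predict_game(model, team_a, team_b)
    return team_a if prob_a > prob_b else team_b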

To visualize the results here, besides the text output, I decided to plot the playoff bracket as done in this Kaggle notebook. It’s a very interesting way to present the outcomes of this kind of problem.

The World Cup was simulated! My model predicts a win for Brazil, with 56% probability, against England in the final! I think the biggest upsets were Belgium getting past Germany and England reaching the final by eliminating France in the quarter-final. It’s interesting to see some games in which the probabilities are very tight, such as Netherlands vs. Argentina. From the quarter-finals to the final, no team advanced with more than 60% probability, which shows that most of the teams that reached the playoffs are at a similar level.

Playoff picture

Conclusion

The idea of this project was to practice my Machine Learning knowledge with something I like: football. I thought simulating the World Cup was very interesting, since it is a trending topic that attracts the attention of everyone who likes the sport. I believe the goal was achieved, because building all the features and doing the data analysis gave me the opportunity to research and learn many new techniques.

As for the results, I really hope the model gets the champion right!

If you want to see the code in detail, check my GitHub and my Kaggle. And if you want to reach me, here’s my LinkedIn. Thanks!
