Predicting the Outcomes of European Football Matches with ML

By: Sonali Singh, Madelyn Dempsey, and Andrew Antenberg

Who will win? It’s a question that has plagued sports bettors for ages (and likely always will). Football, as a game, is oftentimes unpredictable. That’s the beauty of it. The great underdog stories, like Liverpool against Milan in 2005 or Morocco making it all the way to the 2022 FIFA World Cup semifinals, are what keep people watching and have defined soccer history.

Messi’s Classic Goal Celebration

But… oftentimes unpredictable doesn’t mean that there aren’t favorites. It doesn’t mean that football isn’t a science that can be analyzed with statistics and team metrics. The ongoing debate about which team will prevail and which team will sink to its knees in defeat never happens without discussion of each team’s strengths and weaknesses: the offensive creativity of Lionel Messi’s Argentina or the defensive excellence of Paris Saint-Germain. So, we thought it would be interesting to analyze whether match outcomes (down to goal differences!) could be predicted using… well… science. We were curious about a few questions. Could we find data that gave football teams numerical scores for attributes like offensive or defensive strength? Or data that could tell us, once and for all, whether Neymar or Iniesta had stronger dribbling skills? And most of all: could we combine this data on teams and players (and years of match outcomes) to accurately forecast the winners and the goal differences across matches in various European soccer leagues?

To answer these questions, we ventured out on the tried-and-true path of many data scientists to ultimately use various machine learning techniques to build an accurate classification model! Read on to see if we were able to do it (hint: we got pretty close!).

Part 1. Getting Data

The first step was getting data. We wanted as much data as we could get in three areas: (1) Match results (and details) for seasons in the biggest European football leagues (think the English Premier League, German Bundesliga, Spanish LaLiga, etc.), (2) Player statistics for each of the players that played in those games (e.g., speed, finishing power), and (3) Recent team-wide statistics for each of the teams that played in those games (e.g., defensive power, offensive passing ability).

We found a great database on Kaggle that gave us information on over 26,000 matches played across eight seasons (2008–2016) in eleven (!) different European leagues: Belgium’s Jupiler League, England’s Premier League, France’s Ligue 1, Germany’s Bundesliga, Italy’s Serie A, Netherlands’ Eredivisie, Poland’s Ekstraklasa, Portugal’s Liga ZON Sagres, Scotland’s Premier League, Spain’s LaLiga, and Switzerland’s Super League.

We also found supplementary datasets that included numerical scores for player and team attributes from EA Sports’ FIFA Video Game. A sample of these statistics:

  • Player attributes: crossing, finishing, dribbling, free kick accuracy, acceleration, sprint speed, shot power, marking, stamina, etc.
Columns: 
['player_api_id', 'date', 'overall_rating', 'potential', 'preferred_foot',
'attacking_work_rate', 'defensive_work_rate', 'crossing', 'finishing',
'heading_accuracy', 'short_passing', 'volleys', 'dribbling', 'curve',
'free_kick_accuracy', 'long_passing', 'ball_control', 'acceleration',
'sprint_speed', 'agility', 'reactions', 'balance', 'shot_power', 'jumping',
'stamina', 'strength', 'long_shots', 'aggression', 'interceptions',
'positioning', 'vision', 'penalties', 'marking', 'standing_tackle',
'sliding_tackle', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
'gk_reflexes', 'gk_overall']
  • Team attributes: speed of build up play, passing within build up play, chance creation shooting, chance creation crossing, defense pressure, etc.
Columns:
['team_api_id', 'date', 'buildUpPlaySpeed', 'buildUpPlaySpeedClass',
'buildUpPlayDribbling', 'buildUpPlayDribblingClass', 'buildUpPlayPassing',
'buildUpPlayPassingClass', 'buildUpPlayPositioningClass',
'chanceCreationPassing', 'chanceCreationPassingClass',
'chanceCreationCrossing', 'chanceCreationCrossingClass',
'chanceCreationShooting', 'chanceCreationShootingClass',
'chanceCreationPositioningClass', 'defencePressure',
'defencePressureClass', 'defenceAggression', 'defenceAggressionClass',
'defenceTeamWidth', 'defenceTeamWidthClass', 'defenceDefenderLineClass']

And so, we were able to continue toward our ultimate goal: using match-specific player and team attributes as features to predict football match outcomes.

Part 2. Cleaning Data

Before putting together our various data sources (into one cleaned dataframe, ready to be fed into a hungry ML model), we had to do a deep dive on each of our datasets (players, teams, and matches) and extract just the relevant information.

a. Checking for Multicollinearity within Player Attributes

We first checked for correlations within our player attributes to account for any multicollinearity between features before building our models.

Looking at this matrix, we observed that:

  1. All of our goalie attributes are correlated with each other and negatively correlated with most other attributes. This makes sense, as most players are field players and don’t possess strong goalie skills.
  2. There are a number of attributes that experience strong correlation, including: potential, ball control, short_passing, acceleration, long_shots, vision, standing_tackle, and sliding_tackle.

So, we decided to combine the five goalie attributes into one attribute, called gk_overall, and to drop all the attributes that showed high degrees of correlation (as well as non-numerical attributes like attacking and defensive work rate).
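
As a rough sketch, that step might look something like the following (here, players_df is a stand-in name for our player-attribute dataframe, and the drop list simply mirrors the correlated columns called out above):

import pandas as pd

gk_cols = ['gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes']

# Collapse the five goalkeeper attributes into a single gk_overall score
players_df['gk_overall'] = players_df[gk_cols].mean(axis=1)

# Drop the original goalkeeper columns, the highly correlated attributes,
# and the non-numerical work-rate columns
drop_cols = gk_cols + ['potential', 'ball_control', 'short_passing', 'acceleration',
                       'long_shots', 'vision', 'standing_tackle', 'sliding_tackle',
                       'attacking_work_rate', 'defensive_work_rate']
players_df = players_df.drop(columns=drop_cols)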

b. Contextualizing Team Attributes

Next up: the team attributes. Though ultra-football aficionados may be able to understand what a high score for chanceCreationPassing means, we (as simple football fans) wanted to do a little bit more digging into how scores were assigned for each of the team attributes. So, we decided to look at how various scores for team attributes corresponded with the qualitative description of the attribute.

We saw that high scores for the four visualized attributes (buildUpPlayPassing, chanceCreationPassing, defenceAggression, defenceTeamWidth) corresponded with long passing, risky passing, doubling up on defense (two players on one opposing player, high pressure), and wide defensive formations. Contextualizing these descriptors, we decided that a higher score for most (if not all) of the team-wide attributes generally indicated higher competitiveness. Consequently, we decided to keep just the numerical team attributes as features (now that we better understood what they meant) instead of one-hot-encoding every class of team attribute.
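
One simple way to do that digging (a sketch, assuming the team-attribute table is loaded as teams_df) is to group each numeric score by its paired qualitative class and compare the ranges:

# For each numeric attribute, see which score ranges map to which qualitative class
# (e.g. buildUpPlayPassing scores vs. the passing classes such as 'Long' passing)
summary = (teams_df.groupby('buildUpPlayPassingClass')['buildUpPlayPassing']
           .describe()[['min', 'mean', 'max']])
print(summary)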

c. Combining Player/Team Statistics Over Time

About halfway through our data cleaning process, we identified something important. Our data contained multiple statistic measurements for the same player or team across different dates. This made sense. As players and teams grew and changed between seasons, the FIFA database updated their skill assessments periodically.

For the player data, we decided that such a level of granularity came with a tradeoff of making our model unnecessarily complicated — diminutive updates in each attribute were unlikely to provide any marginal increase in the performance of our predictive model (i.e. the difference between a player’s sprint speed scored at 68.0 instead of 67.7 is negligible). So, we decided to average out every player’s attribute scores over the available time period for that player, to get a general sense of the capabilities of the player.
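
A minimal sketch of that averaging (assuming the raw player-attribute table, with one row per player per measurement date, is called player_attrs_df):

# Average every numeric attribute over all of a player's measurement dates,
# leaving one row per player with their typical scores
merged_players_df = (player_attrs_df
                     .groupby('player_api_id', as_index=False)
                     .mean(numeric_only=True))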

However, we thought about the team data a little differently. It was more feasible for the capabilities of a team to drastically change from season to season (think about a team that drafted three star players in the 2011–2012 season, for example, that it didn’t have before). Hence, we decided not to do any averaging/aggregation for the team statistics — instead, opting to map the teams playing in matches (in our matches data) to the most recent team statistics we had for them… which though helpful, turned out to be quite a tedious process.
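
One way to do that mapping (a sketch, assuming both tables carry a datetime date column and that the match table stores the home team’s id as home_team_api_id; the away side works the same way) is pandas’ merge_asof, which pairs each match with the latest team measurement taken on or before the match date:

import pandas as pd

# merge_asof requires both frames to be sorted by the merge key ('date')
matches_sorted = matches_df.sort_values('date')
team_attrs_sorted = team_attrs_df.sort_values('date')

matches_with_home_stats = pd.merge_asof(
    matches_sorted, team_attrs_sorted,
    on='date',
    left_by='home_team_api_id', right_by='team_api_id',
    direction='backward')  # most recent measurement at or before the match date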

d. Filling in Missing Player and Team Data

Now, one of the biggest parts of data cleaning is accounting for missing values/nulls. Both our player and team data were missing values, and we employed different strategies to tackle each of these challenges.

Out of the more than 11,000 players in the player dataset, we saw that we were missing data on five attributes (volleys, curve, agility, balance, and jumping) for just about 500 players. Now, dropping the rows with missing data would have been harmful as it would have meant throwing out all of the other data (~20 other attributes) we had for these 500 players. Dropping the five attributes with missing data also would have created a data disadvantage for the ~10,500 players for which there were scores. So, that was not an optimal solution either. Instead — because our data was large and diverse and the number of missing values relatively small — we decided to simply impute the missing values of these five attributes with the mean value of each of the attributes (across all players).

Example:
merged_players_df['volleys'].fillna(value=merged_players_df['volleys'].mean(), inplace = True)

Now the team data was a harder beast to tackle. We noticed that team attribute measurements were taken on six different dates throughout the eight seasons of matches. However, around 99 teams were missing measurements on at least one of these dates. So, we decided to fill in the measurements on these dates using the median value of the scores we did have for these teams (note: we chose the median to account for any one-off bad seasons or exceptionally good seasons of these teams).
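
A sketch of that median fill (assuming the team-attribute table is team_attrs_df and that the missing measurements show up as NaNs):

# For each numeric attribute, replace a team's missing measurements with the
# median of that team's available measurements from the other dates
numeric_cols = team_attrs_df.select_dtypes('number').columns.drop('team_api_id')
team_attrs_df[numeric_cols] = (team_attrs_df
                               .groupby('team_api_id')[numeric_cols]
                               .transform(lambda col: col.fillna(col.median())))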

And with that, we solved the problem of missing data!

Part 3. Data Integration and Feature Extraction

After cleaning up our player and team data, there were a couple more steps we needed to take to get our data ready for modeling! Remember: we were trying to use the attribute scores we had for teams and players as features, alongside the match result between two teams, to predict the outcomes of future matches.

Now, before we performed any data integration, our matches data contained information on: the league id of the league the match was within, the date of the match, the ids of the home team and away team, the goals scored by the home team (stored as home_team_goal), the goals scored by the away team (stored as away_team_goal), the starting rosters of the home and away team, and other (less relevant) data. So, the steps we took were as follows:

  1. We dropped columns from the matches data that were not necessary to predict the outcome of a match, including: betting data, player formation data, and in-match statistics (cards, corners, fouls, etc.)
  2. We built columns into the match data corresponding to the average player statistics of the starting players on the home team and the starting players on the away team. Note: this required looking up each player’s statistics in the player dataset we had just cleaned! (A sketch of this lookup appears after this list.)
  3. We built columns into the match data corresponding to the most recently measured team statistics of the home and away teams.
  4. We converted the league_id field into a league name (from another FIFA dataset).
  5. We created a win label column that identified the match result (home win, draw, or away win), for use in our classification model. To do this, we used the home_team_goal and away_team_goal fields. We set this label as 1 if the home team won, 0 if the home team and away team tied, and -1 if the away team won.
Example: where x is home goals scored and y is away goals scored

def home_win(x, y):
    if x > y:
        return 1
    elif x == y:
        return 0
    else:
        return -1

  6. We created a goal difference column that stored the goal difference (from the perspective of the home team), for use in our regression models. To do this, we simply subtracted the away_team_goal field from the home_team_goal field.

Example: 
matches_df['goal_difference'] = matches_df['home_team_goal'] - matches_df['away_team_goal']
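
And, circling back to step 2, here is roughly how the starting-lineup lookup can work. This is only a sketch: the home_player_1 … home_player_11 column names and the merged_players_df lookup table are assumptions based on our cleaned data, and the away side is handled identically.

home_starter_cols = [f'home_player_{i}' for i in range(1, 12)]  # ids of the home starting XI

# Index the cleaned, time-averaged player table by player id for fast lookups
player_lookup = merged_players_df.set_index('player_api_id')

def avg_home_attribute(match_row, attribute):
    """Average one attribute (e.g. 'dribbling') over the home team's starters."""
    starter_ids = match_row[home_starter_cols].dropna().astype(int)
    return player_lookup.loc[starter_ids, attribute].mean()

matches_df['home_avg_dribbling'] = matches_df.apply(
    lambda row: avg_home_attribute(row, 'dribbling'), axis=1)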

And with that, we were ready to model! But first… we had to get down and dirty with EDA.

Part 4. Exploratory Data Analysis (EDA)

The best part of investigating any data science problem is… you guessed it… exploring the data! Before we started modeling, we wanted to visually unveil some trends in our data.

*Note, a complete set of our EDA visualizations can be found here.

a. Analyzing the Players (and Goalies) with Highest Overall Rating

First, we wanted to find out what contributed to a high rating for goalies and individual players. Questions we asked:

  • What commonalities can we find among the top players or goalies?
  • Are there any characteristics that top players have which could contribute to their teams’ success?

To find top players, we sorted by the overall_rating score. After this, we saw that our top three players were Lionel Messi, Cristiano Ronaldo, and Franck Ribery (surprising, right?!).

Top 3 entries of top_players_df

We plotted the player attribute distributions for the top 200 players to see if we could identify any interesting trends.
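
A sketch of how one might build that visual (assuming merged_players_df from our cleaning step, and matplotlib for plotting):

import matplotlib.pyplot as plt

# Take the 200 players with the highest averaged overall rating and box-plot
# the distribution of each remaining numeric attribute across them
top_players_df = (merged_players_df
                  .sort_values('overall_rating', ascending=False)
                  .head(200))
attribute_cols = (top_players_df.select_dtypes('number')
                  .columns.drop(['player_api_id', 'overall_rating']))

top_players_df[attribute_cols].plot(kind='box', figsize=(16, 6), rot=90)
plt.title('Attribute distributions for the top 200 players')
plt.show()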

From this visual, several interesting conclusions were drawn about the top players:

  1. Attributes like sprint_speed, strength, and balance, which are related to biology as well as skill, have the tightest complete distributions (including outliers). This makes sense as, presumably to some extent, the best overall-rated players are naturally physically gifted.
  2. Attributes like shot_power, jumping, and stamina, are tightly distributed, save for a set of outliers. We thought it would be interesting to further investigate if these outliers were a function of position — i.e. perhaps some of the top overall-rated players are defenders, who have relatively lower shot power.
  3. Attributes like interceptions and marking have wide distributions. This may indicate that high scores on these attributes do not necessarily influence overall rating as much (and may, consequently, not influence a team’s strength as much).

To investigate direct relationships between player attributes and overall rating (as well as to investigate if outliers were a function of position), we took a closer look at shot_power and marking.

Some key takeaways from these graphs:

  • The top players on the lower end of the shot_power attribute, like Gianluigi Buffon, are goalkeepers, whereas the top players on the higher end of the shot_power attribute, like Lionel Messi and Cristiano Ronaldo, are forwards/strikers. This explains some of the variance in the shot_power attribute!
  • Conversely, the top players on the higher end of the marking attribute, like Philipp Lahm and Cesc Fabregas, are defenders/midfielders, whereas the top players on the lower end of the marking attribute are forwards/strikers.
  • Altogether, this means that overall rating is probably calculated with relation to different subsets of features — if a player (like Messi) is consistently high on offensive attributes, or if a player (like Buffon) is consistently high on defensive/goalkeeping attributes, they’ll remain at the top of the overall rating.

b. Are Some Leagues More Competitive Than Others?

Next we wanted to investigate if there were any clear trends (in player attributes) that would indicate whether some of the eleven leagues we investigated were more competitive than the others. Specifically, we looked at average player dribbling, average player finishing, average player rating, and average player speed across leagues.

These are interesting results! There are certainly disparities in average player attributes across leagues: average dribbling, finishing, and speed vary greatly. These attributes were quite tightly distributed between our top 200 players, so it could mean that there is a higher concentration of strong players in just a few leagues. This could also imply that (1) these features could be good predictors of an individual team’s strength/competitiveness (just like they indicate a league’s competitiveness) and (2) the outcome of matches in more competitive leagues may be influenced by a higher number of factors than in a less competitive league.

Also, four leagues (England Premier League, Germany 1. Bundesliga, Italy Serie A, and Spain LIGA BBVA) emerged as superior in each of these categories. That’s pretty cool… this conclusion is exactly what public opinion supports.

c. What Are the Attributes of the Winningest Teams?

Next up, we wanted to perform a slightly more involved analysis of our winningest teams, across leagues and across seasons. Specifically, we focused on the winningest teams within two of the more competitive leagues (based on our prior analysis): the English Premier League and the Spanish LIGA BBVA (LaLiga).

For these winningest teams, we wanted to look at the team attributes to understand what features were shared by the most successful teams (and see if we could figure out what features our model might weight more!). The first step was to find the teams with the most wins in every season, from 2008–2016, in each of the two aforementioned leagues.
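
A sketch of that computation (assuming the match table carries a season column, a league_name column from step 4 of Part 3, and the home_win label):

# Count wins per team per season: home wins for home teams, away wins for away teams
home_wins = (matches_df.loc[matches_df['home_win'] == 1]
             .rename(columns={'home_team_api_id': 'team_api_id'})
             .groupby(['season', 'league_name', 'team_api_id']).size())
away_wins = (matches_df.loc[matches_df['home_win'] == -1]
             .rename(columns={'away_team_api_id': 'team_api_id'})
             .groupby(['season', 'league_name', 'team_api_id']).size())

total_wins = home_wins.add(away_wins, fill_value=0)

# The winningest team in each league and season
winningest = total_wins.groupby(level=['season', 'league_name']).idxmax()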

Winningest teams in English Premier League (left) and Spanish LaLiga (right) from 2008–2016

There are a few teams that have consistently been killing it, and winning the most: Manchester United, FC Barcelona, and Real Madrid CF.

We wanted to visualize some of the team characteristics of these teams in their winning seasons, in comparison to the averages across the league. Specifically, we looked at:

  • Manchester United in the 2010/2011 season
  • FC Barcelona in the 2012/2013 season

Now, this wasn’t exactly what we expected, was it? While the winning teams beat average (most times by a lot) on a few attributes, they aren’t necessarily better than average on all. Some takeaways we can glean from this analysis:

  • Strong defensive skills in a team seem to be a decent indicator of the team’s strength (FC Barcelona significantly beat average on all defensive attributes).
  • Strong offensive skills that can get the ball in the net — specifically, buildUpPlaySpeed and chanceCreationCrossing/Shooting may make up for weak defensive skills, to place a team in the top spot.

d. Are the Top Goal-Scoring Teams Offensively Strong?

Next we wanted to see if there was some kind of correlation between the teams who scored the most goals and strength on offensive attributes. If there was, this would be a very positive indication for the predictive capability of our models.

So, we started by finding the highest goal-scoring teams in every season. To do this, we summed the goals of every team (as the home and away team), in every season, in every league, and sorted by the total goals scored.
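
In code, that summation might look roughly like this (same column-name assumptions as the win counts above):

# Goals scored as the home side plus goals scored as the away side, per team per season
home_goals = (matches_df.rename(columns={'home_team_api_id': 'team_api_id'})
              .groupby(['season', 'league_name', 'team_api_id'])['home_team_goal'].sum())
away_goals = (matches_df.rename(columns={'away_team_api_id': 'team_api_id'})
              .groupby(['season', 'league_name', 'team_api_id'])['away_team_goal'].sum())

total_goals = home_goals.add(away_goals, fill_value=0).sort_values(ascending=False)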

Top goal scorers in English Premier League (left) and Spanish LaLiga (right) from 2008–2016

Starting to see a trend in the top teams? Again, household names like Manchester United, Manchester City, FC Barcelona, and Real Madrid were among the top goal scorers.

Out of curiosity, we first visualized these statistics in scatterplots to see how the maximum goals scored by any one team changed over time for each league across each season from 2008–2016.

From these plots we saw that the English Premier League had four different teams that scored the most goals in a season from 2008–2016, whereas the Spanish LaLiga’s top goal scorer fluctuated between only two teams over the same period. We also saw that the season-leading totals were much higher for the Spanish teams than for the English teams, with the lowest such totals at 102 and 71 goals, respectively. This could indicate that the Spanish LaLiga has a couple of dominant teams, with the rest of the league several rungs lower in competitiveness. We hypothesized that this trend could even be learned by a model: one which consistently predicts higher goal differences for FC Barcelona and Real Madrid.

Next, we plotted the offensive attributes of some of these top scorers. We determined offensive attributes to be: buildUpPlaySpeed, buildUpPlayPassing, chanceCreationPassing, chanceCreationCrossing, chanceCreationShooting, avgCrossing, avgFinishing, avgHeadingAccuracy, avgCurve, avgFreeKickAccuracy, and avgShotPower. Specifically, we looked at:

  • Chelsea in the 2009/2010 season
  • Real Madrid in the 2011/2012 season

Super cool! This is pretty much exactly what we expected — these top scoring teams were above average on pretty much every offensive attribute. This is a great sign for the (potential) power of our models.

e. Are the Minimum Goal-Conceding Teams Defensively Strong?

Our next question was a variation of the previous one: were the minimum goal-conceding teams defensively strong? In a similar fashion, we found the teams that conceded the fewest goals during every season in the English and Spanish leagues.

Teams that conceded the least goals in the English Premier League (left) and Spanish LaLiga (right) from 2008–2016

To see how these goal metrics have changed over time, we again plotted the data for both leagues.

These were interesting! Chelsea, Manchester United, and Manchester City were both offensively and defensively strong, since they scored the most goals and conceded the fewest. This provided good insight for future modeling because the team characteristics shared by these three teams could potentially be good predictors for success in a match.

Next, we plotted the defensive attributes of some of these minimum goal conceders against the averages across the league. We defined defensive attributes as: defencePressure, defenceAggression, defenceTeamWidth, avgJumping, avgStrength, avgInterceptions, avgPositioning, avgPenalties, avgMarking, and avgGKOverall. Specifically, we looked at:

  • Atletico Madrid in the 2012/2013 season
  • Tottenham Hotspur in the 2015/2016 season

Again, this is awesome! This is pretty much exactly what we expected: these teams that conceded the fewest goals are above average on pretty much every defensive attribute.

f. Have Leagues Become More Competitive Over Time?

Next, we wanted to see if the leagues had become more competitive (in terms of the player and team attribute scores) over time. We plotted average build up play passing, defense aggression, dribbling, and overall player rating per league, over time.

These trends were super revealing. It seemed that the team average build up play passing per league has converged over time. This could mean that teams are a bit more even now, in terms of passing skills, and thus the ratings have become more relative. Similarly, we saw a bit of a convergence with team average defense aggression, with a small spike in the 2015–2016 season. This could speak to the evolution of the professional game with rule changes and competitive atmosphere.

Finally, we saw that both average team player dribbling and player rating have generally exhibited downward trends over time. We figured this likely indicated a change in the rating systems of FIFA (i.e. less generous to modern players in comparison to some of the greats) as this trend is not isolated to any one league.

g. How Have the Winningest Teams Changed Over Time?

Finally (yes, finally!) we wanted to see how the winningest teams have evolved across the eight seasons for each league (mostly for our own curiosity).

These graphs were especially cool because we could visualize how the power in these leagues has shifted over seasons! For example, we can see the end of the Manchester United reign in 2012 and the start of the Manchester City ascent in 2010/2011.

Part 5. Modeling

EDA revealed that there were, in fact, some correlations between the player/team-wide attributes of a team and the match outcomes. For example, we found that strong offensive teams tend to be the top goal scorers, and strong defensive teams tend to be the top goal blockers. And, after going through all of that, it was finally time for the (actual) most exciting part of the pipeline. The models (duh!). As a reminder, we had two goals to address with modeling:

  1. Predicting goal difference (regression)
  2. Predicting match winner (classification)

Let’s jump in.

a. Predicting Goal Difference with Regression

Our first step was to separate our data into training data and test data. We used an 80/20 split.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, target,
                                                     test_size=0.2, random_state=545)

We started with a simple linear regression model to predict goal difference, using goal_difference (the difference in the number of goals between the home teams and away team) as the label.

We first ran a traditional regression model in SciKit-Learn, using all of the features we had (home player attributes, home team-wide statistics, away player attributes, away team-wide statistics) to predict the target goal difference (which was a lot of features).

from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)
y_pred = linear_reg.predict(x_test)

linear_reg_score = linear_reg.score(x_test, y_test)

The R² score of this model was around 0.22.

Initially, this score was a bit puzzling. Such a low score could, in some scenarios, indicate very little correlation. So, we decided to plot the predicted score difference (by our regression model) vs. the actual score difference to see if we could identify any trend at all.

Predicted vs. Actual Goal Difference for a Simple Linear Regression Model

From this plot, it was easier to understand what the linear regression model was doing. The first important thing to note was that almost 97% of the outcomes in the training set fell between the values of -3 and 4.

round(100 * len([x for x in y_train if -3 <= x <= 4]) / len(y_train), 2)
# -> 96.99

For this reason, the model rarely (if ever) guesses outside of those bounds. We also noted that there is an upward trend in the plot above. As the actual goal difference increases, the predicted goal difference (roughly) increases as well.

The reason behind the R² value (that we considered to be relatively low) is simply the enormity of the variance in the real data. In soccer matches, and sports in general, it is notoriously difficult to predict who will win any given game. Regardless of how much knowledge you have about a team, upsets happen all the time. The same two teams could play each other twice in a row and have completely opposite score differences. And the linear model is not even classifying which team won; it is trying to predict the difference in goals scored in the game. With all of this considered, we realized the R² value of around 0.22 actually did indicate some learning by our model.

We decided to compare this to a baseline model of random guessing, as if by a person with no knowledge of the two teams playing. So, we ‘trained’ a model to randomly guess a goal difference between -3 and 4.

import random
from sklearn.metrics import r2_score

random_model = [random.choice(range(-3, 5)) for _ in y_test]
r2_score(y_test, random_model)

This model, of course, did not perform very well. In fact, it had an R² score of about -1.6. Obviously random guessing is not a good model for the goal differential of a game, but using it as a baseline allowed us to confirm that our model has done some learning.

As it stood, our model had 74 features. That is a lot! We decided to standardize our data, and then use Principal Component Analysis to reduce the number of dimensions of our features, to see if this would improve the success of our linear regression model.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

pca = PCA(n_components=x_train_scaled.shape[1])
pca.fit(x_train_scaled)

PCA Cumulative Explained Variance Ratio (33 components explain over 90% of the variance)

By using PCA with 33 components (that explain over 90% of total variance in the dataset), we achieved a similar, but slightly worse R² value: about 0.19.
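
For completeness, here is roughly what that retraining looked like (a minimal sketch; the component count of 33 comes from the explained-variance plot above, and the scaled matrices and LinearRegression import come from the earlier snippets):

# Project the scaled features onto the first 33 principal components,
# then refit the linear regression on the reduced feature set
pca_33 = PCA(n_components=33)
x_train_pca = pca_33.fit_transform(x_train_scaled)
x_test_pca = pca_33.transform(x_test_scaled)

linear_reg_pca = LinearRegression()
linear_reg_pca.fit(x_train_pca, y_train)
pca_reg_score = linear_reg_pca.score(x_test_pca, y_test)  # ~0.19 on our split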

Next, in order to try to focus the model on the most important contributing features, we decided to use two regularized models, Lasso and Ridge, to try to predict the goal difference.

We trained a Ridge model, with an alpha value of 0.1,

from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=0.1)
ridge.fit(x_train, y_train)
y_pred = ridge.predict(x_test)
ridge_score = ridge.score(x_test, y_test)

and a Lasso model with an alpha value of 0.1.

lasso = Lasso(alpha = 0.1)
lasso.fit(x_train, y_train)
y_pred = lasso.predict(x_test)
lasso_score = lasso.score(x_test, y_test)

These models each achieved R² scores of ~0.22, differing from our linear regression model by just 0.00000004.

Training these models with PCA-transformed data decreased their R² scores to around 0.19 each.

So, why did Linear Regression perform most strongly out of all of our models?

PCA, along with regularization methods like Lasso and Ridge, are used to reduce multicollinearity. However, during data cleaning, we did a lot to ensure very little multicollinearity, including combining (or dropping) features that showed strong correlations. It is possible that the remaining features did not have strong multicollinearity, which is why reducing the number of features or introducing regularization did not have a strong effect on the success of the model on test data.

In thinking about how much variance there is in the real life goal differentials of soccer matches, we decided it would be much more achievable to try to classify which team won (or if there was a draw) in a given match.

b. Predicting Match Winner via Classification

We next tried to solve a simpler, but equally interesting, problem than predicting the goal difference: classifying which team won the match.

For this problem, we used the home_win column in our matches data as the label for classification. Reminder: this column had a value of 1 if the home team won, 0 if there was a draw, and -1 if the home team lost. Again, we split into training and test data using an 80/20 split.

First, it was important to note that of our data, 25% of matches ended in a draw, the home team won in 46% of matches, and the home team lost in 29% of matches. This meant that a baseline model of just predicting a home team win every time would have an accuracy of 46%. So, from the get go, we were shooting for a value higher than that.

We first decided to use a Logistic Regression model as our classifier. This classifier, fit on PCA data with 33 principal components, achieved a stunning accuracy of 53.27%!
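
A minimal sketch of that classifier (assuming the PCA-transformed feature matrices from the regression section, and that y_train and y_test now hold the home_win labels):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression on the 33-component PCA features;
# the labels are 1 (home win), 0 (draw), and -1 (away win)
log_reg = LogisticRegression(max_iter=1000)  # higher max_iter just to ensure convergence
log_reg.fit(x_train_pca, y_train)
y_pred = log_reg.predict(x_test_pca)
accuracy_score(y_test, y_pred)  # ~0.53 on our split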

The Logistic Regression model yielded the following Confusion Matrix:

Confusion Matrix for Logistic Regression Classifier

The confusion matrix showed us that our model rarely ever predicted a draw, classifying draws in a much lower proportion than the actual proportion of games that result in draws. This meant the model’s accuracy came largely from predicting wins and losses.

Even from human intuition, this makes sense. There is not much about two teams that would indicate they are likely to draw, even if the teams are at a similar skill level. However, we can often say with some confidence that one team is much more likely to win than another, simply because we believe that team is much better.

It may be for a similar reason that our model was more inclined to predict a win or a loss rather than a draw. It’s something that can be done with a (seemingly) much higher certainty.

We also built a Random Forest Classifier, instantiated with a max depth of 10. We again fit this to our PCA-transformed data, and achieved a similar accuracy of 53.13%, with the following confusion matrix:

Confusion Matrix for Random Forest Classifier

This model yielded a very similar looking confusion matrix! We concluded that the Random Forest model had a similar decision-making process to that of the Logistic Regressor.
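
For reference, a minimal sketch of that Random Forest setup (again assuming the PCA-transformed features; the random_state is an arbitrary choice for reproducibility, and other hyperparameters were left at scikit-learn defaults):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rf = RandomForestClassifier(max_depth=10, random_state=545)
rf.fit(x_train_pca, y_train)
rf_accuracy = rf.score(x_test_pca, y_test)  # ~0.53 on our split

# Rows are actual classes, columns are predicted classes (-1, 0, 1)
cm = confusion_matrix(y_test, rf.predict(x_test_pca))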

Side Note: Betting Odds

It is hard to talk about predicting sports outcomes without talking about sports betting. Ideally, we would create a model that is accurate enough to consistently make money in a sports betting scenario.

As it turns out, at -110 odds, an accuracy of about 52.4% is all that’s required to turn a profit. Remember, our models had an accuracy of over 53%! That means we can get rich off of them! Right?

In practice, sadly, betting odds for soccer match win/loss/draws are rarely (if ever) at -110. But if we did make a model that could make us millions in sports betting, we probably wouldn’t be sharing the details on medium.com.

Modeling Conclusion

Overall, we found that our simple linear regression model performed best at the task of predicting the goal difference of a game, out-performing regularized models and models trained on data transformed by PCA. This was likely a result of the types of features in our dataset, which are largely non-collinear. We were able to achieve R² values of around 0.2 at this task. Because the goal difference of a game is so hard to predict just based on the attributes of a team, we understand why this value is relatively low.

At the task of classifying the outcome of a game (win, loss, or draw), our Logistic Regression and Random Forest models performed very well, achieving around 53% accuracy. This was shown to be much higher than guessing randomly or repeatedly guessing a constant value. As a result, we are ultimately pleased with this level of accuracy from these models!

And hey — if you end up winning big on your next sports bet, feel free to credit it to this article ;).
