Predicting the Winner of Copa America 2024 with Data Science — Part 2

Jul 4, 2024


In my previous post, I simulated the entire Copa America 2024 tournament using the Elo rating system to measure team strength, predicting Argentina as the winner. I also mentioned that it would be better to simulate the knockout stage after the group stage since the Elo ratings would more accurately reflect the teams’ strengths by then.

Now that the Copa America 2024 group stage is over and the knockout stage begins today (July 4, 2024) with Argentina facing Ecuador, it’s the perfect time to:

Evaluate how my original simulation performed for the group stage.
Run an updated simulation for the knockout stage.
Make the final call on who will be the Copa America 2024 champion!

How did my group-stage simulation perform?

You be the judge! 😉

Here are the simulation results:

| Group | Team          | Mean points | 99% confidence interval | 
|-------|---------------|-------------|-------------------------|
|   A   | Argentina     |    6.76     |       6.52 - 7.05       |
|   A   | Chile         |    3.55     |       3.11 - 3.96       |
|   A   | Peru          |    3.05     |       2.74 - 3.53       |
|   A   | Canada        |    2.65     |       2.27 - 3.03       |
|-------|---------------|-------------|-------------------------|
|   B   | Ecuador       |    5.75     |       5.24 - 6.12       |
|   B   | Mexico        |    4.59     |       4.06 - 4.98       |
|   B   | Venezuela     |    3.38     |       2.99 - 3.92       |
|   B   | Jamaica       |    2.32     |       1.92 - 2.70       |
|-------|---------------|-------------|-------------------------|
|   C   | Uruguay       |    6.71     |       6.29 - 7.14       |
|   C   | United States |    4.71     |       4.31 - 5.30       |
|   C   | Panama        |    2.88     |       2.48 - 3.32       |
|   C   | Bolivia       |    1.83     |       1.42 - 2.18       |
|-------|---------------|-------------|-------------------------|
|   D   | Colombia      |    6.69     |       6.34 - 7.20       |
|   D   | Brazil        |    5.38     |       5.06 - 5.72       |
|   D   | Paraguay      |    2.68     |       2.33 - 3.06       |
|   D   | Costa Rica    |    1.45     |       1.23 - 1.77       |

The table above shows teams sorted by the average number of points they earned across 100 simulation rounds (each round consisting of 100 runs). It also includes the 99% confidence interval for these points to indicate uncertainty. I used these confidence intervals to determine whether I could confidently predict if a team would finish 1st, 2nd, 3rd, or 4th in their group. Specifically, if the confidence intervals of two teams’ simulated points overlapped, I would not make a prediction between them.
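
The overlap rule can be sketched as follows. This is a minimal illustration rather than the code used in the original simulation: the helper name `intervals_overlap` is hypothetical, and the intervals are copied from the table above.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 99% confidence intervals copied from the simulation table above
intervals = {
    "Argentina": (6.52, 7.05),
    "Chile": (3.11, 3.96),
    "Peru": (2.74, 3.53),
    "Ecuador": (5.24, 6.12),
    "Mexico": (4.06, 4.98),
}

# Pairs of teams from the same group, higher simulated mean first
pairs = [("Argentina", "Chile"), ("Chile", "Peru"), ("Ecuador", "Mexico")]

# A pair is only "called" when the two intervals do not overlap
calls = {}
for upper, lower in pairs:
    calls[(upper, lower)] = not intervals_overlap(intervals[upper], intervals[lower])
    verdict = "call it" if calls[(upper, lower)] else "too close to call"
    print(f"{upper} over {lower}: {verdict}")
```

With these inputs, Argentina over Chile and Ecuador over Mexico are called, while Chile over Peru is left open, which matches the statements quoted below.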

Here is what I stated regarding these results in my previous post:

- Group A should see Argentina easily top the group. However, while Chile has a clear edge over Peru, it’s not definitive since the confidence intervals between the two overlap. Canada should definitely end up in last place.
- Group B should definitely be: Ecuador, Mexico, Venezuela and Jamaica. No overlap in confidence intervals, and if we look at their team strengths, it’s clear that Ecuador tops the group. Although I would say this is the sleeper group… the lowest team strength of the tournament overall.
- Group C should definitely be: Uruguay, United States, Panama, and Bolivia. Panama might at best get a win against Bolivia, just like the 2016 Copa America (when it was drawn into a group with eventual finalists Argentina and Chile).
- Group D should definitely be: Colombia, Brazil, Paraguay, Costa Rica.

And here are the actual group stage results:

| Group | Team          | Points | 
|-------|---------------|--------|
|   A   | Argentina     |    9   |
|   A   | Canada        |    4   |
|   A   | Chile         |    2   |
|   A   | Peru          |    1   |
|-------|---------------|--------|
|   B   | Venezuela     |    9   |
|   B   | Ecuador       |    4   |
|   B   | Mexico        |    4   |
|   B   | Jamaica       |    0   |
|-------|---------------|--------|
|   C   | Uruguay       |    9   |
|   C   | Panama        |    6   |
|   C   | United States |    3   |
|   C   | Bolivia       |    0   |
|-------|---------------|--------|
|   D   | Colombia      |    7   |
|   D   | Brazil        |    5   |
|   D   | Costa Rica    |    3   |
|   D   | Paraguay      |    0   |

There are several ways to evaluate these results, some more favorable to the simulation than others. Here are three different evaluations:

Evaluation 1: Which teams that were definitely predicted to advance to the knockout stage actually advanced?

The simulation predicted with certainty:

Group A: Argentina (TRUE)
Group B: Ecuador (TRUE), Mexico (FALSE)
Group C: Uruguay (TRUE), United States (FALSE)
Group D: Colombia (TRUE), Brazil (TRUE)

This results in 5 out of 7 teams called correctly, approximately 71.4% correct. I also played it safe with Group A and did not call the second place.

Evaluation 2: Which teams that were definitely predicted to place in their respective positions actually placed as such?

The simulation predicted with certainty:

Group A: 1. Argentina (TRUE)
Group B: 1. Ecuador (FALSE), 2. Mexico (FALSE), 3. Venezuela (FALSE), 4. Jamaica (TRUE)
Group C: 1. Uruguay (TRUE), 2. United States (FALSE), 3. Panama (FALSE), 4. Bolivia (TRUE)
Group D: 1. Colombia (TRUE), 2. Brazil (TRUE), 3. Paraguay (FALSE), 4. Costa Rica (FALSE)

This results in 6 out of 13 teams called correctly, approximately 46.2% correct. Adding that I incorrectly predicted Canada would finish last, the accuracy drops to 6 out of 14, or 42.9%.

This evaluation is strict since one incorrect prediction affects others in the group. The next evaluation is more lenient.

Evaluation 3: Which relative team positions were predicted correctly?

This evaluation looks at individual team placements within their groups:

Group A: Argentina placing higher than everyone else (TRUE, TRUE, TRUE)
Group A: Chile placing higher than Canada (FALSE)
Group B: Ecuador placing higher than everyone else (TRUE, FALSE, TRUE)
Group B: Mexico placing higher than Venezuela and Jamaica (FALSE, TRUE)
Group B: Venezuela placing higher than Jamaica (TRUE)
Group C: Uruguay placing higher than everyone else (TRUE, TRUE, TRUE)
Group C: United States placing higher than Panama and Bolivia (FALSE, TRUE)
Group C: Panama placing higher than Bolivia (TRUE)
Group D: Colombia placing higher than everyone else (TRUE, TRUE, TRUE)
Group D: Brazil placing higher than Paraguay and Costa Rica (TRUE, TRUE)
Group D: Paraguay placing higher than Costa Rica (FALSE)

Adding up all the correct and incorrect predictions, we have 17 correct predictions out of 22 made with certainty, approximately 77.3%.
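
The tally can be reproduced with a short script. This is a sketch under stated assumptions: the pairs are exactly the ones enumerated above, the standings follow the order of the results table, and the variable names are hypothetical rather than taken from my repo.

```python
# Actual final standings, in the order listed in the results table above
actual_standings = {
    "A": ["Argentina", "Canada", "Chile", "Peru"],
    "B": ["Venezuela", "Ecuador", "Mexico", "Jamaica"],
    "C": ["Uruguay", "Panama", "United States", "Bolivia"],
    "D": ["Colombia", "Brazil", "Costa Rica", "Paraguay"],
}

# (group, team predicted to place higher, team predicted to place lower)
predicted_pairs = [
    ("A", "Argentina", "Chile"), ("A", "Argentina", "Peru"), ("A", "Argentina", "Canada"),
    ("A", "Chile", "Canada"),
    ("B", "Ecuador", "Mexico"), ("B", "Ecuador", "Venezuela"), ("B", "Ecuador", "Jamaica"),
    ("B", "Mexico", "Venezuela"), ("B", "Mexico", "Jamaica"),
    ("B", "Venezuela", "Jamaica"),
    ("C", "Uruguay", "United States"), ("C", "Uruguay", "Panama"), ("C", "Uruguay", "Bolivia"),
    ("C", "United States", "Panama"), ("C", "United States", "Bolivia"),
    ("C", "Panama", "Bolivia"),
    ("D", "Colombia", "Brazil"), ("D", "Colombia", "Paraguay"), ("D", "Colombia", "Costa Rica"),
    ("D", "Brazil", "Paraguay"), ("D", "Brazil", "Costa Rica"),
    ("D", "Paraguay", "Costa Rica"),
]

correct = 0
for group, higher, lower in predicted_pairs:
    order = actual_standings[group]
    # A pair is correct when the "higher" team really finished above the "lower" one
    if order.index(higher) < order.index(lower):
        correct += 1

total = len(predicted_pairs)
print(f"{correct} / {total} = {correct / total:.1%}")
```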

How will the knockout stage play out?

The key difference between simulating a group stage and a knockout stage is that in a group stage, we know all the matches that will occur. In contrast, in a knockout stage, we only know the first round of matches.

To tackle this, I simulated each knockout stage game and repeated this knockout stage simulation 10,000 times to gather comprehensive statistics.

Here’s how I did it.

Setting Up the Environment

First, we need to import the necessary libraries and set up the initial conditions for our simulation.

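import csv
import math    # math.pow is used for the Elo win probability
import random  # random.uniform is used to draw each match outcome

import numpy as np
import pandas as pd
from tqdm import tqdm
from joblib import Parallel, delayed
from src.copa_america_simulator import *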

We also need to prepare some data. Specifically, we need a roster of the teams in the Copa America 2024 knockout stage along with their Elo ratings after the group stage. I prepared a CSV file with this information.

| Team      | Elo rating |
|-----------|------------|
| Argentina |    2144    |
| Colombia  |    2030    |
| Uruguay   |    2027    |
| Brazil    |    2021    |
| Ecuador   |    1853    |
| Venezuela |    1826    |
| Panama    |    1747    |
| Canada    |    1741    |

Additionally, I created another CSV file with the knockout stage matchups.

| Match | Home Team | Away Team | Advances | To Match | Penalties? |     Stage     |
|-------|-----------|-----------|----------|----------|------------|---------------|      
|   0   | Argentina | Ecuador   |          |    4     |            | quarterfinals |
|   1   | Venezuela | Canada    |          |    4     |            | quarterfinals |
|   2   | Uruguay   | Brazil    |          |    5     |            | quarterfinals |
|   3   | Colombia  | Panama    |          |    5     |            | quarterfinals |
|   4   |           |           |          |    6     |            |  semifinals   |
|   5   |           |           |          |    6     |            |  semifinals   |
|   6   |           |           |          |    7     |            |     final     |
|   7   |           |           |          |    7     |            |  third place  |

Python scripts

First, I defined a function to simulate a knockout-stage match. This function reads a given match and outputs the results in a format consistent with the knockout-stage CSV files mentioned above.

def simulate_playoff_game(game, ternary=True):
    """
    Simulates a single game in the knockout stage
    """

    import numpy as np

    home = game["elo_prob_home"]
    away = 1 - game["elo_prob_home"]
    tie = 0

    # Simulating game proper
    wildcard = random.uniform(0, 1)

    # Concoction to go from binary probabilities to ternary
    # 50-50% should translate into 1/3, 1/3, 1/3 (even split of probability space)
    # With increasing lopsidedness (e.g. 75-25%), the stronger team should see increased win probability,
    # and the weaker team should see decreased win probability. And in terms of ties, that also should decrease
    # as lopsidedness increases, but I assume that at a lower rate than weaker team win probability.
    if ternary:
        if home > 0 and home < 1:
            home_odds = home / away
            tie_odds = 1
            away_odds = 1 - abs(home - 0.5) * 2

            home_odds1 = (home / away) / min(away_odds, tie_odds, home_odds)
            tie_odds1 = 1 / min(away_odds, tie_odds, home_odds)
            away_odds1 = (1 - abs(home - 0.5) * 2) / min(away_odds, tie_odds, home_odds)

            home = home_odds1 / (home_odds1 + tie_odds1 + away_odds1)
            tie = tie_odds1 / (home_odds1 + tie_odds1 + away_odds1)
            away = away_odds1 / (home_odds1 + tie_odds1 + away_odds1)

        elif home == 0:
            tie = 0
            away = 1

        elif home == 1:
            tie = 0
            away = 0

        else:
            raise ValueError("Probabilities must be floats between 0 and 1, inclusive")
    else:
        pass

    if wildcard >= 0 and wildcard < away:
        return game["away_team"], game["home_team"], 0, False

    if wildcard >= away and wildcard < away + tie and ternary:
        # Simulating outcome of a penalty shootout. I assume it is a coin-toss. An advancing team is needed.
        teams = [game["away_team"], game["home_team"]]
        advances = np.random.choice(teams)
        teams.remove(advances)
        return advances, teams[0], 0.5, True

    if wildcard >= away + tie and wildcard <= 1:
        return game["home_team"], game["away_team"], 1, False
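
To see what the binary-to-ternary conversion does to the probabilities, here is a small standalone sketch. The function name `to_ternary` is hypothetical; it mirrors the formula inside the `if ternary:` branch above, leaving out the division by the minimum odds because that factor cancels once the three values are normalized by their sum.

```python
def to_ternary(home):
    """Map a binary home-win probability (0 < home < 1) to (home, tie, away)."""
    away = 1 - home
    home_odds = home / away
    tie_odds = 1
    away_odds = 1 - abs(home - 0.5) * 2
    total = home_odds + tie_odds + away_odds
    return home_odds / total, tie_odds / total, away_odds / total


# Probabilities for an even game and for two increasingly lopsided ones
for p in (0.5, 0.75, 0.9):
    h, t, a = to_ternary(p)
    print(f"binary {p:.2f} -> home {h:.3f}, tie {t:.3f}, away {a:.3f}")
```

An even game splits into thirds, and as the favorite gets stronger the tie share shrinks more slowly than the underdog's share, as described in the comments of the function.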

Next, I wrote a function to simulate the entire knockout round by iterating over every game.

def simulate_playoffs(games, teams, ternary=True):
    """
    Simulates the entire knockout stage
    """

    for game in games:
        team1, team2 = teams[game["home_team"]], teams[game["away_team"]]

        # Home field advantage is B.S. in modern soccer
        elo_diff = team1["rating"] - team2["rating"]

        # This is the most important piece
        game["elo_prob_home"] = 1.0 / (math.pow(10.0, (-elo_diff / 400.0)) + 1.0)

        # Only simulate games that have not been played yet; games with a
        # recorded result keep it, and the team Elo ratings are left untouched
        if game["advances"] == "" or game["loses"] == "":

            game["advances"], game["loses"], game["result_home"], game["penalties"] = (
                simulate_playoff_game(game, ternary)
            )

            # Elo shift based on K
            shift = 50.0 * (game["result_home"] - game["elo_prob_home"])

            # Apply shift
            team1["rating"] += shift
            team2["rating"] -= shift

        # Populate the semifinals, final, and third-place match based on previous rounds' results
        next_stage = int(game["to_match"])

        if next_stage < 6:

            if games[next_stage]["home_team"] == "":
                games[next_stage]["home_team"] = game["advances"]
            else:
                games[next_stage]["away_team"] = game["advances"]

        elif next_stage == 6:
            if games[next_stage]["home_team"] == "":
                games[next_stage]["home_team"] = game["advances"]
            else:
                games[next_stage]["away_team"] = game["advances"]

            if games[next_stage + 1]["home_team"] == "":
                games[next_stage + 1]["home_team"] = game["loses"]
            else:
                games[next_stage + 1]["away_team"] = game["loses"]

        else:
            pass
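
The two Elo steps used above, the win probability and the rating shift with K = 50, can be tried in isolation. This is a minimal sketch with hypothetical helper names (`elo_expected`, `elo_update`); the ratings come from the roster table earlier in the post.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of team A against team B under the Elo model."""
    return 1.0 / (10.0 ** (-(rating_a - rating_b) / 400.0) + 1.0)


def elo_update(rating_a, rating_b, result_a, k=50.0):
    """Return both ratings after a game; result_a is 1 (win), 0.5 (tie) or 0 (loss) for A."""
    shift = k * (result_a - elo_expected(rating_a, rating_b))
    return rating_a + shift, rating_b - shift


# Ratings after the group stage, from the roster table
argentina, ecuador = 2144, 1853

p = elo_expected(argentina, ecuador)
print(f"Binary win probability for Argentina: {p:.3f}")

# Ratings if Ecuador pulled off the upset (result_a = 0 for Argentina)
new_argentina, new_ecuador = elo_update(argentina, ecuador, result_a=0)
print(f"After an upset: Argentina {new_argentina:.1f}, Ecuador {new_ecuador:.1f}")
```

The shift is zero-sum: whatever one team gains, the other loses, so the total rating in the bracket stays constant across a simulated run.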

I then repeated this simulation 10,000 times:

n = 10000
playoff_results_teams = []
playoff_results_stage = []

for i in tqdm(range(n)):
    overall_result_teams = dict()
    overall_result_stage = dict()
    games = read_games("data/playoff_matches1.csv")
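    teams = {}

    # Reload the roster on every run so each simulation starts from the
    # post-group-stage ratings; the with-block closes the file afterwards
    with open("data/playoff_roster1.csv") as roster_file:
        for row in csv.DictReader(roster_file):
            teams[row['team']] = {
                'name': row['team'],
                'rating': float(row['rating'])
            }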
    
    simulate_playoffs(games, teams, ternary=True)
    
    playoff_pd = pd.DataFrame(games)
    
    # This is for collecting results of simulations per team
    for key in teams.keys():
        overall_result_teams[key] = collect_playoff_results(
            key,
            playoff_pd
            )
    playoff_results_teams.append(overall_result_teams)
    
    # Now, collecting results from stage-perspective
    overall_result_stage['whole_bracket'] = playoff_pd['advances'].to_list()
    overall_result_stage['Semifinals'] = playoff_pd.loc[playoff_pd['stage'] == 'quarterfinals', 'advances'].to_list()
    overall_result_stage['Final'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'advances'].to_list()
    overall_result_stage['third_place_match'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'loses'].to_list()
    overall_result_stage['fourth_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'loses'].to_list()[0]
    overall_result_stage['third_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'advances'].to_list()[0]
    overall_result_stage['second_place'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'loses'].to_list()[0]
    overall_result_stage['Champion'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'advances'].to_list()[0]
    overall_result_stage['match4'] = list(playoff_pd.loc[4, ['home_team', 'away_team']])
    overall_result_stage['match5'] = list(playoff_pd.loc[5, ['home_team', 'away_team']])
    
    playoff_results_stage.append(overall_result_stage)

I collected the results of these simulations in playoff_results_teams and playoff_results_stage. For this post, we are more interested in the data from playoff_results_stage.
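
One way to tally such a list of per-run dictionaries is to load it into a pandas DataFrame. The sketch below uses a tiny, made-up stand-in for playoff_results_stage (three runs instead of 10,000), and it is an assumption that a DataFrame named results_stage is built this way. Matchups are joined into strings before counting because Python lists are not hashable.

```python
import pandas as pd

# Hypothetical stand-in for playoff_results_stage: one dict per simulated bracket
playoff_results_stage = [
    {"Champion": "Argentina", "match4": ["Argentina", "Venezuela"]},
    {"Champion": "Colombia", "match4": ["Argentina", "Canada"]},
    {"Champion": "Argentina", "match4": ["Argentina", "Venezuela"]},
]

results_stage = pd.DataFrame(playoff_results_stage)

# Turn each [home, away] list into a hashable label before counting
match4_counts = results_stage["match4"].apply(" vs. ".join).value_counts()
champion_counts = results_stage["Champion"].value_counts()

print(match4_counts)
print(champion_counts)
```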

What do we get?

Let’s unfold the results match by match.

We have Argentina vs. Ecuador and Venezuela vs. Canada:

# This is a semifinal match
results_stage['match4'].value_counts()

[Argentina, Venezuela]   5489
[Argentina, Canada]      3316
[Ecuador, Venezuela]      737
[Ecuador, Canada]         458

The distribution of possible semifinal matchups shows that the most likely matchup is Argentina vs. Venezuela. However, let’s break it down further:

Argentina advances to the semifinals in 8,805 out of 10,000 simulations (5,489 + 3,316). I would personally bet on Argentina advancing to the semifinals.
Venezuela advances to the semifinals in 6,226 out of 10,000 simulations. This prediction is much tougher.

I’ll go out on a limb and predict it will be Argentina vs. Venezuela.

Next, we have Uruguay vs. Brazil and Colombia vs. Panama:

# This is a semifinal match
results_stage['match5'].value_counts()

[Uruguay, Colombia]    4422
[Brazil, Colombia]     4288
[Brazil, Panama]        659
[Uruguay, Panama]       631

With Colombia advancing in 8,710 out of 10,000 simulations, we can be confident that Colombia will make it to the semifinals.

As for Uruguay vs. Brazil… 🔥 This match shouldn’t even be predicted — it should be enjoyed. The history between these two countries, both on and off the soccer field, makes this THE match of the tournament. 🔥
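
The per-team figures are simply the matchup counts summed over every row in which a team appears. A minimal sketch, with the counts copied from the output above and hypothetical variable names:

```python
from collections import Counter

# Semifinal matchup counts copied from the output above: (home, away) -> simulations
match5_counts = {
    ("Uruguay", "Colombia"): 4422,
    ("Brazil", "Colombia"): 4288,
    ("Brazil", "Panama"): 659,
    ("Uruguay", "Panama"): 631,
}

# Add each matchup's count to both teams that appear in it
advances = Counter()
for matchup, count in match5_counts.items():
    for team in matchup:
        advances[team] += count

total = sum(match5_counts.values())
for team, count in advances.most_common():
    print(f"{team}: {count} / {total} = {count / total:.1%}")
```

Run on these counts, the script puts Uruguay and Brazil within about one percentage point of each other, which is why that quarterfinal is left uncalled.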

What’s the most likely final match?

Now that I’ve chickened out of predicting the Uruguay vs. Brazil game, let’s answer an easier question:

results_stage['Final'].value_counts()

[Argentina, Colombia]    3427
[Argentina, Uruguay]     2276
[Argentina, Brazil]      2231
[Venezuela, Colombia]     368
...

If I had to bet, I’d bet on the final being played between Argentina and Colombia. The problem is that the next two possibilities both come from the Uruguay vs. Brazil game, and taken together (2,276 + 2,231 = 4,507 simulations), they are more likely than the Argentina vs. Colombia matchup (3,427 simulations).

All I can confidently say is that Argentina is the most likely finalist (8,072 out of 10,000 simulations as a finalist).

Also…

Who will be crowned champion?

results_stage['Champion'].value_counts()

Argentina       5423
Colombia        1570
Uruguay         1053
Brazil           999
Venezuela        441
Ecuador          332
Canada           160
Panama            22

Who else? 😉

In all seriousness, just over half of the simulations (5,423 out of 10,000) show Argentina winning the Copa America 2024. This can be explained by Argentina’s relatively easier path to the final, avoiding the other three strongest teams until then. Meanwhile, those three strong contenders will face each other before the final.

Thus, the simulations show dispersed probabilities for the other teams, while consolidating around Argentina.
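
To put the champion counts on a probability scale, each count can be divided by the number of runs. The sketch below also attaches a binomial standard error to each share. That error term is an addition for illustration, not part of the original analysis, and it only reflects the noise of running 10,000 simulations, not any error in the Elo model itself.

```python
import math

# Champion counts copied from the output above
champion_counts = {
    "Argentina": 5423,
    "Colombia": 1570,
    "Uruguay": 1053,
    "Brazil": 999,
    "Venezuela": 441,
    "Ecuador": 332,
    "Canada": 160,
    "Panama": 22,
}

n = sum(champion_counts.values())
shares = {}
for team, wins in champion_counts.items():
    p = wins / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a simulated proportion
    shares[team] = (p, se)
    print(f"{team:<10} {p:6.1%}  (simulation noise: +/- {1.96 * se:.1%})")
```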

However, for all soccer enthusiasts, this doesn’t mean Argentina will easily win the trophy. Each knockout stage match must be played, and an easy path to the final won’t guarantee victory. Once Argentina faces either Colombia, Uruguay, or Brazil, all bets are off.

Conclusion

Soccer is often called “The Beautiful Game,” and every time I use data science and statistical techniques to forecast matches, I’m reminded of how beautifully unpredictable it is. While not entirely unpredictable, the randomness in soccer makes it highly entertaining:

Ecuador was predicted to beat Venezuela, but Enner Valencia’s expulsion from the game due to a reckless play left Ecuador with one fewer player for most of the match. This increased the team’s fatigue rate, leading to their loss.
A similar situation occurred in the United States vs. Panama game. Tim Weah lost his cool, punched Roderick Miller, and was justifiably sent off. Consequently, the United States lost a match they needed to win for the simulations and statistical expectations to align with reality. After all, the statistics also predicted Uruguay would beat the United States, and THAT did become reality.
And what about those matches where a team that played brilliantly and took numerous great shots on goal (but didn’t score) loses to a team that only generated one good shot (and scored)? Our innate desire for fairness is triggered in such situations, especially if you support one of the teams.

Feel free to share your thoughts and let me know which team you’re rooting for in Copa America 2024! Remember that you can also play around with this simulation by forking/cloning my GitHub repo.