Predicting the Winner of Copa America 2024 with Data Science — Part 1

Jun 23, 2024

The Copa America 2024 is underway in the United States, and yes, I am excited! This is the oldest still-running continental soccer (football, if you are that kind of person) competition in the world. And I am excited since my two national teams play in this tournament: My home country, Panama, and my favorite top national team, Argentina (which is the current champion).

But I am also a data scientist, and I can’t simply watch games without a little statistics chip on my shoulder! The excitement of simulating a tournament like Copa America 2024, using the Elo rating system, is something I couldn’t resist.

The Elo rating system?

And yes, I said the Elo rating system. It was initially devised for chess, but it has become a popular method for ranking teams in various sports, including soccer. So much so that FIFA uses a variant of this system for its own rankings (though I personally believe the ranking at eloratings.net, not FIFA’s, is the gold standard for international soccer).

It works by calculating the relative strength of teams based on their game results. There’s plenty of material online if you’d like to learn more about the Elo rating system. For an intro, read this blog post, and for a soccer-specific intro, see here for international soccer and here for European club soccer.
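To make the idea concrete, here is a minimal sketch of the two pieces of arithmetic behind Elo: the expected score for a matchup and the rating shift after a result. The function names and the default K of 50 are illustrative choices for this sketch, and the two ratings are the pre-tournament values for Argentina and Canada used later in this post.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) for team A against team B under Elo."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / 400.0))


def update_ratings(rating_a, rating_b, result_a, k=50.0):
    """Return both ratings after one game.

    result_a is 1 for a win by A, 0.5 for a tie, 0 for a loss.
    k controls how fast ratings move; 50 is an illustrative default.
    """
    shift = k * (result_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the sum of the ratings is preserved
    return rating_a + shift, rating_b - shift


# Argentina (2144) vs Canada (1721): expected score of roughly 0.92
p_argentina = expected_score(2144, 1721)
new_arg, new_can = update_ratings(2144, 1721, result_a=1)
print(round(p_argentina, 3), round(new_arg, 1), round(new_can, 1))
```

A win by the favorite moves the ratings only a little, while an upset moves them a lot, which is what lets the ratings track form over time.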

In this blog post, I’ll walk you through the Jupyter notebook I created to simulate the Copa America 2024 using the Elo rating system. This simulation aims to predict which team might emerge as the champion by running multiple iterations of the tournament and analyzing the results.

This simulation uses Elo ratings from eloratings.net to measure team strength and update it after each simulated game. The Elo implementation is based on FiveThirtyEight’s NFL forecasting game.

Do you simply want to dive into the code? Fork my GitHub repo here.

Setting Up the Environment

First, we need to import the necessary libraries and set up the initial conditions for our simulation:

import numpy as np
import pandas as pd
import csv
from tqdm import tqdm
from joblib import Parallel, delayed
from src.copa_america_simulator import *

We also need to set up some data. First, we need to define the roster of teams that will play this tournament, along with their Elo ratings before the start of the tournament. In this case, I prepared a CSV file with this information:

| Group | Team          | Elo rating |
|-------|---------------|------------|
|  A    | Argentina     |    2144    |
|  A    | Peru          |    1744    |
|  A    | Chile         |    1725    |
|  A    | Canada        |    1721    |
|  B    | Ecuador       |    1876    |
|  B    | Mexico        |    1791    |
|  B    | Venezuela     |    1725    |
|  B    | Jamaica       |    1642    |
|  C    | Uruguay       |    1992    |
|  C    | United States |    1790    |
|  C    | Panama        |    1698    |
|  C    | Bolivia       |    1592    |
|  D    | Brazil        |    2028    |
|  D    | Colombia      |    2015    |
|  D    | Paraguay      |    1710    |
|  D    | Costa Rica    |    1620    |
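As a rough sketch of how such a file could be turned into the dictionary the simulation works with, the snippet below parses a small in-memory excerpt. The header names (group, team, rating) are an assumption inferred from the keys the simulation code reads; the actual file in the repo may differ, and load_teams is a hypothetical helper.

```python
import csv
import io

# Hypothetical excerpt of data/roster.csv; header names are assumed from
# the keys the simulation code reads (row['team'], row['rating']).
ROSTER_CSV = """group,team,rating
A,Argentina,2144
A,Peru,1744
A,Chile,1725
A,Canada,1721
"""


def load_teams(handle):
    """Build the {team name: state} dictionary used during one simulation."""
    teams = {}
    for row in csv.DictReader(handle):
        teams[row["team"]] = {
            "name": row["team"],
            "rating": float(row["rating"]),  # ratings are updated game by game
            "points": 0,                     # group-stage points start at zero
        }
    return teams


teams = load_teams(io.StringIO(ROSTER_CSV))
print(teams["Argentina"])
```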

I also prepared another CSV file with the group stage matches, listed in the order they are played:

| Group | Date | Home Team      | Away Team      |
|-------|------|----------------|----------------|
| A     | 1    | Argentina      | Canada         |
| A     | 2    | Peru           | Chile          |
| B     | 3    | Mexico         | Jamaica        |
| B     | 4    | Ecuador        | Venezuela      |
| C     | 5    | United States  | Bolivia        |
| C     | 6    | Uruguay        | Panama         |
| D     | 7    | Brazil         | Costa Rica     |
| D     | 8    | Colombia       | Paraguay       |
| A     | 9    | Chile          | Argentina      |
| A     | 10   | Peru           | Canada         |
| B     | 11   | Venezuela      | Mexico         |
| B     | 12   | Ecuador        | Jamaica        |
| C     | 13   | Panama         | United States  |
| C     | 14   | Uruguay        | Bolivia        |
| D     | 15   | Paraguay       | Brazil         |
| D     | 16   | Colombia       | Costa Rica     |
| A     | 17   | Argentina      | Peru           |
| A     | 18   | Canada         | Chile          |
| B     | 19   | Mexico         | Ecuador        |
| B     | 20   | Jamaica        | Venezuela      |
| C     | 21   | United States  | Uruguay        |
| C     | 22   | Bolivia        | Panama         |
| D     | 23   | Brazil         | Colombia       |
| D     | 24   | Costa Rica     | Paraguay       |
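The read_games helper that loads this schedule lives in the repo and is not shown in this post. A minimal sketch of what such a reader could look like follows; the column names and the empty result_home field for unplayed games are assumptions inferred from the keys the simulation code uses, and read_games_sketch is a hypothetical name.

```python
import csv
import io

# Hypothetical excerpt of data/matches.csv. Column names are assumed from
# the keys the simulation code uses (home_team, away_team, result_home).
MATCHES_CSV = """group,date,home_team,away_team,result_home
A,1,Argentina,Canada,
A,2,Peru,Chile,
"""


def read_games_sketch(handle):
    """Return one dict per match; an empty result_home means 'not played yet'."""
    games = []
    for row in csv.DictReader(handle):
        row["date"] = int(row["date"])
        # Keep "" for unplayed games, otherwise store 1, 0.5 or 0 as a float
        if row["result_home"] != "":
            row["result_home"] = float(row["result_home"])
        games.append(row)
    return games


games = read_games_sketch(io.StringIO(MATCHES_CSV))
print(games[0]["home_team"], "vs", games[0]["away_team"])
```

Leaving result_home empty for unplayed games would also make it easy to fill in real results as the tournament progresses and simulate only what is left.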

Simulation Function

I defined a function to simulate the group stage of the tournament. This function reads the team rosters and match schedules from CSV files and then uses the Elo ratings to simulate match outcomes.

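def run_group_stage_simulation(n, j):
    """
    Run n simulations of the Copa America group stage and return the
    roster with the average points per team in column avg_pts_{j+1}
    (j is the index of this batch of simulations)
    """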
    
    teams_pd = pd.read_csv("data/roster.csv")
    
    for i in range(n):
        games = read_games("data/matches.csv")
        teams = {}
    
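        # Rebuild the teams dictionary so every run starts from the
        # pre-tournament ratings and zero points
        with open("data/roster.csv") as roster_file:
            for row in csv.DictReader(roster_file):
                teams[row['team']] = {
                    'name': row['team'],
                    'rating': float(row['rating']),
                    'points': 0
                    }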
    
        simulate_group_stage(
            games,
            teams,
            ternary=True
            )
    
        collector = []
        for key in teams.keys():
            collector.append(
                {"team": key,
                 f"simulation{i+1}": teams[key]['points']}
            )

        temp = pd.DataFrame(collector)
        teams_pd = pd.merge(teams_pd, temp)
    
    sim_cols = [
        a for a in teams_pd.columns if "simulation" in a]
    teams_pd[
        f"avg_pts_{j+1}"
        ] = teams_pd[sim_cols].mean(axis=1)
    not_sim = [
        b for b in teams_pd.columns if "simulation" not in b]
    simulation_result = teams_pd[not_sim]
    
    return simulation_result

However, the function above is mainly a wrapper that lets the joblib package run many simulations in parallel. The “simulation” itself is really done by the simulate_group_stage function:

def simulate_group_stage(games, teams, ternary=True):
    """
    Simulates the entire group stage
    """

    for game in games:
        team1, team2 = teams[game["home_team"]], teams[game["away_team"]]

        # No home-field advantage is applied: the raw rating difference is used
        elo_diff = team1["rating"] - team2["rating"]

        # This is the most important piece: the forecasted probability for the home team
        game["elo_prob_home"] = 1.0 / (math.pow(10.0, (-elo_diff / 400.0)) + 1.0)

        # If the game has not been played yet, simulate it and update the Elo ratings
        if game["result_home"] == "":

            game["result_home"] = simulate_group_stage_game(game, ternary)

            # Elo shift based on K
            shift = 50.0 * (game["result_home"] - game["elo_prob_home"])

            # Apply shift
            team1["rating"] += shift
            team2["rating"] -= shift

            # Apply points
            if game["result_home"] == 0:
                team1["points"] += 0
                team2["points"] += 3
            elif game["result_home"] == 0.5:
                team1["points"] += 1
                team2["points"] += 1
            else:
                team1["points"] += 3
                team2["points"] += 0

And the simulate_group_stage function uses another function to simulate each game, “cleverly” called simulate_group_stage_game:

def simulate_group_stage_game(game, ternary=True):
    """
    Simulates a single game in the group stage
    """

    home = game["elo_prob_home"]
    away = 1 - game["elo_prob_home"]
    tie = 0

    # Simulating game proper
    wildcard = random.uniform(0, 1)

    # Concoction to go from binary probabilities to ternary
    if ternary:
        if home > 0 and home < 1:
            home_odds = home / away
            tie_odds = 1
            away_odds = 1 - abs(home - 0.5) * 2

            home_odds1 = (home / away) / min(away_odds, tie_odds, home_odds)
            tie_odds1 = 1 / min(away_odds, tie_odds, home_odds)
            away_odds1 = (1 - abs(home - 0.5) * 2) / min(away_odds, tie_odds, home_odds)

            home = home_odds1 / (home_odds1 + tie_odds1 + away_odds1)
            tie = tie_odds1 / (home_odds1 + tie_odds1 + away_odds1)
            away = away_odds1 / (home_odds1 + tie_odds1 + away_odds1)

        elif home == 0:
            tie = 0
            away = 1

        elif home == 1:
            tie = 0
            away = 0

        else:
            raise ValueError("Probabilities must be floats between 0 and 1, inclusive")
    else:
        pass

    if wildcard >= 0 and wildcard < away:
        return 0

    if wildcard >= away and wildcard < away + tie and ternary:
        return 0.5

    if wildcard >= away + tie and wildcard <= 1:
        return 1

Now, before we go any further, I’d like to make some notes on this Elo implementation:

- Per eloratings.net, the K constant is set to 50 as the Copa America is a continental competition.
- Probabilities given by the Elo rating system are binary. As you can see in the code, I came up with a workaround to convert binary probabilities to ternary probabilities, given that a soccer match admits three outcomes (win, tie, lose).
- I did not simulate score lines. Rather, I simply used probabilities to decide whether a team would win, tie, or lose. As such, I did not use the goal difference multiplier specified in eloratings.net.

I’ll be happy to talk about the workaround, but I wouldn’t take it as gospel. There might be better ways to do this, but I did not research them. I wanted to have fun, not produce an academic-paper-worthy method or a sellable product.
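If you want to see what the workaround does to a given Elo probability, the sketch below isolates the arithmetic from simulate_group_stage_game for the case 0 < home < 1. The helper name ternary_probabilities is hypothetical. The rescaling by the minimum in the original code is left out here because it cancels once the three values are normalized to sum to 1.

```python
def ternary_probabilities(home: float):
    """Split a binary Elo probability into (home win, tie, away win).

    Mirrors the arithmetic in simulate_group_stage_game for 0 < home < 1.
    """
    if not 0.0 < home < 1.0:
        raise ValueError("home must be strictly between 0 and 1")
    away = 1.0 - home
    home_odds = home / away
    tie_odds = 1.0
    away_odds = 1.0 - abs(home - 0.5) * 2.0
    # Normalizing makes the three outcomes sum to 1
    total = home_odds + tie_odds + away_odds
    return home_odds / total, tie_odds / total, away_odds / total


for p in (0.5, 0.9, 0.1):
    win, tie, loss = ternary_probabilities(p)
    print(f"home={p:.1f} -> win={win:.3f} tie={tie:.3f} loss={loss:.3f}")
```

Trying mirrored inputs such as 0.9 and 0.1 is a quick way to check whether the split treats the home and away sides the same way.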

Running the Simulations

And then, the simulation is orchestrated by the following lines of code:

# Each worker reads in the matches and teams as dictionaries and proceeds with that data type
n = 100 # How many simulations to run per batch
m = 100 # How many batches (one average per batch) to collect

roster_pd = Parallel(n_jobs=5)(
    delayed(run_group_stage_simulation)(
        n, j) for j in tqdm(range(m)))

# Merge the per-batch results into one table, one avg_pts column per batch
roster = roster_pd[0]
for t in tqdm(range(1, m)):
    roster = pd.merge(
        roster,
        roster_pd[t]
        )

Group Stage Results

Let’s check out the results for the Copa America 2024!

roster[not_sim].sort_values(
    by=[
        'group',
        'avg_sim_pts'
        ],
    ascending=False
    )

| Group | Team          | Mean points | 99% confidence interval |
|-------|---------------|-------------|-------------------------|
|   A   | Argentina     |    6.76     |       6.52 - 7.05       |
|   A   | Chile         |    3.55     |       3.11 - 3.96       |
|   A   | Peru          |    3.05     |       2.74 - 3.53       |
|   A   | Canada        |    2.65     |       2.27 - 3.03       |
|   B   | Ecuador       |    5.75     |       5.24 - 6.12       |
|   B   | Mexico        |    4.59     |       4.06 - 4.98       |
|   B   | Venezuela     |    3.38     |       2.99 - 3.92       |
|   B   | Jamaica       |    2.32     |       1.92 - 2.70       |
|   C   | Uruguay       |    6.71     |       6.29 - 7.14       |
|   C   | United States |    4.71     |       4.31 - 5.30       |
|   C   | Panama        |    2.88     |       2.48 - 3.32       |
|   C   | Bolivia       |    1.83     |       1.42 - 2.18       |
|   D   | Colombia      |    6.69     |       6.34 - 7.20       |
|   D   | Brazil        |    5.38     |       5.06 - 5.72       |
|   D   | Paraguay      |    2.68     |       2.33 - 3.06       |
|   D   | Costa Rica    |    1.45     |       1.23 - 1.77       |

You can see that:

- Group A should see Argentina easily top the group. However, while Chile has an edge over Peru, it’s not definitive since their confidence intervals overlap. Canada most likely ends up in last place, although its interval also overlaps with Peru’s.

- Group B should definitely be: Ecuador, Mexico, Venezuela and Jamaica. There is no overlap in confidence intervals, and if we look at their team strengths, it’s clear that Ecuador tops the group. Although I would say this is the sleeper group… it has the lowest overall team strength in the tournament.

- Group C should definitely be: Uruguay, United States, Panama, and Bolivia. Panama might at best get a win against Bolivia, just like in the 2016 Copa America (when it was drawn into a group with eventual finalists Argentina and Chile).

- Group D should definitely be: Colombia, Brazil, Paraguay, Costa Rica.
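The overlap reasoning in the list above can be checked mechanically. The sketch below copies the 99% intervals from the results table and flags neighboring teams within each group whose intervals overlap; the helper name intervals_overlap is just an illustrative choice.

```python
# 99% intervals (lower, upper) copied from the results table, in table order
groups = {
    "A": [("Argentina", 6.52, 7.05), ("Chile", 3.11, 3.96),
          ("Peru", 2.74, 3.53), ("Canada", 2.27, 3.03)],
    "B": [("Ecuador", 5.24, 6.12), ("Mexico", 4.06, 4.98),
          ("Venezuela", 2.99, 3.92), ("Jamaica", 1.92, 2.70)],
    "C": [("Uruguay", 6.29, 7.14), ("United States", 4.31, 5.30),
          ("Panama", 2.48, 3.32), ("Bolivia", 1.42, 2.18)],
    "D": [("Colombia", 6.34, 7.20), ("Brazil", 5.06, 5.72),
          ("Paraguay", 2.33, 3.06), ("Costa Rica", 1.23, 1.77)],
}


def intervals_overlap(low_a, high_a, low_b, high_b):
    """Two closed intervals overlap when each starts before the other ends."""
    return low_a <= high_b and low_b <= high_a


overlapping = []
for group, rows in groups.items():
    # Compare each team with the one ranked immediately below it
    for (team_a, low_a, high_a), (team_b, low_b, high_b) in zip(rows, rows[1:]):
        if intervals_overlap(low_a, high_a, low_b, high_b):
            overlapping.append((group, team_a, team_b))

print(overlapping)
```

With the table’s numbers, the only pairs flagged are Chile/Peru and Peru/Canada, both in Group A.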

(Honestly, this is a bit of a boring forecasting/prediction exercise. I guess the European Championship is more exciting…).

Now that we have decided which teams will make it to the playoffs, let’s simulate those and see who we predict to win the Copa America 2024!

Simulating the Knockout Stage

The knockout stage is where it gets truly interesting. Using the results from the group stage simulations, I simulated the knockout rounds to predict the overall winner of the tournament.

n = 10000
playoff_results_teams = []
playoff_results_stage = []

for i in tqdm(range(n)):
    overall_result_teams = dict()
    overall_result_stage = dict()
    games = read_games("data/playoff_matches.csv")
    teams = {}
    
    for row in [
        item for item in csv.DictReader(open("data/playoff_roster.csv"))]:
        teams[row['team']] = {
            'name': row['team'],
            'rating': float(row['rating'])
            }
    
    simulate_playoffs(games, teams, ternary=True)
    
    playoff_pd = pd.DataFrame(games)
    
    # This is for collecting results of simulations per team
    for key in teams.keys():
        overall_result_teams[key] = collect_playoff_results(
            key,
            playoff_pd
            )
    playoff_results_teams.append(overall_result_teams)
    
    # Now, collecting results from stage-perspective
    overall_result_stage['whole_bracket'] = playoff_pd['advances'].to_list()
    overall_result_stage['Semifinals'] = playoff_pd.loc[playoff_pd['stage'] == 'quarterfinals', 'advances'].to_list()
    overall_result_stage['Final'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'advances'].to_list()
    overall_result_stage['third_place_match'] = playoff_pd.loc[playoff_pd['stage'] == 'semifinals', 'loses'].to_list()
    overall_result_stage['fourth_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'loses'].to_list()[0]
    overall_result_stage['third_place'] = playoff_pd.loc[playoff_pd['stage'] == 'third_place', 'advances'].to_list()[0]
    overall_result_stage['second_place'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'loses'].to_list()[0]
    overall_result_stage['Champion'] = playoff_pd.loc[playoff_pd['stage'] == 'final', 'advances'].to_list()[0]
    overall_result_stage['match4'] = list(playoff_pd.loc[4, ['home_team', 'away_team']])
    overall_result_stage['match5'] = list(playoff_pd.loc[5, ['home_team', 'away_team']])
    
    playoff_results_stage.append(overall_result_stage)
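The simulate_playoffs and collect_playoff_results helpers live in the repo and are not shown here. As a rough illustration of the idea only, the sketch below plays a hypothetical four-team bracket many times and counts champions. It assumes a knockout game always produces a winner drawn from the binary Elo probability and, for brevity, does not update ratings between rounds; the pairings, function names, and seed are made up for the example, and the ratings are the pre-tournament values from the roster table.

```python
import random
from collections import Counter


def win_probability(rating_a, rating_b):
    """Binary Elo probability that team A beats team B."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / 400.0))


def play_knockout(team_a, team_b, ratings, rng):
    """Return (winner, loser); a knockout game always produces a winner."""
    p_a = win_probability(ratings[team_a], ratings[team_b])
    return (team_a, team_b) if rng.random() < p_a else (team_b, team_a)


def simulate_final_four(ratings, bracket, rng):
    """bracket holds two semifinal pairings; returns the champion's name."""
    finalists = [play_knockout(a, b, ratings, rng)[0] for a, b in bracket]
    return play_knockout(finalists[0], finalists[1], ratings, rng)[0]


ratings = {"Argentina": 2144, "Colombia": 2015, "Brazil": 2028, "Uruguay": 1992}
bracket = [("Argentina", "Colombia"), ("Brazil", "Uruguay")]  # hypothetical pairings
rng = random.Random(2024)  # fixed seed so the run is repeatable
n = 5000

champions = Counter(simulate_final_four(ratings, bracket, rng) for _ in range(n))
shares = {team: count / n for team, count in champions.items()}
print(shares)
```

Counting champions across runs is the same idea as the value_counts call used on the real results, just on a much smaller bracket.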

Who wins this tournament?

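# Assumed step: gather the per-simulation dictionaries into a DataFrame
results_stage = pd.DataFrame(playoff_results_stage)
results_stage['Champion'].value_counts(normalize=True)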

Argentina        0.5527
Colombia         0.1459
Brazil           0.1133
Uruguay          0.0936
Ecuador          0.0591
Mexico           0.0174
Chile            0.0116
United States    0.0064

Well, who else? My dear and beloved Argentina is the team most likely to win this Copa America. In ten thousand simulations of the playoffs, it ends up as champion in more than half of them!

Now, bear in mind that I am running this playoff simulation assuming that these will be the teams that make it to the playoffs. Also, I am using the Elo ratings from before the start of the tournament, which isn’t ideal given that these ratings will adjust to the results of the group stage games. Ratings at the end of the group stage would be a much better reflection of the teams’ true strength. So, I’d wait until the group stage is over to offer you a more realistic simulation of the playoffs!

Conclusion

The results of the simulation give us a probabilistic view of the potential outcomes of the Copa America 2024. While no model can predict the future with certainty, the Elo rating system provides a robust framework for understanding the relative strengths of the teams and making informed predictions.

Using data science techniques to simulate and analyze soccer tournaments not only enhances our understanding of the game but also adds an extra layer of excitement as we eagerly await the real-world outcomes. Whether you’re a data enthusiast or a soccer fan, I hope this simulation inspires you to explore the fascinating world of sports analytics.

Feel free to share your thoughts and let me know which team you’re rooting for in Copa America 2024! Remember that you can also play around with this simulation by forking/cloning my GitHub repo.