Predicting Euro 2024 with Snowflake ML

We’re days away from the 2024 UEFA European Championship; 24 teams competing over 30 days, all playing to lift the Euro Cup trophy at the Olympiastadion stadium in Berlin on July 14th.

As a lifelong England ⚽ fan, the Euro Championship is a great filler to the void of international football between the World Cup tournaments, and so with the competition imminent I decided to couple the love of the game with another activity I spend a great deal of time on; AI/ML with Snowflake.

This culminated in pretty comprehensive demo of some of the latest ML features being released by Snowflake and thus, during the 2024 Snowflake Data Cloud Summit, I partnered with my Sales Engineering colleague (and more importantly Italy supporter) Matteo Consoli to present how we used Snowflake ML to predict the outcome of Euro 2024.

Setting the Stage

The end-to-end flow of this whole prediction is pretty long, since it covers Ingestion, Transformation, ML Modeling, and finally the predictions. The entire code and instructions can be found across these 2 resources:

  • Code repo is here
  • Quickstart guide located here

I definitely encourage you to run through the Quickstart, it’ll cover the code much more comprehensively than this post.

This post will summarize a lot of this and focusing mainly on the tools, techniques, and rationale behind why we built this prediction process the way we did.

Few other things to note:

  1. The match outcomes are win, lose, or draw. The number of goals scored are not involved. This means that group stage may, in some cusp scenarios, not calculate the standings entirely correctly — this has been rare but may happen.
  2. Fixtures in the latter knockout stages (quarter-finals and beyond) may not be correct in how they are determined — the logic here seemed unclear and the main emphasis was demonstrating Snowflake ML.
  3. I’ve run this prediction model countless times, there’s been a pretty consistent trend of 2–3 teams coming through as winners but for the purposes of this post I’ll be running it again and using the exact output we get.
  4. I like to use the occasional GIF. I apologize in advance.

Assuming you can live with all this — let’s move on!

How could I write a football related post without a Ted Lasso reference?

Tools, Data, and Getting Started…

Since this was part of a Hands-on-Lab at Snowflake’s Data Cloud Summit, this was heavily geared to using Snowflake’s various ML tools. Whilst the concepts are pretty universal, it will be using the following features of Snowflake:

I really can’t emphasize how far Snowflake has come in its AI/ML capabilities. In my 4 years here, we’ve gone from nothing to essentially a full suite of tools for ML practitioners, and our product and engineering teams are still working hard!

And…this doesn’t even include the wave of GenAI features we’re putting out there with Cortex.

The suite of functions available under Snowflake ML

The code in the repo is pretty portable to other notebook environments, and also includes a version for one of my other favorite tools; Hex.

For this project we sourced our data from Kaggle, and used the following datasets:

In addition to this, we created a .csv file containing the upcoming fixtures for the Euros which we’ll use for prediction stage of this process. You’ll find the 3 files located in the github repo.

To get our environment in Snowflake created, you’ll need to run the setup script to create a database, and then the first notebook to get the data ingested.

Just to make sure this dataset is legit, I wanted to make sure England’s 1966 World Cup win over Germany at Wembley stadium was there:

How cool is this? Snowflake Notebooks support Python AND SQL

Data Transformation

First up, let’s get a meaningful dataset for model training. The 2 main datasets we have are ranking and historical match results, so this will contribute to the main set of features we’ll be building for the modeling stage.

Quick note — because we have rankings from 1992 , we’ll use 1994 as the start date of any historical match data (because I like round numbers) which’ll give us ~30 year of results. In addition we’ll remove friendlies from our dataset; these are not as representative of games as competition based ones, and managers tend to experiment with new players/lineups.

Ranking Data

We have all the historical match data, and with that we want to find the corresponding rank for both teams at the date of the match. In simple terms we’re going to take the match history and join to the rank table, that is sorted in date order — we have to join on a ≤ basis versus an = basis since the rank dates many not lineup precisely with the match date.

# History Dataframe
df_match_history = (
session.table("results")
.filter(
(F.col("date") >= "1994-01-01") &
(F.col("tournament") != "Friendly")
)
.sort(F.col("id"))
)

# Rank Dataframe
df_rank = session.table("rankings").sort(F.col("rank_date"), ascending=True)

# Join Match History with Rank
df_match_rank_home = df_match_history.join(
df_rank,
((df_rank['rank_date'] <= df_match_history['date']) &
(df_match_history['home_team'] == df_rank['country_full'])),
'left'
).sort(['id', df_rank['rank_date'].desc()])

# Add a row number and filter where row_number == 1
window_spec = Window.partition_by('id').order_by(df_rank['rank_date'].desc())
df_match_rank_home = df_match_rank_home.select(
F.row_number().over(window_spec).alias('row_number'),
'id',
'rank',
'rank_date'
).filter(F.col('row_number') == 1)

df_match_rank_home = df_match_rank_home.withColumnRenamed('id', 'home_id') \
.withColumnRenamed('rank', 'home_team_rank') \
.drop('row_number', 'rank_date')

We’ll do the same for the Away team, and once we join these 2 together we’ll have a dataframe containing every match ID along with the respective ranks for home and away teams:

Trailing Performance Data

The next set of features we’re going to create will relate to trailing performance going into a game, this will be an indicator of team form in the lead up to a match.

To do this we took the UDTF approach, where n trailing games are passed through to the UDTF and these are aggregated into statistics like total wins, total losses, goals difference etc.

output_struct = StructType([
StructField("ttl_wins", FloatType()),
StructField("ttl_losses", FloatType()),
StructField("ttl_draws", FloatType()),
StructField("goal_diff",FloatType())
])

@udtf(output_schema = output_struct,
input_types = [IntegerType(),DateType(), FloatType(), FloatType(), IntegerType()],
name = "calc_performance",
session = session,
is_permanent=True,
stage_location="@python_load",
packages=["pandas"],
replace=True)

class calc_ttl_performance:
def __init__(self):
self.fixture_id = []
self.date = []
self.goals_for = []
self.goals_against = []
self.is_home = []

def process(self, fixture_id, date, goals_for, goals_against, is_home):
self.fixture_id.append(fixture_id)
self.date.append(date)
self.goals_for.append(goals_for)
self.goals_against.append(goals_against)
self.is_home.append(is_home)

def end_partition(self):
df = pd.DataFrame(zip(self.fixture_id, self.date, self.goals_for, self.goals_against, self.is_home),
columns=['fixture_id', 'date', 'home_team_goals', 'away_team_goals', 'is_home'])

df['goals_scored'] = np.where(df['is_home'] == 1, df['home_team_goals'], df['away_team_goals'])
df['goals_conceded'] = np.where(df['is_home'] == 1, df['away_team_goals'], df['home_team_goals'])
df['wins'] = np.where((df['home_team_goals'] > df['away_team_goals']) & (df['is_home'] == 1), 1,
np.where((df['home_team_goals'] < df['away_team_goals']) & (df['is_home'] != 1), 1, 0))
df['draws'] = np.where(df['home_team_goals'] == df['away_team_goals'], 1, 0)
df['losses'] = np.where((df['home_team_goals'] < df['away_team_goals']) & (df['is_home'] == 1), 1,
np.where((df['home_team_goals'] > df['away_team_goals']) & (df['is_home'] != 1), 1, 0))


ttl_wins = np.sum(df['wins'])
ttl_losses = np.sum(df['losses'])
ttl_draws = np.sum(df['draws'])
goal_diff = np.sum(df['goals_scored']) - np.sum(df['goals_conceded'])

if np.isnan(goal_diff):
goal_diff = 0

yield (ttl_wins, ttl_losses, ttl_draws, goal_diff)

Using this we can then create a set of features for both teams that will represent trailing performance. By way of an example this is how we do it for the home team (note — we’re using the trailing 15 games but this can be adjusted):

df_home_team = (
df_match_history.join(
df_history,
(
(
(df_match_history["home_team"] == df_history["home_team"]) | (df_match_history["home_team"] == df_history["away_team"])
)
& (df_history["date"] < df_match_history["date"])
),
"left",
)
.select(
df_match_history["id"].alias("id"),
df_match_history["date"].alias("match_date"),
df_match_history["home_team"].alias("team_1"),
df_match_history["away_team"].alias("team_2"),
df_match_history["neutral"].alias("neutral"),
df_history["home_team"].alias("home_team"),
df_history["away_team"].alias("away_team"),
df_history["date"].cast(DateType()).alias("history_date"),
df_history["home_team_score"].alias("home_team_score"),
df_history["away_team_score"].alias("away_team_score"),
)
.with_column(
"is_home",
F.when(df_match_history["home_team"] == df_history["home_team"], 1).otherwise(0),
)
.with_column(
"row_number",
F.row_number().over(Window.partition_by("id").order_by(F.col("history_date").desc())),
)
.filter(F.col("row_number") <= 15)
.drop("row_number")
)

perf_udtf = F.table_function("calc_performance")

df_home_team = df_home_team.join_table_function(
perf_udtf(
df_home_team.col('id').cast(IntegerType()),
df_home_team.col('history_date').cast(DateType()),
df_home_team.col('home_team_score').cast(FloatType()),
df_home_team.col('away_team_score').cast(FloatType()),
df_home_team.col('is_home').cast(IntegerType())).over(partition_by='id',order_by=['history_date'])
)\
.select(
F.col('id').alias('home_fixture_id'),
F.col('goal_diff').alias('home_goal_diff'),
F.col('ttl_wins').alias('home_ttl_wins'),
F.col('ttl_losses').alias('home_ttl_losses')
)

After we do this for both teams, we’ll be left with a performance dataframe like so:

Final Features, and the Target Variable

A couple more features to consider…

Earlier we determined the rank for each team at a point in time (i.e. the match date). We’ll use these rank values to calculate a “rank delta” — essentially the difference between the home team and away team rank since this is a more useful vs the absolute values.

The other feature we can use relates to the match location, for any team playing in their home country it can definitely yield an advantage and so we’re able to use a “neutral” flag to specify whether the home team has this advantage.

Target Variable

For the initial round of model experimentation, we were trying to predict the outcome of a home win, draw, or loss (or away win). No matter what we did, we couldn’t get past an accuracy of 55% — whilst I think this is actually pretty representative of Football, we wanted better.

So instead we looked at this from a binary outcome approach; home win, or either a draw and away win. We then run predictions for each match twice; Home vs Away and then Away vs Home (we will refer to them as Team 1 and Team 2 from here on in).

Thus, with a simple python function we were able to rationalize our target variable into a 1 or 0, and then during the prediction stage we will run each match twice (reversing the teams in the second one) and have an implied result from the 2 predictions:

We’ll see the impact of this when we get to our modeling in the next section, but let’s take a look at the final feature set with the target variable:

Model Training

This is where things get really cool with Snowflake. Pun totally intended.

Hyperparamater Optimization

Jumping straight into things, we’re going to start with Hyperparameter Tuning which is really easy with Snowflake ML. In our scenario we’re going to use GridSearchCV with the following hyperparameter grid:

hyperparam_grid = {
"n_estimators": [50, 100, 200, 300, 400, 500],
"learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3, 0.4],
"max_depth": [3, 4, 5, 6, 7, 8]
}

This means with a 6x6x6 grid, across 5 folds we’re evaluating 1,080 different models. Sounds like it’s going to take a while to run right?

Thank you Dwight - no, it isn’t. With a single line of code we’re able to scale up our XS Snowflake Warehouse to an XL like so:

alter warehouse euro2024_wh set warehouse_size = xlarge

This should run in a matter of seconds, and we’ll now have a much larger compute resource available to run our HPO:

from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.model_selection.grid_search_cv import GridSearchCV

train_data = session.table(f'final_data')

FEATURE_COLS = [c for c in train_data.columns if c != "GAME_OUTCOME" and c != "ID"]
LABEL_COLS = ["GAME_OUTCOME"]

pipeline = Pipeline(
steps = [
(
"scaler",
StandardScaler(
input_cols=FEATURE_COLS,
output_cols=FEATURE_COLS
)
),
(
"GridSearchCV",
GridSearchCV(
estimator=XGBClassifier(random_state=42),
param_grid=hyperparam_grid,
scoring='accuracy',
label_cols=LABEL_COLS,
input_cols=FEATURE_COLS
)
)
]
)

pipeline.fit(train_data)

This ran in 1 minute and 27 seconds. 1,080 models in just under one and a half minutes.

What’s essentially happening here is Snowflake is pushing this execution down to our XL warehouse and parallelizing this process…pretty awesome I think (though not as awesome as David Beckham’s right foot).

Best. Goal. Ever.

We can now get our optimal parameters along with the best accuracy:

pipeline.fit(train_data)

sklearn_hp = pipeline.to_sklearn()
optimal_params = sklearn_hp.steps[-1][1].best_params_
score_dict = {"best_accuracy": sklearn_hp.steps[-1][1].best_score_}

Let’s take a look:

{
"best_accuracy":0.7169657925000794
}

{
"learning_rate":0.05
"max_depth":3
"n_estimators":100
}

So now we have an accuracy of 71.6%!

Before we move on we will want to scale back down our warehouse to an XS with the same method we used before:

alter warehouse euro2024_wh set warehouse_size = xsmall

Model Training

Now we could take our best model from the HPO process and register that straight to the Model Registry, but I want to demonstrate model training with Snowpark ML as well.

Model Training looks similar to our HPO and again uses Snowpark ML, thus pushing down the compute usage to Snowflake’s virtual warehouses.

from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.metrics import *

train_data, test_data = df_training.random_split(weights=[0.8, 0.2], seed=0)

FEATURE_COLS = [c for c in train_data.columns if c != "GAME_OUTCOME" and c != "ID"]
LABEL_COLS = ["GAME_OUTCOME"]

pipeline = Pipeline(
steps = [
(
"scaler",
StandardScaler(
input_cols=FEATURE_COLS,
output_cols=FEATURE_COLS
)
),
(
"model",
XGBClassifier(
input_cols=FEATURE_COLS,
label_cols=LABEL_COLS,
max_depth=optimal_params['max_depth'],
n_estimators = optimal_params['n_estimators'],
learning_rate = optimal_params['learning_rate']
)
)
]
)

pipeline.fit(train_data)

Once we’ve executed this, we’ll save our model to the Snowflake Model Registry.

from snowflake.ml.registry import Registry

reg = Registry(session=session)

model_name = "EURO_24_GAME_PREDICT"
model_version = get_next_version(reg, model_name)

reg.log_model(
model_name=model_name,
version_name=model_version,
model=pipeline,
metrics={
'training_accuracy':training_accuracy,
'eval_accuracy':eval_accuracy
},
options={
'relax_version': False,
'embed_local_ml_library': True
}
)

m = reg.get_model(model_name)
m.default = model_version

At the end of this stage we have a working model, and we’re getting close to the best part of this post.

There’s always one more thing

Simplifying our Prediction Pipeline

This step involves creating several Python UDTFs and Stored Procedures, I’m not going to put all the code here but you can find the link to this below. The intent behind these python objects is to:

  1. Create some reusable code for the various stages of our tournament prediction pipeline, for example the data prep.
  2. Abstract the code for converting the model predictions into the game outcomes, and subsequently the next round of fixtures.
  3. Creating a “Penalty Shootout” function that will determine the outcome of the knockout stage should one of the matches result in a draw.

Let’s quickly talk about number (3) to clarify — we attempted to make a separate prediction model for penalty shootouts, which was even harder than match predictions. Anyone who has sat through a penalty shootout knows how unpredictable and painful they really are (looking at you England).

Therefore we opted to create a simple approach using the sigmoid function, and factoring in the rank delta to nudge the outcome in one direction or another — ultimately it comes down to a coin toss with some favor towards the higher ranked team:

def penalty_shootout(team_1, team_2, rank_delta) -> str:

boundary = 25

adjustment_factor = max(min(rank_delta, boundary), -boundary) * 0.05
sigmoid = 1 / (1 + np.exp(-adjustment_factor))
threshold = 50 + (sigmoid - 0.5) * 100

random_number = np.random.randint(1, 101)
if random_number >= threshold:
return team_1
else:
return team_2

So, to review all the functions and procedures created please review this notebook, but in summary we will be left with the following procedures that will be called in the prediction stage:

  • CALCULATE_R16_GAMES
  • CALC_KNOCKOUT_GAMES
  • PREP_PREDICTION_DATA
  • PROCESS_GROUP_PREDICTIONS
  • PROCESS_KNOCKOUT_PREDICTIONS

Predictions!

We got there. Finally we get to the fun part — thanks for hanging in there with me.

First we’re going to load our model from the Snowflake Model Registry, which will be used for every game prediction:

reg = Registry(session=session)

mv = reg.get_model("EURO_24_GAME_PREDICT").default

2 lines — simple!

Group Stage

So now we’re going to run our group stage procedures — this will create our prediction data, run predictions, and then convert those predictions into results where we’ll partition them by Qualifying Group:

session.call('prep_prediction_data',0,36)

pred_df = session.table('data_for_predictions').order_by('id')
prediction = mv.run(pred_df, function_name="predict_proba")
prediction = prediction.with_column(
'output_game_outcome',
F.iff(F.col('predict_proba_1') > win_threshold,1,0))
prediction.write.save_as_table(
'predictions',
mode='overwrite',
table_type='temp')

session.call('process_group_predictions')

Let’s just take a second to see how easy it is to run the predictions — the actual model inference is 1 line of code:

prediction = mv.run(pred_df, function_name="predict_proba")

(The rest was doing some wrangling specifically to this pipeline to convert the prediction outputs to actual match outputs).

That’s it! And what else is equally as awesome? We can call use our models in SQL code too, so our Data Analyst friends can use the models created by Data Scientists:

WITH SNOWPARK_ML_MODEL_INFERENCE_INPUT AS (
SELECT
*
FROM
data_for_predictions
ORDER BY "ID" ASC
),
MODEL_VERSION_ALIAS AS MODEL EURO2024.PUBLIC.EURO_24_GAME_PREDICT

SELECT
*,
MODEL_VERSION_ALIAS!PREDICT_PROBA(
ID,
NEUTRAL,
TEAM_1_GOAL_DIFF,
TEAM_1_TTL_WINS,
TEAM_1_TTL_LOSSES,
TEAM_2_GOAL_DIFF,
TEAM_2_TTL_WINS,
TEAM_2_TTL_LOSSES,
TEAM_1_VS_TEAM_2_RANK
) AS TMP_RESULT
FROM
SNOWPARK_ML_MODEL_INFERENCE_INPUT

Ok, back to the competition — we get these results for the Group Stage:

England win all 3 games, so it must be accurate.

Round of 16

Calling our procedure to calculate the next fixtures, gives us the following setup for the round of 16:

session.call('calculate_r16_games')
session.table(f'fixture_{user_name}') \
.filter(F.col('"ROUND NUMBER"') == 'Round of 16') \
.order_by('"MATCH NUMBER"')

And we call a similar block of code to run these predictions:

session.call('prep_prediction_data',37,44)

pred_df = session.table('data_for_predictions').order_by('id')
prediction = mv.run(pred_df, function_name="predict_proba")
prediction = prediction.with_column(
'output_game_outcome',
F.iff(F.col('predict_proba_1') > win_threshold,1,0))
prediction.write.save_as_table(
'predictions',
mode='overwrite',
table_type='temp')

session.call('process_knockout_predictions')

And the results are in!

Belgium are through on penalties — so we needed to leverage our penalty function for this, nice!

Quarter-Finals

So, similar code as before to calculate the Quarter-Finals:

session.call('calc_knockout_games','Quarter Finals')
session.table(f'fixture_{user_name}') \
.filter(F.col('"ROUND NUMBER"') == 'Quarter Finals') \
.order_by('"MATCH NUMBER"')

And then we run the predictions again (same as before, but we just change the fixtures ID parameters in the prep_prediction_data function):

session.call('prep_prediction_data',45,48)

pred_df = session.table('data_for_predictions').order_by('id')
prediction = mv.run(pred_df, function_name="predict_proba")
prediction = prediction.with_column(
'output_game_outcome',
F.iff(F.col('predict_proba_1') > win_threshold,1,0))
prediction.write.save_as_table(
'predictions',
mode='overwrite',
table_type='temp')

session.call('process_knockout_predictions')

France who are one of the favorites are out on penalties. As are Italy — sorry Matteo 😂

Semi-Finals

The penultimate round…

session.call('calc_knockout_games','Semi Finals')
session.table(f'fixture_{user_name}') \
.filter(F.col('"ROUND NUMBER"') == 'Quarter Finals') \
.order_by('"MATCH NUMBER"')

And then the predictions:

The Final…

So here we are, the Euro 2024 Final — July 14th in Berlin.

England versus Portugal.

Out final code execution will determine the winner…we’ll skip the fixture creation since we know what that will be. So next we just need to execute our procedures to run predictions and calculate the winner:

pred_df = session.table('data_for_predictions').order_by('id')
prediction = mv.run(pred_df, function_name="predict")
prediction.write.save_as_table(
'predictions',
mode='overwrite',
table_type='temp')

session.call('process_knockout_predictions')

Well there you have it: England are going to win Euro 2024! 🏆🏴󠁧󠁢󠁥󠁮󠁧󠁿

Thank you Peter Crouch

And here’s the predictions collated into a bracket for the knock-out stages:

Summary

Aside from the awesome outcome for my home nation, I hope this was a good demonstration of what we can now do with Snowflake ML and Snowpark;

And all this ran without data ever leaving the boundary of a Snowflake account — zero data movement.

Finally, please do check out the Quickstart and take a look at the docs for all the features used in this post (linked above).

A final note … I know what you’re thinking. But yes, I only ran this once for this article — I think we can all agree however, the only questionable outcome here is England winning 2 penalty shootouts.

In the Hands-on-Lab that Matteo and I delivered at Snowflake Summit, I think Portugal won, and France have come through several times in testing…so if I could pick the winner from 3 teams it would be:

  • England 🏴󠁧󠁢󠁥󠁮󠁧󠁿
  • France 🇫🇷
  • Portugal 🇵🇹

And with that, I leave you with my final GIF and probably the most important message from this post:

--

--