Machine learning made easy

Rating and Predicting Football with Recurrent Neural Network and JAX

An optimized Elo rating

octosport.io

Published in

Geek Culture

12 min readSep 9, 2021

Introduction

When it comes to talking about sports or competition, the rating is always brought to the table. We always want to rank and measure how a player or a team compares to each other. This is exactly what rating does.

Over the last decades, many rating methods have emerged. For instance, Kenneth Massey¹ proposed a rating method for NCAA college football and uses linear regression under constraints over point spread. Recently Maystre² et al. (2019) proposed a Bayesian approach to pairwise comparisons in sports where the rating follows a time dynamic specified with a kernel function. Microsoft is using TrueSkill³, a rating system developed by Tom Minka to find players with similar skills to play together in online video games.

Among all these methods, there is one that is always cited in football: the Elo rating, from the physicist Arpad Elo. While originally developed to rank chess players, the model is now widely used in sports and specifically to establish the FIFA World Rankings.

In this article, we will quickly present the Elo rating system and the maths behind it, to focus on implementation and usage in a prediction framework. We will implement the model in Python using the JAX framework designed for high-performance machine learning and neural networks. The python code and examples in this article can be found on GitHub and data were provided by Sportmonks. Our implementation can be used for any sport or competition for both rating and predicting.

The ELO rating model

As the name refers to, the model is designed to rate. It does not mean that we can’t use it to make predictions, but the main output will be ratings so let’s start with that.

The rating equations

The Elo rating system is actually fairly simple and requires few algebras. In what follow we will use football team rating as an example, assuming team A plays team B. There are only three possible outcomes for the teams: win, lose or draw.

Let’s write the main rating equation straight away. The rating update of a team is given by⁴:

Team rating update

The index t stands for the time of a finished event. Since the prediction is made before the match starts, p is estimated knowing the pre-match information (t|t-1). But, for sake of simplicity we will index p with t in the rest of the article.

The equation simply says that the new rating equals the old rating plus a coefficient K-factor (κ) times the difference between the outcome (y) and the prediction (p). Say the K-factor is 20, the outcome is 1 if the team wins, 0 if the team loses, and 0.5 otherwise. Assume the rating of team A was 2300, our prediction was 60% chance of winning and the outcome was 1, the new rating becomes 2308.

If the team was expected to win (p=1) and win the game (y=1), the rating does not change. On the other hand, if the team was expected to lose almost surely (p=0) but has won the game this is a surprise. In this case, the team rating will increase accordingly.

The Elo rating, although very simple, take into consideration the surprise effect, and the team level differences to update the team’s rating.

In most rating systems, the K-factor is fixed. This parameter controls the speed of the rating update. The higher the rate, the stronger and quicker the rating becomes. For instance, the FIFA Elo rating uses different K-factor depending on the match importance⁴. What is not fixed is the prediction. We could use external predictions and use them as an input for the rating. But these predictions can also be internalized, and that’s the second equation of the Elo rating system.

Rating allows us to compare team only if they have been trained simultaneously. You can not compare two ratings of two different models.

The prediction equation

Using the rating of two teams, we can make a prediction. For instance, in the FIFA Elo rating, the prediction is made using the following equation

FIFA Elo prediction for team A to win

Where the parameter ρ equals the rating difference between team A and team B. Now we could ask where does that equation come from. In fact, the numbers 10 and 400 have been fixed and can change from time to time depending on a committee. But from a data scientist's perspective, we can do better.

With a bit of algebra, we can turn the previous equation as:

The modified FIFA Elo prediction equation

It turns out the left part of the equation is the so-called logit function used in logistic regression⁵.

Classical Elo rating system can be seen as a logistic regression on teams rating.

We also observe that the ratio in front of ρ is just a number that scales the rating difference and corresponds to the beta of the rating. That number would preferably be learned from the data.

Enhanced Elo rating

In the previous part, we have seen that the Elo rating can be interpreted as a logistic model. Let’s define a modern version of the system. We start with a similar probability function for team A playing against team B. The probability that team A wins is now:

Probability for team A to win

First, we replaced the number 10 from the previous equation with the exponential (e) number. Second, we replaced 1/400 with a parameter β. Third, the parameter γ will allow us to control draws. As you can see the probability is now the sigmoid function.

Without a possible draw, the probability for team B to win will be exactly 1 minus the probability of A to win. But when a draw can occur, the probability for B to win becomes:

Probability for team B to win if a draw is possible

In this case, γ>0 and we no longer have the sum of the probabilities equal to one. Thus, we can deduce the draw probability:

The draw probability

The rating update stays the same but we make two changes. First, the K-factor will be a parameter learned by the model. Second, we change the update rule in case of a draw. In the classical version, when a draw occurs the update uses 0.5 as the outcome against the probability of winning. In case of a draw, we update the rating according to the difference in winning probability of the two teams. For instance, if team A winning probability is 80% and team B winning probability is 5% but a draw occurs, the rating of A (resp., B) will decrease (resp., increase) by the K-factor times 75%. If they had the same probability of winning the rating won’t change.

Now we know how to compute the ratings and the probabilities. The set of parameters of the model contains the initial team’s rating, the K-factor, and the sigmoid parameters for a total of n+3 parameters. All we need is data, a loss function, and the optimization algorithm. But before that, let's see how we can represent the full model as a recurrent neural network.

A recurrent neural network representation

If you are not interested in neural networks you can skip this part as it will not affect the comprehension of the rest of the article. We can make a link between the enhanced version of the Elo rating model and the sequence neural networks. One example of these networks architecture is the LSTM used for language translation.

The enhanced version of the Elo rating system can be seen as a recurrent neural network where the ratings are the hidden states.

To make it simpler let’s assume the draw result is not possible like in basketball so y is binary. We have n teams, playing together week after week. A match consists of team A playing against team B. We are predicting the probability of team A to win (y=1).

We can represent the rating sequence of all matches as a recurrent network. Let’s call a cell of this network an Elo cell. This cell will output a prediction and update the rating after each match. The rating of each team can be interpreted as the cell hidden state.

Each cell takes as inputs the team ratings and a vector x, both of the size of the number of teams. The vector x is a vector of 0 except for team A and team B which value respectively 1 and -1. The cell is divided into two steps: the prediction step that outputs the probability of team A winning and the update step cell that outputs the new rating using the binary variable y. The cell can be represented as follows

Since the rating of two teams is not affected by the rating of other teams, the network can be represented as a very long sequence where matches are time-ordered.

The initial state is also a set of trainable parameters that correspond to the initial rating of each team.

Here is what the network looks like:

The last cells out the ratings to be used for predicting next events. For sake of simplicity, we won’t use the network representation in our implementation although it is exactly the same optimization with the same loss and number of parameters.

A gradient descent approach using JAX

Learning the parameters of the model is not straightforward and computer expansive. Indeed predictions depend on the ratings which also depend on the previous predictions for all teams.

In order to help us, we are going to use jax , a python library developed by Google that can perform auto differentiation using numpy.

Numpy has to be imported from JAX with the following command import jax.numpy as jnp

JAX can be used for training neural networks and many other applications that require differentiation and fast computation.

The loss function we will optimize is the negative log-loss that we are going to gradient descent. The negative log loss gives a measure of the quality of the probability calculated by the model. Having a dataset of matches, the steps are to compute the average loss, find the overall gradient, update the parameters, and repeat. Assuming we know how to compute the average log-loss, getting the gradient is magically done in JAX as follow

from jax import grad
from jax import jit#average_log_loss is a function
grad_function = jit(grad(average_log_loss))

The jit function compiles the function for fast computation. Note that grad compute the gradient given the first argument which is the dictionary of paramameters in our case. Lists and dictionaries are well interpreted by JAX. Now, grad_function is a function that takes the same arguments as the loss function and will return the value of gradient for the parameters and the data.

We will perform the classical gradient descent. We define a gradient step size learning_rate and a max number of iteration max_step . Then the minimization is done in following the loop:

for i in range(max_iter):
    gradient = grad_function(parameters, training_data)
    for key, val in parameters.items():
        parameters[key] = val - learning_rate * gradient[key]

By the recurrent nature of the model, we need to compute the full sequence of matches to get the average log-loss. In JAX we can use lax.scan . The function is designed to loop efficiently while carrying information along. It takes three main arguments as inputs: the scan_function that is the function to execute in each iteration, the carry where the parameters can be stored and the data that contains the data to loop on. It returns the last carry and output which is a list of all scan_function outputs. For instance, If our dataset contains 100 matches, the scan_function will be executed 100 times and output will contain 100 outputs.

In our case, we loop over the matches and the all process is:

def scan_loss(params, data):
    carry = dict()
    carry["params"] = params
    carry["rating"] = params["init"]
    carry, output = lax.scan(scan_function, carry, data)
    return {
        "carry": carry, #contains last updated parameters ans rating
        "loss_history": output[0], #all losses
        "rating": output[1], #all ratings
    }def average_log_loss(params, data):
    scan_loss_result = scan_loss(params, data)
    return -jnp.mean(scan_loss_result["loss_history"])

Here the scan_function will update the ratings, compute the probabilities, and get the loss for one match. It is similar to process one Elo cell. All losses are then contained in the list output and the average log-loss is taken with the average_log_loss function.

An English premier league example

Let’s run an example using the English premier league (EPL). The dataset consists of all EPL matches between 2016 and 2021, for a total of 1129 events. The data will be divided into three sets: the training set (up to Jun 2019), the validation set (Jun 2019-Jun 2020), and the test set (Jun 2020-Sep 2021). The training set is used to compute the gradient and update the parameters. The validation set is used to find the number of optimal gradient steps. The gradient descent is stopped when the validation loss stops decreasing. The test set will be our hold-out set.

To train the model we need to first put the data in the correct format, split the three sets, and train.

import pandas as pd
import datetime
from octopy.elo import dataset
from octopy.elo import elo#get the data
data = pd.read_csv('https://raw.githubusercontent.com/octosport/octopy/master/data/epl.csv')#format the data
elo_data = dataset.EloDataset(valid_date= '2019-06-01',test_date= '2020-06-01',time= data['date'])#split the data
elo_data.split_train_test(data[['home','away']].values,data[['home_goals','away_goals']].values)#build the model and train
model = elo.EloRatingNet(elo_data.n_teams_)
model.optimise(elo_data,learning_rate=0.1)

The model stopped the training at iteration 1010. The path losses are shown in the next figure.

The validation loss is far away from the training loss meaning that the model is overfitting. Dealing with overfitting is another problem that we won’t cover here but we can use the same technics as in neural networks. However, the test loss is close to the validation loss which means the model has generalized pretty well. We end up with a test loss of 1.001 which is pretty good for this type of model.

Now, the interesting part of this model is the ratings. Ratings are now optimal because they are inferred from the trained model. The team’s ratings are shown below:

You can see that the ranking is more or less the same as an official Elo rating system for the EPL. The difference is now you can make predictions that are consistent with optimized ratings. For instance, below are given the predictions for Arsenal- Norwich City on Sep 11, 2021:

model.predict_proba('Arsenal','Norwich City')

Which gives 66.8% for Arsenal, 14.4% for Norwich City, and 18.8% for a draw. These numbers are consistent with the market-implied probabilities. The python code of this EPL example is available on GitHub.

Conclusion

In this article, we have shown how we can enhance the classical Elo rating systems. We made the connection with sequential neural networks which open the door to many improvements that are beyond the scope of the article.

The model presented can be trained on any sports where team or player match results are represented by two or three issues. Not only does the model outputs the current ratings and ranking but it is also capable to make consistent predictions out of the box.

All code used in this article can be found on GitHub.

References

[1] K. Massey, Statistical Models Applied to the Rating of Sports Teams, Bluefield College, 1997.

[2] L. Maystre, V. Kristof and M. Grossglauser, Pairwise Comparisons with Flexible Time-Dynamics, KDD 2019.

[3] R. Herbrich, T. Minka and T. Graepel, TrueSkill(TM): A Bayesian Skill Rating System, 2007, Advances in Neural Information Processing Systems 20, MIT Press.

[4] See World football Elo ratings.

[5] See logistic regression.