Predicting chess games results using LightGBM

Aluna Rizzoli, PhD
14 min readOct 16, 2023

--

In this project we train multiple ML models to predict results of chess games. In particular, we consider games played on Lichess, the second biggest website for playing chess, and one of the most successful open source projects. After gathering and cleaning the data, we use the gradient boosting framework LightGBM to train our ML models. Finally, we assess the models and highlight how they pick up some nice patterns of the data.

This project started with a simple question. If I am playing a game of chess with 3 minutes on the clock, against someone with a similar rating to mine, but with a much lower bullet rating (rating for even faster games); do I have any advantage? I had recently broken the 2500 for my bullet rating (see my progression here), and my other ratings were considerably lower… We shall see that the answer to the question is (on average) yes.

When I am not playing chess, I am doing research in pure mathematics, as a Collaborateur Scientifique at EPFL. You can read more about my research here. Special thanks to my friend Balys for his help and support!

Contents

FIDE rating chart taken from ratings.fide.com. Can you recognise the player?

Ratings and rating systems

All chess federations and chess websites use a rating system to rate players’ strength. Players generally start from a specific rating, and then move up and down according to their performance against other players. The International Chess Federation (FIDE), uses the old-school Elo rating system (named after Arpad Elo). In the chart above, we have the official Elo ratings of arguably the greatest chess player of all time, Magnus Carlsen. There are separate ratings for different classes of time controls, a recurrent practice for chess rating systems. The standard rating (in blue) is for games with more than 1 hour on the clock per player, the rapid rating is for games with around 15 to 25 minutes per person, and the blitz rating is for faster time controls. In online chess there are additional classes of time controls, bullet for games with around 1 minute on the clock, and hyper-bullet for even faster games.

Rating changes for Carlsen’s games at the 2023 World Cup

Magnus Carlsen recently won the 2023 World Cup, a 128-players direct elimination tournament where each player started with 90 minutes on the clock (+30s increment per move). Despite scoring 10.5 out of a possible 14 points, Magnus only gained 3.4 Elo rating points, the cost of being the highest rated player in the world. Indeed, winning against the young German prodigy Vincent Keymer only gained Magnus 3.1 Elo rating points, while losing cost him 6.9. In general the Elo rating change is simply a linear function of the difference in ratings, with some adjustments for extreme differences. The Elo rating system can potentially be gamed by competing only against players rated around 200 points higher (see this article). A comprehensive analysis of the Elo rating system and some of its shortcomings can be found in this paper by Professor Mark E. Glickman. Prof. Glickman developed the new rating system Glicko (with the newer version called Glicko2), which is used by the two biggest online chess websites, chess.com and lichess.org.

The Glicko system is fairly sophisticated, and it is great for having players’ ratings rapidly converge to their actual strength. In its Lichess implementation, everyone starts with a rating of 1500, a rating deviation of 350 and a rating volatility 0.06. The more you play, the closer the rating deviation gets to its lower bound of 45. A high rating deviation is an indication of uncertainty in the value of the rating, and it produces faster rating changes. The following function is taken from the Glicko2 paper.

import math
# Function to calculate the rating change for player1,
# with rating mu1 and rating deviation phi1, after result
# res against player2 with rating mu2 and rd phi2.
def ratingChange(
mu1: float,
mu2: float,
phi1: float,
phi2: float,
res: float,
sigma: float,
) -> float:
# Convert ratings and deviations to different scale.
multiplier = 173.7178
mu1 = (mu1-1500)/multiplier
mu2 = (mu2-1500)/multiplier
phi1 = phi1/multiplier
phi2 = phi2/multiplier
return (
# phiPrime returns a new estimate for phi1,
# while E(mu1, mu2, phi2) returns the expected score
# sigma is the rating volatility, generally equal to 0.05.
multiplier * phiPrime(mu1, mu2, phi1, phi2, sigma) ** 2
* g(phi2)
* (res - E(mu1, mu2, phi2))

# Function g used above.
def g(x: float) -> float:
return 1.0 / (math.sqrt(1 + 3 / (x**2 * math.pi**2)))

Let’s look at a couple of examples of rating changes.

# first game
print(ratingChange(2431,2350,45,45,1,0.05))
>>> 4.108348274115928

print(ratingChange(2350,2431,45,45,0,0.05))
>>> -4.108348274115928

# second game
print(ratingChange(2431,2350,100,45,1,0.05))
>>> 17.252918569132422

print(ratingChange(2350,2431,45,100,0,0.05))
>>> -4.029581010226979

In the first game, both players have the minimum possible rating deviation of 45, meaning that their rating changes are going to be as small as possible. A win for the first player gains them 4.1 rating points, corresponding to the points lost by the second player. In the second game, the rating deviation of the first player is much higher, therefore their rating changes more quickly, and the same result nets them 17.3 rating points. At the same time, the second player loses slightly fewer points, as their opponent’s rating was more uncertain.

Getting and cleaning the data

Every month around 100 million games of chess are played on the website Lichess.org. The monthly databases are available for anyone to download at https://database.lichess.org. A sample game (without the moves played) comes as a collection of consecutive rows, as follows:

Relevant to recent performance, it is possible to query the Lichess API for the following user information. Note however that given the amount of data, we do not rely at all on the API, but rather leverage the database to get the needed data-points.

One of the challenges consists in determining the players’ rating deviation using just the games database. In order to do this, we will use the rating change information that is given in the database for every game. We have seen earlier how this is a function of the players’ ratings, rating deviations, and game result. Therefore we can use numerical methods to estimate the rating deviations of each player.

import numpy as np
from scipy.optimize import fsolve

# We estimate the rating deviations for white and black using the rating
# change. See http://www.glicko.net/glicko/glicko2.pdf for the details
def estimateDeviations(
whiteRating: int,
blackRating: int,
whiteRatingDiff: int,
blackRatingDiff: int,
result: float,
): -> int
# We are going to make an assumption on the volatility sigma,
# which is correct for active players.
sigma, multiplier = 0.05, 173.7178

# step 1
cur_mu1 = whiteRating / multiplier
cur_mu2 = blackRating / multiplier
diff_white = whiteRatingDiff / multiplier
diff_black = blackRatingDiff / multiplier

# The following function outputs the rating change as a function
# of the two rating deviations,
# with the observed rating difference subtracted.
def functionToSolve(x):
return [
ratingChange(
cur_mu1, cur_mu2, x[0], x[1], result, sigma
)
- diff_white,
ratingChange(
cur_mu2, cur_mu1, x[1], x[0], 1 - result, sigma
)
- diff_black,
]

# Numerically solve to find the estimates. We give the minimum
# deviations as initial guesses,
# as that is what most active players have.
root = fsolve(
functionToSolve, [45 / multiplier, 45 / multiplier]
)

return [max(45, round(r * multiplier)) for r in root]

The first step towards having data to train our models, is to decompress the .zst file size. In this article we are using the latest available month, i.e. September 2023. The compressed size is around 30GB, which becomes 210GB when decompressed. We stream-decompress the file line by line, keeping only the information that we want. In particular we shall get rid of the information about which moves were played. The following python code shows how to do this, where the function updateGame saves the information that is being read in an appropriate data frame.

import zstandard
# stream decompress the monthly database line by line
with open(database_path, "rb") as fh:
dctx = zstandard.ZstdDecompressor(max_window_size=2147483648)
stream_reader = dctx.stream_reader(fh)
text_stream = io.TextIOWrapper(stream_reader, encoding="utf-8")
# use tqdm to get a progress bar.
# This file had around 2 billion lines to read
for line in text_stream:
current_game = updateGame(current_game)

We only keep games that are classified as bullet or blitz games. There are two main reasons for doing this. Firstly, these types of games with little time on the clock account for more than 80% of all games played. Secondly, many players rarely play slower time controls like rapid, and therefore have a provisional rapid rating (rating deviation > 150).

Games classification, taken from Lichess.org

Of these games we only keep rated games (as opposed to casual), that are not part of some tournament, as these have special rules. To each game, we add information about the players’ current losing or winning streaks. To do this, we keep a dictionary of players names for each time control, which we update with their winning or losing streak.

import pandas as pd
import sys
import statistics
from tqdm import tqdm
import numpy as np

# Read database.
df = pd.read_csv(database_path, nrows=number_of_rows)

# Initialise streaks dictionaries.
winningStreakDictBullet = {}
losingStreakDictBullet = {}
winningStreakDictBlitz = {}
losingStreakDictBlitz = {}
# Initialise all streaks at 0.
df["WhiteWS"] = 0
df["WhiteLS"] = 0
df["BlackWS"] = 0
df["BlackLS"] = 0

# Function to update the winning and losing streaks.
# It works because the games are already listed in chronological order.
def updateStreaks(df, WstreakDict, LstreakDict, row, index):
dics = [WstreakDict, LstreakDict]
dicsLabels = ["WS", "LS"]
colors = ["White", "Black"]
# Update the data frame by adding the pre-game streak if available.
for i in [0, 1]:
dic = dics[i]
for color in colors:
if row[color] in dic:
df.at[index, color + dicsLabels[i]] = dic[
row[color]
]

#Cycle thorugh the df row by row, using a progress bar.
for index, row in tqdm(df.iterrows()):
if row["Bullet"]:
# Update streaks for bullet games.
updateStreaks(
df,
winningStreakDictBullet,
losingStreakDictBullet,
row,
index,
)
else:
# Update streaks for blitz games
updateRDs(df, blitzRDDict, row, index)
updateLastGames(df, blitzLastGamesDict, row, index)
updateStreaks(
df,
winningStreakDictBlitz,
losingStreakDictBlitz,
row,
index,
)

We then do the same for players’ rating deviations as well as players’ lists of rating changes over the last 12 games. Instead of using the last-12-games rating change as is, we normalise the list of rating changes so that the result is comparable to other players, even if rating deviations are different.

# Compute the expected rating change if the player faced themselves.
# We shall use this as a normalization constant.
df["BlackNormalizer"] = df.apply(
lambda x: ed.ratingChange(
x["BlackElo"],
x["BlackElo"],
x["BlackRD"],
x["BlackRD"],
1,
0.05,
),
axis=1,
)

# Change the list of the last 12 games rating changes to a normalised sum.
df["Black12"] = df.apply(
lambda x: round(
sum([n / x["BlackNormalizer"] for n in x["Black12"]])
),
axis=1,
)

We then remove games where players have provisional ratings (rating deviation > 150). These are rating that show up on Lichess with a question mark. For each bullet game, we add the mean blitz rating of the players involved, if available. Otherwise we remove the game. Do vice-versa for blitz games.

# Define function to construct dict of players names with their respective
# mean rating, in either blitz (IsBullet = 0) or Bullet (IsBullet = 1).
def getMeanRatingByPlayer(df, IsBullet):
dfIsBullet = df[df["Bullet"] == IsBullet]
whiteDict = (
dfIsBullet.groupby("White")["WhiteElo"]
.agg(list)
.to_dict()
)
blackDict = (
dfIsBullet.groupby("Black")["BlackElo"]
.agg(list)
.to_dict()
)
completeDict = {
key: whiteDict.get(key, []) + blackDict.get(key, [])
for key in set(
list(whiteDict.keys()) + list(blackDict.keys())
)
}
return dict(
[
(player, round(statistics.mean(completeDict[player])))
for player in completeDict
]
)
# Save dictionary of mean ratings for every player in both time controls.
bulletDictMean = getMeanRatingByPlayer(df, 1)
blitzDictMean = getMeanRatingByPlayer(df, 0)

# Only keep games where both players appeear in both lists
df = df[
(df["White"].isin(bulletDictMean))
& (df["White"].isin(blitzDictMean))
]
df = df[
(df["Black"].isin(bulletDictMean))
& (df["Black"].isin(blitzDictMean))
]

# Add blitz and bullet ratings to every game in df.
# Keep the actual game ratings for the time control corresponding
# to the game played. (Same for white and blitz games).
df["BlackBulletRating"] = df.apply(
lambda x: x["BlackElo"]
if x["Bullet"] == 1
else bulletDictMean[x["Black"]],
axis=1,
)

We started with 93,218,629 games. At the end of the process we are left with a data frame with 34,339,878 games. In the following table we list the data frame size after each step that filtered out some games.

The resulting data frame has the following columns.

We can plot the distribution of the ratings in our processed database. The following plots were generated using seaborn. Note for example how Blitz ratings skew higher. This could be due to a different composition of the player pool.

Bullet vs Blitz rating distributions

We can also visualize the players’ difference in bullet and blitz rating. On average the two ratings are the same, but there is a considerable number of players that are much stronger in one of the two time controls.

Density plot of each player’s difference in bullet and blitz ratings

Training and assessing the models

We got and cleaned the data. We shall now proceed to train our ML models, taking inspiration from this article predicting results of football games. The main tool we are going to use is LightGBM, an open-source distributed gradient-boosting framework for machine learning. It is developed by Microsoft, it is based on decision tree algorithms and used for ranking, classification and other machine learning tasks.

We begin by splitting the data frame into three disjoint susbsets. One for training, one for validation and one for later testing.

import pandas as pd
import sys
# read the database with all features
df = pd.read_csv(database_path)

# split data frame for training, validation and future testing
df_train = df.sample(frac=0.7)
df_test_validate = df.drop(df_train.index)
df_validate = df_test_validate.sample(frac=0.3)
df_test = df_test_validate.drop(df_validate.index)

The following function trains a model according to a selected list of features in the data frame. Note that we did a good amount of hyper-parameter tuning for the most complicated model, and we are here using those parameters to train also the simpler ones.

import lightgbm as lgb

# define function to train model according to certain features
def trainModel(features):
# create training input components
y_train = df_train["Result"]
y_validate = df_validate["Result"]
y_test = df_test["Result"]
X_train = df_train.filter(features, axis=1)
X_validate = df_validate.filter(features, axis=1)
X_test = df_test.filter(features, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(
X_validate, y_validate, reference=lgb_train
)

params = {
# multiclass since we have three possible results
"objective": "multiclass",
# parameter for model performance evaluation
"metric": "multi_logloss",
# performed some hyper-parameter tuning for the final model,
# which indicated the following as good parameter choices
"num_leaves": 195,
"learning_rate": 0.1,
"bagging_fraction": 0.8,
"bagging_freq": 5,
"verbose": 0,
# number of classes
"num_class": 3,
}

# train
gbm = lgb.train(
params,
lgb_train,
num_boost_round=300,
# use the evaluation dataset together with an early stopping round
valid_sets=lgb_eval,
callbacks=[lgb.early_stopping(stopping_rounds=20)],
)
# output is the model as an lgb object
return gbm

The simplest model we can train just uses the individual ratings of the players. We add-in the difference as an extra feature.

simpleModel = trainModel(["WhiteElo", "BlackElo", "RatingDiff"])

This simple model has a precision of 0.540, and a multi-logloss of 0.801. We plot the feature importance and confusion matrix. Note that most results are wins and losses, so draws are rather rare. For this reason the model is not far from being a binary classification model. At the same time, the data is very noisy, as game results are distributed with high variance, by nature of the quick time control used for the games.

Feature importance for model with 3features (as listed)

This simple model is already better than Glicko2 at predicting the game result. Glicko2 expectation can be calculated with the following functions, from the original paper.

import math
# function to calculate expected score for white
# inputs are the white rating, black rating, and corr. rating deviations
def GlickoE(WR, BR, WD, BD):
r1 = (WR - 1500) / 173.7178
r2 = (BR - 1500) / 173.7178
d1 = (WD) / 173.7178
d2 = (BD) / 173.7178

def g(x):
return 1.0 / math.sqrt(1 + 3 * x**2 / (math.pi) ** 2)

A = g(math.sqrt(d1**2 + d2**2)) * (r1 - r2)
return 1.0 / (math.exp(-A) + 1)

Glicko2 does not differentiate between white and black, so to give it a fair chance of competing against the other models, we adjust the expectation by skewing it towards white, as appropriate. Overall, the player with white scores 3% higher on average, in this dataset .

We list the games sorted by RatingDiff, and split them in 20 bins. For each bin, we calculate the Glicko2 average expected score, as well as the simple model’s average expected score. We plot them, comparing them to the actual observed results.

Average game result expectation per bin, sorted by RatingDiff

We can see that the simpleModel (E3f in the plot), precisely follows the actual observed values, while Glicko2 over-estimates the advantage of the strongest player. This does not mean that Glicko2 is not doing what it is supposed to do. Indeed, Glicko2 is meant to provide a rating system where players’ ratings converge to their true strength, and it achieves this even though it over-estimates the advantage of the strongest player. Our models are not at all a substitute for the rating system itself, as they do not provide a reliable way to implement rating changes. In fact, they mostly use the Glicko ratings themselves, in order to predict the score.

We train another two models, one using 7 features, and the most powerful one, using 15 features. We’ll see how they pick up some nice patterns. For both of them we give the confusion matrices and plot the feature importance.

Feature importance for model with 7 features (as listed)
Confusion matrix for model with 7 features
Feature importance for model with 15 features (as listed)
Confusion matrix for model with 15 features

Notice how in the model with 15 features, the (normalised) performance over the last 12 games is a more important feature than the winning and losing streaks. Furthermore, the exact time control is a relevant feature. The models do not differ greatly in accuracy and multi-logloss, however even small improvements are relevant in such a high-variance data-set.

Even when playing a bullet game, we shall see how the blitz ratings play a role. For this analysis, we compare the model with 3 features (which only has one set of ratings), to the model with 7 features (which includes both blitz and bullet ratings). We sort the data by BlitzRatingDiff, and compare the average expectations of 20 different bins.

Average game result expectation per bin, sorted by BlitzRatingDiff

Notice how the simple model underestimates the chances of players with higher blitz ratings than their opponents, and vice-versa. The model with 7 features does a great job at tracking the actual results.

We shall now see how the performance over the last 12 games is an indicator of good form. Players that have been winning tend to keep winning, and vice-versa. The model with 15 features picks up on this pattern.

Data split by lower bound on value of White12

We conclude with the same analysis of winning streaks (note that similar considerations are valid for losing streaks). In the following graph we split the data by minimum length of the white player’s winning streak. Long winning streaks are an indication of better future performance, which cannot be accounted for using the features in the E7f model

Data split by lower bound on value of WhiteWS

--

--

Aluna Rizzoli, PhD

I am a Postdoctoral Researcher in pure mathematics @EPFL. Previously @Cambridge and @Imperial College.