How to win at Fantasy Premier League using data — Part 1 — Forecasting with Deep Learning
EDIT: Updated player forecasts, team optimisation tool and team of the week can now be found each gameweek at https://twitter.com/solpaul7
With the start of August the Premier League juggernaut begins to cast its shadow on the British (and global) sporting calendar, drawing millions of players once again into the nine-month fantasy marathon.
As with all games that deal with uncertainty, the crystal balls are quickly deployed to divine the future and get the season off to a good start. Cue a host of experts telling you how Harry Kane hates August, where to find the best ‘differentials’, and the latest intelligence on a player’s new ugly girlfriend.
I prefer to use data to cut through the noise.
This season I’ll be sharing a number of tools that have helped my team finish in the top 1% in four of the last seven years, and comfortably in the top 10% in the other three. These are:
- Player score forecasts
- Team selection optimisation
- Look-ahead fixture strength visualisation
This first part will focus on building a solid view of expected player performance, the bedrock of team selection throughout the season.
I don’t trust myself. I’m too easily fooled. Even worse, when I make mistakes things sometimes turn out nicely, producing some not-so-nice habits. It’s a disaster. This is why I need an emotionless, bias-free guide to tell me straight: what can I expect from a player?
In other words, I need a well-built predictive model. Here are the steps I followed to build one:
- Get historical data: the hardest bit. Luckily, this repository on GitHub contains three years’ worth of FPL data (credit to vaastav, who has put together a fantastic resource)
- Build training set: some wrangling to get the above into the format I need, and add in some additional data points
- Train model: I’ve just finished part one of the fast.ai course (which is fabulous), so this was a great opportunity to try out their tabular learner (a neat way of applying deep learning concepts to non-vision/text problems)
- Forecast: finally, I built a dataset with every player’s 2019/20 season fixtures, and applied the model to obtain forecasts
I give a brief overview of the process below, but for more detail you can find the code to build the dataset, train the model, and generate forecasts in this GitHub repository.
I like to keep things simple to start with, so the current version of the model uses the following historical FPL data (one row for each player/gameweek):
I considered calculating trend data (e.g. the player’s recent form), but after watching a time series worked example on the fast.ai course I decided to leave it out for now. One of the stated advantages of neural networks over other approaches such as random forests or XGBoost is a reduced need for feature engineering. I like quick. I like simple. So I went with it.
I still felt uncomfortable about big off-season team changes though. Squads change drastically and entirely new teams are promoted into the league. To address this issue I took inspiration from FiveThirtyEight’s club soccer predictions methodology, bringing in team market value from Transfermarkt for the start of each season. So the following were also included in the model:
- Team market value at start of season (relative to all teams)
- Opposition market value at start of season (relative to all teams)
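I haven't spelled out what "relative to all teams" means here. One plausible reading, and it is only an assumption for this sketch, is each team's market value divided by the average across all teams in that season, so that 1.0 means a league-average squad. The team names and values below are made up for illustration.

```python
from collections import defaultdict

# Made-up market values (in £m), purely for illustration
market_values = [
    ("2019/20", "Team A", 900.0),
    ("2019/20", "Team B", 300.0),
    ("2019/20", "Team C", 300.0),
]

def relative_values(rows):
    """Scale each team's value by the season average (assumed definition of 'relative')."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for season, _team, value in rows:
        totals[season] += value
        counts[season] += 1
    return {
        (season, team): value / (totals[season] / counts[season])
        for season, team, value in rows
    }

relative = relative_values(market_values)
```

The same lookup can serve both features: one value keyed on the player's own team and one keyed on the opposition.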
I won’t go into the finer details of the neural network architecture and training, but I will mention the interesting use of the concept of embeddings. Here’s the model:
Those seven ‘embeddings’ at the top are for each of the seven categorical features (player, position, team, etc.). Each feature is represented by a vector of a certain length. For example, the first embedding represents each player (1,055 in total) with a vector of length 79. The vector is learned during the training process, and it can be thought of as representing features of that player. Somewhere in Mohamed Salah’s vector is encoded his ability to score lots of goals, among many other more abstract but useful characteristics.
(As an aside, this is a concept that lies at the heart of a common approach to personalised content, collaborative filtering. It’s very likely that companies like Netflix are representing you in the same way — but instead of your vector telling them about goal scoring prowess, it’s describing how much you like horror movies.)
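To make the embedding idea concrete, here is a minimal sketch in plain Python. An embedding is just a lookup table with one learnable row per category. The vector length of 79 for 1,055 players is consistent with fastai's default sizing heuristic, so the sketch assumes that default was used. The player-to-row mapping and the random initial values are hypothetical stand-ins for what training would produce.

```python
import random

def emb_sz_rule(n_cat):
    """fastai's default heuristic for embedding width given the number of categories."""
    return min(600, round(1.6 * n_cat ** 0.56))

random.seed(0)
n_players = 1055
emb_size = emb_sz_rule(n_players)  # 79

# One row per player, randomly initialised; training would adjust these values
player_embeddings = [
    [random.gauss(0.0, 0.01) for _ in range(emb_size)]
    for _ in range(n_players)
]

# Hypothetical mapping from player name to row index
player_index = {"Mohamed Salah": 0, "Harry Kane": 1}

def embed(player_name):
    """Look up the vector that represents a player."""
    return player_embeddings[player_index[player_name]]

salah_vector = embed("Mohamed Salah")
```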
These embeddings were combined with the continuous variables (minutes and team market values) and then fed into a standard neural network. After training (with lots of tweaking and checking), I ended up with a model to predict player scores each gameweek.
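The combining step can be sketched as follows: the embedding vectors and the continuous variables are concatenated into a single input vector, which then passes through ordinary fully connected layers. This is an illustrative toy, not the actual architecture. The vector lengths, layer sizes, feature scaling and random (untrained) weights are all assumptions, so the output number is meaningless.

```python
import random

random.seed(0)

def dense(inputs, weights, biases):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * w_row[j] for x, w_row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def random_matrix(n_in, n_out):
    return [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

# Hypothetical embedding vectors for one player/gameweek row (lengths shortened)
player_vec = [0.2, -0.1, 0.4]
position_vec = [0.7, 0.0]
# Continuous variables: minutes, team market value, opposition market value
continuous = [90 / 90, 1.8, 0.4]

x = player_vec + position_vec + continuous  # concatenated network input

w1, b1 = random_matrix(len(x), 4), [0.0] * 4
w2, b2 = random_matrix(4, 1), [0.0]

hidden = relu(dense(x, w1, b1))
predicted_points = dense(hidden, w2, b2)[0]  # untrained, so not a real forecast
```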
Armed with a model, I was ready to create the first forecast for the season ahead. Here I encountered a problem: the model was trained knowing how many minutes each player played each week, which is not something we know in advance. For now I got around this by assuming each player will play the full 90 minutes. By no means perfect, but on this basis, here is gameweek 1 ordered by projected points:
Not bad. Salah at the top makes sense: he has been the top scorer in the game two years running, and is playing at home to newly promoted Norwich. However, the 90-minute assumption is clearly causing some issues. For example, Phil Foden is not a regular starter for Manchester City. If he is unlikely to start then this forecast is far too optimistic. The same could be said for Origi and perhaps, to a lesser extent, Son. In the past these players have been very efficient at scoring in limited minutes, so the model is rightly projecting high scores for them, but to improve the predictions we clearly need some way to introduce information about likely minutes.
As an aside, Guardiola just described Foden as “the most talented player” he’s ever worked with (a group of players that includes Lionel Messi!). It’s encouraging, with some caution, to see the model recognising this talent based on data alone.
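For reference, the 90-minute assumption amounts to something like the sketch below: build one row per player per fixture and fill in the unknown minutes with a fixed value. The field names and the `build_forecast_rows` helper are hypothetical, not the code from the repository.

```python
def build_forecast_rows(players, fixtures, assumed_minutes=90):
    """Create one row per player per fixture, with minutes fixed at an assumed value."""
    rows = []
    for fixture in fixtures:
        for player in players:
            if player["team"] == fixture["home"]:
                opponent, was_home = fixture["away"], True
            elif player["team"] == fixture["away"]:
                opponent, was_home = fixture["home"], False
            else:
                continue  # this player's team is not involved in the fixture
            rows.append({
                "player": player["name"],
                "team": player["team"],
                "opponent": opponent,
                "was_home": was_home,
                "gameweek": fixture["gameweek"],
                "minutes": assumed_minutes,
            })
    return rows

players = [
    {"name": "Mohamed Salah", "team": "Liverpool"},
    {"name": "Divock Origi", "team": "Liverpool"},
]
fixtures = [{"gameweek": 1, "home": "Liverpool", "away": "Norwich"}]

rows = build_forecast_rows(players, fixtures)
```

Keeping `assumed_minutes` as a parameter means the same function can later take a more realistic figure once likely line-ups are known.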
In the next few days the likely gameweek 1 starting line-ups will take shape. I’ll be using these to address the minutes issue. In the meantime, we now have a set of forecasts, but so what? In Part 2, I show how to use these forecasts to create optimal team selections.