Playing Fantasy Premier League with Linear Programming and Genetic Algorithms

6 min readOct 31, 2021

Data analytics in sports is today a consistent subject helping athletes, teams and clubs to optimise performance. Football is no exception; the Premier League and other third-party services provide public and premium data sources for data analysts. The Fantasy Premier League competition considers actual football athletes' performance and respective teams and assigns a score for each footballer. The total score for the player is mostly a combination of all his team footballers' points. Competitors can build imaginary football squads for each game week, transacting players if necessary. To be competitive, players need to spend a significant amount of time each week assessing football athletes and teams' performance to make decisions on squad picking for the next game. Automating necessary insights and using intelligent methodologies to support player selection for each game can save the players many hours analysing data and media. This article suggests Linear Programming and Genetic Algorithms as team recommenders engines to build potential Fantasy Premier League teams. Both consider all players’ performance in recent Premier League games and pick the best possible squad considering the competition constraints.

Methodology

Before describing the methodology used, I want to thank Vastaav Anand for his incredible work on cleaning the Premier Fantasy League API data and providing clean datasets.

This process source code is available on the GitHub repository FPL_LP_GA and consists of a mlflow experiment on a Jupyter notebook running in Google Colab (https://colab.research.google.com/).

As Figure 1 shows, the first step of this process is simply importing the data and doing some cleansing operations.

It is a goal of this approach to have recommendations on top-performing teams. I assumed that it would not be possible to have top-performance teams with players performing under the average. As represented in Figure 2 for each position Goal Keeper, Defender,…, the average score is calculated for all players who scored points during the selected time window. Only the players with a score above their respective average are included in the algorithm dataset input. This reduction of the data set size will also make the computation cheaper and faster.

All the linear programming and genetic algorithms execution run in the source code will log metrics, parameters and artefacts in the mlflow experiment to make it easier to compare results and other performance factors.

The last step of the main script opens an HTTP tunnel to the mlflow web app. This tunnel is necessary if the port is not open to an external network like in the case of the Google Colab machine. If you are running in your local machine or from a databricks environment, this step is unnecessary.

Linear Programming

For this problem, the Linear Programming optimisation formulation can be according to Figure 3. It’s recommended to have potential team candidates and not look only at a unique optimal team candidate. The budget limit was a parameter for several execution runs. Naturally, the execution with a maximum budget limit will provide the best possible team; however, squads with lower budget limits could provide potential cheaper and affordable suggestions for future transactions.

Figure 3 — Optimization problem formulation

Like the article (Linearly) Optimising Fantasy Premier League Teams by Joseph O’Connor, the PuLP Python library is also used in this process. So if you want to have a more detailed explanation of this library applied to this problem, it is recommended to read this article.

Genetic Algorithms

Genetic algorithms were largely used before the neural networks boomed and are still primarily used for machine learning models' hyperparameters tuning. There is a lot of public documentation explaining the theory and supporting maths when these algorithms perform an optimisation process. However, in this article, a more intuitive approach is used. A population of individuals are, in fact, a population of random candidate teams. As represented in figure 4 from the initial population, in blue, an iterative process, in green, will start to exchange players between the best performer teams to build even stronger teams potentially.

After each iteration, a new generation of teams will emerge with potential stronger team candidates. For simplicity reasons, the score function will be the same as the linear programming problem exposed in figure 3. The population size is one of the algorithm’s parameters. The larger the population size, the more expensive the computation will be. The number of iterations corresponds to the number of generations. Just like many optimisation algorithms, a stoppage criterion is an important tool. For simplicity reasons, the maximum number of generations/iterations and a maximum number of generations/iterations without improving the best score are configured as stoppage criteria.

According to the score function, the best teams are selected as parents for the following generation of teams (Figure 6). The number of potential parents is also a parameter.

After selecting the best parent teams, randomly selected footballers are exchanged, as represented in Figure 7, according to a cross-over ratio parameter. Large values of this parameter will make the process unstable, and small values will make the algorithm converge slowly to the optimal solution.

The final step before completing the next-generation computation is the mutation of the current generation (Figure 8). It must be taken into consideration that this is a random process, and there is no guarantee the current solution is converging to a globally optimal solution. For this reason, footballers will be randomly exchanged according to a mutation ratio parameter to try an even better solution in the solution domain.

During the algorithm execution, there is a monotonic evolution of the score (Figure 9). The algorithm stops its execution if the number of maximum iterations is achieved or the maximum number of iterations without improvement is achieved.

Comparing Results

Log all execution metrics, parameters and artefacts allow a straightforward comparison of the performance of each execution and confirms the output for safer decision making. mlflow is currently one of the best open-source tools for this purpose. Figure 10 shows all the execution's respective best scores. Linear Programming with a maximum budget limit top performs on scoring and the execution duration. Some potential squad candidates are represented in Figure 11.