# Linearly Optimising Teams for PL Fantasy League

## Predict your own optimised FPL Teams with player data from PL using Linear Programming on Google Colab (IPython)

Football is one of the most popular sports in the world by number of spectators and participants. The rules are simple, and hence the sport can be played anywhere, from football fields to parks to your own street.

While it is not possible for everyone to play the sport professionally, the federations have devised a few ways to get more people involved, whether through video games or through those addictive little applications on your phone called Fantasy Leagues.

# What are Fantasy Leagues?

A Fantasy League is a competition in team sports in which fans create what they think is the best team, based on players’ prices, form, talent and other criteria. The team that earns the most points, which depend on the players’ real performances in that gameweek, receives some kind of reward from the competition organizers.

Fantasy Leagues are a great way for non-professionals to showcase their talent for analyzing the sport, but they can also become really addictive, and even dangerous in leagues where real money is involved (especially in the case of unofficial Fantasy Leagues).

## Fantasy Premier League

Fantasy Premier League (FPL) is the fantasy league for the English Premier League in Football. It is the official Fantasy League for the competition and is managed by the Premier League itself.

It has a specific set of rules according to which the game is played and points are awarded. The prices of players are also decided by the Premier League itself, based on the players’ performances, form and other attributes.

Once a user registers for this league, they are given £100 million (virtual money) to spend on 15 players from the 20 teams that play in the Premier League that season, which is divided into 38 gameweeks.

Each player that they select in their squad is given points based on their performance in live matches which are held as per the Premier League fixtures. These points vary based on the individual player’s activity.

For example, points are earned by a player for:

- *Playing a match*
- *Scoring a goal*
- *Assisting a team mate who goes on and scores a goal*
- *Bonus Points if the player plays exceptionally well*
- *Keeping a clean sheet (not for strikers)*
- *Saving a penalty*

Points can be deducted for some or all of the reasons mentioned below:

- *Receiving a yellow or red card*
- *Conceding a goal (goalkeepers and defenders only)*
- *Missing a penalty kick*
- *Scoring an own goal*

# What is Linear Programming?

Linear programming (LP, also called linear optimization) is a method to **achieve the best outcome** (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by **linear relationships**.

One of the best things about Linear Programming is that as soon as you’ve written down a linear objective and constraints, you’re finished. All you need to do afterwards is to plug it into a solver and benefit from the results.
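To make the "write it down and plug it into a solver" idea concrete, here is a minimal sketch using `PuLP`. The problem is a made-up toy (the numbers have nothing to do with FPL), and it assumes PuLP's bundled CBC solver is available:

```python
from pulp import LpMaximize, LpProblem, LpVariable, PULP_CBC_CMD, value

# Toy problem (made-up numbers): maximise 3x + 2y under three linear constraints.
prob = LpProblem("toy_lp", LpMaximize)
x = LpVariable("x", lowBound=0)
y = LpVariable("y", lowBound=0)

prob += 3 * x + 2 * y      # the objective: an expression with no comparison
prob += x + y <= 4         # constraint 1
prob += x + 3 * y <= 6     # constraint 2
prob += x <= 3             # constraint 3

prob.solve(PULP_CBC_CMD(msg=False))  # msg=False silences the solver log
print(value(x), value(y), value(prob.objective))
```

The best corner of the feasible region is x = 3, y = 1, which gives an objective of 11; everything after the model definition is the solver's job.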

# What does FPL have to do with Linear Programming?

The selection of an FPL team turns out to be a constrained optimization problem. In this article, I transformed it into a linear programming problem by considering constraints such as price and squad composition, with the linear objective of maximizing the number of points received by the team.

These constraints were then used to produce the team that performs best according to the currently available data.

# Rules in FPL

FPL has many rules, but the major ones, and the ones considered for the purpose of this article, are:

- You are constrained to a budget of **£100.0 million** to choose players (where better players cost more money);
- You can pick a total of **15 players**, which should have **2 goalkeepers**, **5 defenders**, **5 midfielders** and **3 forwards** (the category of player is decided by FPL);
- Out of these 15, you have to select 11 players to play in a certain gameweek. This 11 should contain 1 goalkeeper, at least 3 defenders, at least 2 midfielders and at least 1 forward. *Only these 11 players will earn points*;
- Only a maximum of 3 players can be selected in the squad from a single team.

# Methodology

With all the rules in mind, I decided to select the best 15 players possible with regard to cost and the other constraints: that is, to buy the 11 best players and the 4 cheapest players (as substitutes don’t earn points). It is also important to note that I treat previous data as relevant for the prediction; since football is a game in which a player’s form matters, this seems like a fair assumption.

Now, the rules had to be converted into constraints, and the aim into a linear objective, to make this a linear programming problem. The objective function became to **maximize the number of points**, and the constraints were as follows:

- Select 15 players (Constraint 1)
- Select 2 goalkeepers (Constraint 2)
- Select 5 defenders (Constraint 3)
- Select 5 midfielders (Constraint 4)
- Select 3 forwards/strikers (Constraint 5)
- Spend no more than £100.0 million (Constraint 6)
- Not select more than 3 players from a single team (Constraint 7)

This clearly seems like a linear programming problem and hence I decided to proceed forward treating it as one.
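For readers who prefer symbols, the objective and the seven constraints can be written compactly. The notation here is introduced only for illustration: $x_i \in \{0, 1\}$ says whether player $i$ is picked, $p_i$ is the player's points, $c_i$ the price in £ millions, $G$, $D$, $M$ and $F$ are the sets of goalkeepers, defenders, midfielders and forwards, and $T_k$ is the set of players at club $k$.

$$
\begin{aligned}
\max_{x} \quad & \sum_{i} p_i x_i \\
\text{s.t.} \quad & \sum_{i} x_i = 15 && \text{(Constraint 1)} \\
& \sum_{i \in G} x_i = 2 && \text{(Constraint 2)} \\
& \sum_{i \in D} x_i = 5 && \text{(Constraint 3)} \\
& \sum_{i \in M} x_i = 5 && \text{(Constraint 4)} \\
& \sum_{i \in F} x_i = 3 && \text{(Constraint 5)} \\
& \sum_{i} c_i x_i \le 100 && \text{(Constraint 6)} \\
& \sum_{i \in T_k} x_i \le 3 \quad \text{for every club } k && \text{(Constraint 7)} \\
& x_i \in \{0, 1\} \quad \text{for every player } i
\end{aligned}
$$

Two small observations: Constraint 1 is already implied by Constraints 2 to 5 (2 + 5 + 5 + 3 = 15), and because each $x_i$ is restricted to 0 or 1, this is strictly speaking an integer (0-1) linear program rather than a purely continuous one.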

I decided to use Google Colab as it is efficient and fast. I also used the `PuLP` library for Python, as it allows us to write down an objective function and constraints in a very intuitive way and solve them instantly.

## Importing Data

Firstly, I imported the data for the players in FPL from the Fantasy Premier League website with the help of their API in JSON format from this **link**. Now, for using it with `Google Colab`, I had to use the `Pydrive` library of Python.

```python
!pip install pydrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the Colab user, then hand the credentials to PyDrive
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```

This snippet simply authenticates the user for Google Drive. A link appears in the output; the user has to open it and then paste the text it provides back into the space provided.

Next, upload the JSON file to your Google Drive and note its file id. The file id can be found by sharing the file by link: the link contains a string similar to “1f4ZbuRae1uQ6kkI2XMuon37ATWAKrsK8”. Copy it and use it as in the next snippet to add the file to your Google Colab.

```python
# Replace the id with the id of the file you want to access
downloaded = drive.CreateFile({'id': "1f4ZbuRae1uQ6kkI2XMuon37ATWAKrsK8"})

# Replace the file name with your own file name
downloaded.GetContentFile('data.json')
```

The JSON file has many fields of data for each player, some of which are useful to us and some of which are not.

## Extracting Players’ data

There were a number of features in the dataset. I cleaned up this data to keep only the features that are of use. I also concatenated the first and last names of the players into one column to identify players more easily, and I mapped each player’s position and team, from other areas of the original JSON file, to make a more comprehensive dataset.
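A minimal sketch of this clean-up step in plain Python is shown below. The key names (`elements`, `teams`, `element_types`, `now_cost` and so on) are assumptions about the layout of the downloaded file and should be checked against your own `data.json`; the two players are invented stand-ins.

```python
# Tiny stand-in for the downloaded JSON; the real file is far larger.
# All key names here are assumptions about the file layout.
data = {
    "elements": [
        {"first_name": "Alex", "second_name": "Keeper", "team": 1,
         "element_type": 1, "now_cost": 45, "total_points": 120, "minutes": 3000},
        {"first_name": "Sam", "second_name": "Striker", "team": 2,
         "element_type": 4, "now_cost": 110, "total_points": 200, "minutes": 2800},
    ],
    "teams": [{"id": 1, "name": "Team A"}, {"id": 2, "name": "Team B"}],
    "element_types": [
        {"id": 1, "singular_name": "Goalkeeper"},
        {"id": 4, "singular_name": "Forward"},
    ],
}

# Lookup tables built from the other areas of the file.
team_names = {t["id"]: t["name"] for t in data["teams"]}
positions = {p["id"]: p["singular_name"] for p in data["element_types"]}
keep = ["now_cost", "total_points", "minutes"]  # features worth keeping

players = []
for e in data["elements"]:
    row = {
        "name": f'{e["first_name"]} {e["second_name"]}',  # concatenated name
        "team": team_names[e["team"]],
        "position": positions[e["element_type"]],
    }
    row.update({k: e[k] for k in keep})
    players.append(row)

print(players[0])
```

With the real file, `data` would instead come from `json.load(open('data.json'))`, and the resulting list of rows can be passed straight to `pandas.DataFrame`.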

## Exploring the extracted data

I wanted to see what features were correlated to the total points a player earned in the season.

Here, it is clearly visible that the points are highly correlated to:

- Clean sheets: As clean sheets give players points, *players with more clean sheets have more points*.
- Goals conceded: Although one would not expect this, goals conceded also increase with the number of matches a player plays, which in turn leads to more points.
- Bonus Points (bps): This means a player played well, so it correlates directly with total points.
- Minutes: As with goals conceded, the more minutes a player plays, the higher the number of points.
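The correlations in the list above are Pearson correlation coefficients between a feature and total points. As a rough sketch of the calculation, with invented season totals for five players (not real FPL data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up season totals for five players.
stats = {
    "minutes":      [3000, 2500, 900, 1800, 300],
    "clean_sheets": [12, 9, 3, 6, 1],
    "total_points": [150, 130, 40, 95, 12],
}

for feature in ("minutes", "clean_sheets"):
    r = pearson(stats[feature], stats["total_points"])
    print(f"{feature}: r = {r:.2f}")
```

On a pandas dataframe, `DataFrame.corr()` computes the same coefficient for every pair of numeric columns at once.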

Looking further into the details of the data, the averages and medians show that *forwards earn the highest number of points while goalkeepers earn the least*. One has to bear in mind, however, that there are far fewer forwards and goalkeepers than midfielders and defenders, which will matter later when drawing conclusions from the data.

## Exploring the data visually (Kernel Density Function)

Given the differences in counts, I thought it would be a good idea to have a look at some kernel density estimations to show the distribution of points based on the position of the players. This should give more of an indication as to the *distribution of points for each position*.

Most of the data is positively skewed and slightly leptokurtic (`kurtosis > 3`). The goalkeeper distribution, however, seems to be bimodal. This makes sense: most clubs have 2–3 goalkeepers at a time and play the better goalkeeper the majority of the time, giving him more time to earn points. The others play few games or none, as they are only brought in when the first-choice keeper is injured or needs to be rested.
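For reference, skewness, kurtosis and a kernel density estimate can all be computed from first principles. The sketch below uses population moments, a Gaussian kernel and a hand-picked bandwidth; the points totals are invented, and the bandwidth of 20 is an arbitrary choice for illustration.

```python
from math import exp, pi, sqrt

def moments(xs):
    """Return (skewness, Pearson kurtosis) using population moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def gaussian_kde(xs, at, bandwidth):
    """Kernel density estimate at the point `at` with a Gaussian kernel."""
    norm = len(xs) * bandwidth * sqrt(2 * pi)
    return sum(exp(-0.5 * ((at - x) / bandwidth) ** 2) for x in xs) / norm

# Made-up points totals: many low scorers and a few very high ones.
points = [0, 0, 0, 2, 5, 8, 10, 15, 20, 30, 45, 60, 90, 150, 200]

skew, kurt = moments(points)
print(f"skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
print([round(gaussian_kde(points, x, bandwidth=20), 4) for x in (0, 50, 100, 200)])
```

A positive skewness means a long right tail, and a Pearson kurtosis above 3 (the value for a normal distribution) is what "leptokurtic" refers to. Note that some libraries report excess kurtosis instead, which is the same quantity minus 3.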

## Imputations

Now, we have a few limitations in the data, and one of them causes the positive skew and the leptokurtic shape: *there are many players who had been transferred into the Premier League, had no historic data and therefore had no points from previous seasons*. This makes it very difficult to pick the best choices.

But this further shows the naivety of the approach in which I have assumed that *players will replicate their performances from previous seasons or at least perform similarly*. Essentially, I would like to select some of the newly transferred players as I imagine they would be out to try and impress and potentially perform better.

Now, I set out to impute this data to render it usable. Players were grouped by cost, and players with the same cost and position had the missing values of each stat imputed using the average or median of that group. Some gaussian noise multiplied by the SD of the particular stat was added to the imputed values, both to simulate random performance chances for players with no data and to add some variance to the dataset, because there is a large number of 0-scoring players (25.55%). To ensure that the variance isn’t too erratic, I used a controlled gaussian by dividing the noise by 1.5.

Here, I used a number of nested loops to store the medians and standard deviations specific to the position and cost of a player for each of the featured columns. I made sure to use the absolute value so as to avoid negative imputations, to help skew the data towards a more normal distribution, and to remain consistent with the assumption that transferred players will be looking to perform well to impress. I softened the gaussian noise just a bit to avoid erratic data. After the imputation, I visualised the distributions and kernel density estimations again, and the data begins to resemble a gaussian curve, except for the goalkeeper estimation, which remains bimodal for the reason stated earlier.
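The imputation rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the original notebook code: a value of 0 is treated as "no data", groups are keyed by (position, cost), and the imputed value is median + |gaussian noise| / 1.5 × SD of the group. The function name and field names are hypothetical.

```python
import random
from statistics import median, pstdev

def impute_points(players, stat="total_points", softening=1.5, seed=0):
    """Fill zero-valued `stat` with median + |noise| / softening * SD of the
    player's (position, cost) group. Returns a new list of dicts."""
    rng = random.Random(seed)
    groups = {}
    for p in players:
        if p[stat] > 0:  # only players with history inform the group statistics
            groups.setdefault((p["position"], p["cost"]), []).append(p[stat])

    imputed = []
    for p in players:
        p = dict(p)  # copy, so the input list is left untouched
        known = groups.get((p["position"], p["cost"]))
        if p[stat] == 0 and known:
            noise = abs(rng.gauss(0, 1)) / softening  # softened, non-negative noise
            p[stat] = median(known) + noise * pstdev(known)
        imputed.append(p)
    return imputed

# Hypothetical players: the last one has no history.
players = [
    {"name": "A", "position": "MID", "cost": 55, "total_points": 90},
    {"name": "B", "position": "MID", "cost": 55, "total_points": 110},
    {"name": "C", "position": "MID", "cost": 55, "total_points": 0},
]
print(impute_points(players)[-1])
```

Because the noise is made non-negative, an imputed value never falls below the group median, which matches the optimistic assumption about newly transferred players.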

## Modeling the Linear Problem

Since my dataset was now ready for use, I started to model the problem as a system of linear expressions. I began by modelling the selection of the best possible 11 players with as much money as I could afford to use. I imported everything from `PuLP`, declared my Linear Programming problem, named it *“Fantasy Team”* and indicated that it is a maximising linear programming problem, as I looked to maximise the number of points that I can get.

**Deciding the decision variables**

I decided to use each of the available players as the decision variables where their values were binary in that I chose them (1) or I didn’t (0). This is achieved by casting the variables as integer variables. There were 628 players for me to choose from, therefore there were *628 decision variables*.
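A minimal sketch of the problem declaration and the decision variables, using four hypothetical player ids instead of the full 628:

```python
from pulp import LpMaximize, LpProblem, LpVariable

# Hypothetical player ids; the real model has one entry per available player.
player_ids = ["p1", "p2", "p3", "p4"]

# An underscore is used in the name because PuLP does not allow spaces in it.
prob = LpProblem("Fantasy_Team", LpMaximize)

# One 0/1 decision variable per player: 1 = selected, 0 = not selected.
decisions = LpVariable.dicts("player", player_ids, cat="Binary")

print(len(decisions), decisions["p1"].lowBound, decisions["p1"].upBound)
```

In PuLP a `Binary` variable is simply an integer variable bounded between 0 and 1, which is why the text speaks of casting the variables as integers.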

**Developing the optimisation function**

Next, I assigned the *optimisation function*. This is what I am trying to maximise, as previously stated in my modelling of the problem. Each player either earned a given number of points in the previous season or has been imputed a value of median + (soft gaussian noise * standard deviation), dependent on the cost and position of that player and of the other players within that cost bracket in the same position. These values are assigned to each of the decision variables (players) and used to construct the function as shown below.
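In PuLP the objective is the first plain expression added to the problem. A sketch with hypothetical points values (real or imputed) for four players:

```python
from pulp import LpMaximize, LpProblem, LpVariable, lpSum

# Hypothetical points per player (real or imputed).
points = {"p1": 150, "p2": 98.4, "p3": 0, "p4": 61}

prob = LpProblem("Fantasy_Team", LpMaximize)
decisions = LpVariable.dicts("player", list(points), cat="Binary")

# Objective: total points of the selected players, i.e. sum of points_i * x_i.
prob += lpSum(points[i] * decisions[i] for i in points), "total_points"

print(prob.objective)
```

Each player contributes their points only when their decision variable is 1, so maximising this sum means picking the highest-scoring combination the constraints allow.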

**Considering Cash Constraints**

Recall that I only had £100.0 million (1000 in the code) to spend on the whole squad, and that my tactic was to pick the best 11 with the cash left over after buying the cheapest 4 bench players. The cheapest players that I could choose for the bench cost a total of £17.0 million (170 in the code), which left me with £83.0 million (830 in the code) for the best-performing 11 players. I therefore set the constraint so that I could spend 830 or less on selecting the best 11 players.
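The budget arithmetic can be sketched as below. The player pool and its prices are invented for illustration; the bench is assumed to hold one player per position, which is what remains of a 2/5/5/3 squad once a 4–4–2 starting 11 has been taken out.

```python
# Prices in tenths of a million, as in the article's code (1000 = £100.0m).
# The pool below is hypothetical; real prices come from the FPL data.
pool = [
    ("GK", 40), ("GK", 45), ("DEF", 40), ("DEF", 50),
    ("MID", 45), ("MID", 60), ("FWD", 45), ("FWD", 75),
]

BUDGET = 1000
bench_positions = ["GK", "DEF", "MID", "FWD"]  # one substitute per position

# Cheapest available player in each bench position.
cheapest = {pos: min(cost for p, cost in pool if p == pos) for pos in bench_positions}
bench_cost = sum(cheapest.values())
starting_budget = BUDGET - bench_cost

print(bench_cost, starting_budget)
```

With these example prices the bench costs 170 and 830 is left for the starting 11, the same figures used in the cash constraint.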

**Considering Player Constraints**

Next, I assigned the constraints based on the number of players that I am going to choose. I decided to go with a traditional 4–4–2 formation (4 defenders, 4 midfielders and 2 forwards along with the goalkeeper) because of the number of defenders and midfielders available in the data and the exploration we did, and I had the constraints reflect this. Of all the decision variables, some are defenders, some are midfielders and some are strikers; the constraints make sure that I select exactly 4 defenders, 4 midfielders and 2 strikers.

- *Goalkeeper constraint: 1 goalkeeper*
- *Defender constraint: 4 defenders*
- *Midfielder constraint: 4 midfielders*
- *Forward constraint: 2 forwards*

**Considering Team Constraints**

I ensured that at most 3 players are selected from any given team. To achieve this, I used a `hash table` to store each team together with its players (decision variables), each with a coefficient of 1, so that the number of selected players from a team cannot exceed the limit, in this case 3.
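A sketch of the team constraint, with a Python dictionary playing the role of the hash table. The players and team names are hypothetical:

```python
from collections import defaultdict
from pulp import LpMaximize, LpProblem, LpVariable, lpSum

# Hypothetical players: id -> team.
team_of = {"p1": "Team A", "p2": "Team A", "p3": "Team A",
           "p4": "Team A", "p5": "Team B"}

prob = LpProblem("Fantasy_Team", LpMaximize)
decisions = LpVariable.dicts("player", list(team_of), cat="Binary")

# Hash table: team -> decision variables of that team's players.
by_team = defaultdict(list)
for pid, team in team_of.items():
    by_team[team].append(decisions[pid])

MAX_PER_TEAM = 3
for team, team_vars in by_team.items():
    # Each variable has coefficient 1, so the sum counts the selected players.
    prob += lpSum(team_vars) <= MAX_PER_TEAM, f"max_{team.replace(' ', '_')}"

print(list(prob.constraints))
```

One constraint is added per team, so with 20 Premier League clubs the real model gets 20 of these.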

**Solving the Linear Problem**

Now that we had put in all our decision variables, constraints and the optimisation function, it was time to finally solve the Linear Programming Problem! I also asserted that the solver reported the result as optimal.
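Putting the pieces together, here is an end-to-end sketch on a toy pool of eight invented players. To keep it short the rules are scaled down (one player per position, a budget of 300 and at most 2 players per team, where the article uses 1/4/4/2, 830 and 3), and it assumes PuLP's bundled CBC solver is available.

```python
from pulp import (LpMaximize, LpProblem, LpStatus, LpVariable, PULP_CBC_CMD,
                  lpSum, value)

# Hypothetical pool: id -> (position, team, cost in tenths of £m, points).
pool = {
    "g1": ("GK", "A", 50, 140),   "g2": ("GK", "B", 45, 120),
    "d1": ("DEF", "A", 60, 150),  "d2": ("DEF", "B", 45, 100),
    "m1": ("MID", "A", 120, 250), "m2": ("MID", "C", 80, 180),
    "f1": ("FWD", "B", 110, 210), "f2": ("FWD", "C", 70, 140),
}
position = {i: v[0] for i, v in pool.items()}
club = {i: v[1] for i, v in pool.items()}
cost = {i: v[2] for i, v in pool.items()}
points = {i: v[3] for i, v in pool.items()}

# Scaled-down rules for the toy pool.
quota = {"GK": 1, "DEF": 1, "MID": 1, "FWD": 1}
BUDGET, MAX_PER_TEAM = 300, 2

prob = LpProblem("Fantasy_Team", LpMaximize)
x = LpVariable.dicts("player", list(pool), cat="Binary")

prob += lpSum(points[i] * x[i] for i in pool)                # objective
prob += lpSum(cost[i] * x[i] for i in pool) <= BUDGET        # cash constraint
for pos, n in quota.items():                                 # player constraints
    prob += lpSum(x[i] for i in pool if position[i] == pos) == n
for team in sorted(set(club.values())):                      # team constraints
    prob += lpSum(x[i] for i in pool if club[i] == team) <= MAX_PER_TEAM

prob.solve(PULP_CBC_CMD(msg=False))
assert LpStatus[prob.status] == "Optimal"

team = sorted(i for i in pool if x[i].value() > 0.5)
print(team, value(prob.objective))
```

In this toy pool the highest-scoring affordable line-up would take three players from team A, so the team constraint forces the solver to settle for `g1`, `d1`, `m2` and `f1`, worth 680 points at a cost of exactly 300.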

# Results

Finally, a `pandas dataframe` was built from all the decisions made by the optimisation model and appended to the original dataset to see who had been selected to be in the dream team!
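A sketch of that last step is shown below. The three players and the 0/1 decisions are invented; in the real script the decisions would be read from the solved value of each PuLP variable.

```python
import pandas as pd

# Hypothetical dataset and solver output (1 = selected); not real FPL data.
players = pd.DataFrame({
    "name": ["Alex Keeper", "Sam Striker", "Jo Mid"],
    "position": ["GK", "FWD", "MID"],
    "cost": [45, 110, 80],
    "total_points": [120, 200, 150],
})
decisions = [1, 0, 1]  # stand-in for the solved decision variables

# Append the decisions as a new column and keep only the selected players.
players["selected"] = decisions
dream_team = players[players["selected"] == 1]

print(dream_team)
print("cost:", dream_team["cost"].sum(), "points:", dream_team["total_points"].sum())
```

Summing the `cost` column of the selected rows is also a quick sanity check that the team respects the budget.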

This is the team that was built by the model. As a football fan (and someone who plays FPL regularly), I can surely say that this was a very strong team and it actually got me a lot of points when the season restarted after Covid-19.

# Conclusion

Although football is a game in which any player can change the game on their day and the statistics can change, the data and statistics usually favour the truth. Also, the game is not only played on the field but is played mentally as well and the form matters a lot, which in turn is related to the price. Considering all these things, I can say that I was able to build a team that seemed to be a very strong team in that season. The football season was hampered by Covid-19 at that point but when it resumed, I used this team and actually got myself a lot of points as an experiment.

I still use almost the same script with updated data every week to get the right team to make some statistically correct decisions, although I don’t totally rely on the same script as it is still a naive approach.

In conclusion, however complex the game of football might be, if we consider the right equations, constraints and variables, we can predict almost all outcomes *near to total accuracy* with the help of linear programming.

*Finally, if you face any difficulties, feel free to contact me, or take a look at my GitHub gist **here** (it may have been updated a bit, but there are enough comments to help you through it).*