Image for post
Photo by Alex Motoc on Unsplash

How I used Data Science to get into the top 1% on the return to Fantasy Premier League.

James Asher
Aug 5 · 12 min read

After the 100-day shutdown of football, and consequently of the Fantasy Premier League competition, the return of football was greatly anticipated. Even without fans at the games, people were desperate for the return of the beautiful game. The restart was interesting because allowances had to be made so that the season could be finished as swiftly and with as few interruptions as possible. This resulted in a ‘festival of football’ in which there were games almost every day for six weeks. To account for the delay and for how much had changed over those 100 days, Fantasy Premier League (FPL) allowed unlimited free transfers before the first game of the return. This, of course, was a major opportunity.

There is no difficult maths involved in this; it’s something anyone with a very basic understanding of any programming language (or even Excel) could do in an afternoon.

On the return to FPL there were nine gameweeks remaining, the first being a double gameweek for four teams (Arsenal, Aston Villa, Sheffield United and Manchester City), which means each of them played twice. The double gameweek arose from the need to bring these four teams up to the same number of fixtures played as the rest of the league. This complicated things slightly but also presented an opportunity.

I had one main objective: to maximise points over the remaining nine gameweeks. The double gameweek complicates this, as there is a trade-off between picking players from the four teams mentioned above to maximise points in the first week and potentially having to transfer them out afterwards. For example, players from Aston Villa, who sat 19th, were not a good choice going forward but were likely to be profitable in the double gameweek. This is a problem because FPL allows only one free transfer per week; any further transfer costs a points deduction. Unused free transfers can be carried over, but a maximum of two can be held at any one time.

In short, the aim was to maximise total points over the remaining nine gameweeks while taking as much advantage as possible of the opportunity that the double gameweek presented.

The logic is as follows: if we maximise return on investment (ROI) within our budget constraint (can you tell I’m an Economist?) and use our full budget (or as close to it as possible), we will maximise points in the long run.

You can think about it in terms of two potential teams (Team A and Team B) with identical budget constraints of £100m. Suppose we calculate the ROI for both teams and the ROI of Team A is greater than the ROI of Team B:

Image for post
Medium isn’t yet great at supporting LaTeX, hence the picture. For a good guide on how to incorporate LaTeX into your writing, check out https://medium.com/@tylerneylon/how-to-write-mathematics-on-medium-f89aa45c42a0

This shows that if we use our full budget, the team with the highest ROI will be the team that produces the most points, with a few caveats that we will discuss shortly.
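Written out, the argument runs roughly as follows. This is a sketch that assumes ROI means total points divided by total cost, with P for a team's points and C for its cost in £m (the symbols are introduced here purely for illustration):

```latex
\mathrm{ROI}_A = \frac{P_A}{C_A}, \qquad
\mathrm{ROI}_B = \frac{P_B}{C_B}, \qquad
C_A = C_B = 100

\mathrm{ROI}_A > \mathrm{ROI}_B
\;\Longrightarrow\;
\frac{P_A}{100} > \frac{P_B}{100}
\;\Longrightarrow\;
P_A > P_B
```

The step from ROI to points only holds when both teams spend the same amount, which is why using the full budget matters.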

Getting stuck in and downloading the data, we have a data set which looks like this (Arsenal players appear first due to nothing more than alphabetical order). The only real edits I had to make were to tidy up the indexes and create some new variables. This example shows the head of the data set (the first five rows).

Image for post
The dataset I used was very good and full credit for the scraping goes to (vaastav @ https://github.com/vaastav/Fantasy-Premier-League).

I decided to start off with choosing a goalkeeper as it tends to be a position that isn't changed much. This actually turned out to be a bit of a disaster but I feel like my methodology was sound — I was just unfortunate. Luckily the rest of the team made up for it. So as mentioned, the main stat I was going to look for in players was ROI.

Therefore, I simply organised players by ROI and highlighted the top players.

Image for post
I was a hasty fool and didn’t read the README for the data when I downloaded it. I therefore missed the section describing the variable that gives a player’s position (it was called element type). As a result, I have not filtered players by position. Luckily I know almost all of the players, so this wasn’t a problem for me. In future, however, I won’t be so lazy and will filter by position.
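For anyone who wants to reproduce the ranking, here is a minimal pandas sketch of it, including the position filter I skipped. The column names (`web_name`, `element_type`, `total_points`, `now_cost`), the coding of 1 for goalkeepers and the price unit are assumptions to check against the README, and the rows are made-up numbers rather than real player data.

```python
import pandas as pd

# Toy rows with made-up numbers; the real data comes from the vaastav repository.
# Column names and the position coding are assumptions, so check the README.
players = pd.DataFrame({
    "web_name": ["Keeper A", "Keeper B", "Defender C", "Forward D"],
    "element_type": [1, 1, 2, 4],       # assumed coding: 1 = goalkeeper
    "total_points": [120, 110, 130, 150],
    "now_cost": [4.8, 5.2, 6.5, 9.6],   # price in £m (assumed unit)
})

# ROI: points returned per unit of price.
players["roi"] = players["total_points"] / players["now_cost"]

# Rank everyone by ROI, then keep only the goalkeepers.
ranked = players.sort_values("roi", ascending=False)
goalkeepers = ranked[ranked["element_type"] == 1]
print(goalkeepers[["web_name", "now_cost", "total_points", "roi"]])
```

The absolute size of the ROI figure depends on the unit the price is stored in, but the ranking does not.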

As can be seen, Nick Pope has the highest ROI of any player, let alone any goalkeeper. His only potential rival is Dean Henderson. There was an argument for Henderson, as he plays for one of the teams that start with a double gameweek. However, in the following gameweek Sheffield United play Manchester United, the club that loaned him to Sheffield United, and he is therefore ineligible for that game. The double gameweek advantage is thus cancelled out, and I decided to stick with Nick Pope as he has a superior points total as well as ROI. As a further check, I also graphed each team’s remaining fixtures in terms of difficulty, calculated as the average current league position of the team’s remaining opponents, as can be seen below.

Image for post
The difficulty represents the average league position of a team’s remaining opponents. Therefore, paradoxically, the higher the bar, the easier the fixtures. All analysis was done in Python in a Jupyter notebook (pandas, matplotlib, seaborn and numpy).

As can be seen, Manchester United on paper have the easiest fixtures of any team, with an average opponent position of 12th, while Bournemouth have the most difficult. This measure is obviously not perfect, as some mid-table teams have little to play for and some teams at the bottom are fighting for their lives. Burnley’s remaining fixtures are slightly easier than Sheffield United’s, which again bodes well for Nick Pope. Therefore, the final decision to make him the first player in the team was a relatively easy one.
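The fixture-difficulty measure can be sketched in a few lines of pandas. The team names, league positions and fixture list below are hypothetical placeholders, and the table layout (one row per remaining fixture) is an assumption rather than the format of the original data.

```python
import pandas as pd

# Hypothetical league positions and remaining fixtures (illustrative only).
position = {"Team A": 1, "Team B": 8, "Team C": 15, "Team D": 20}
remaining = pd.DataFrame({
    "team":     ["Team A", "Team A", "Team B", "Team B", "Team C", "Team D"],
    "opponent": ["Team C", "Team D", "Team A", "Team C", "Team A", "Team B"],
})

# Look up each opponent's current league position.
remaining["opp_position"] = remaining["opponent"].map(position)

# Average opponent position per team: a higher value means easier fixtures.
difficulty = (
    remaining.groupby("team")["opp_position"]
    .mean()
    .sort_values(ascending=False)
)
print(difficulty)
```

With matplotlib installed, `difficulty.plot(kind="bar")` would draw a bar chart of this series.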

Because the first week was a double gameweek, I decided to use my bench boost chip to double down on it. For anyone who doesn’t know, the bench boost chip is a token that can be used once per season and lets you score points for your bench players as well as your starting 11. The issue with bench players is that after this gameweek they are unlikely to make it into the first team (unless mass injuries occur), so it wouldn’t be wise to spend a lot of money on them. I therefore set three main criteria for the players I would choose for the bench: first, that they were in double gameweek teams; second, that they were averaging above 60 minutes a game; and finally, that they cost less than £5 million. This would give a robust bench of guaranteed starters.

With four bench players each playing twice, we would expect a minimum of 16 extra points (4 players × 2 games × 2 points), plus any other points they pick up over their combined eight games. A player receives one point for making an appearance and another for playing at least 60 minutes, as well as big bonuses for goals, assists and clean sheets. I therefore filtered the data set to players costing less than five million pounds who were playing at least 60 minutes a game on average. This also turned out to be an easy decision, as in the top 10 there were only three players who fit this category (Reina, Lundstram and Mings). These players were added to the bench, while I decided to keep the last bench spot free for some flexibility at this stage.
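The three bench criteria translate into a single filter. This is a rough sketch only: the column names, the `games_played` column used to work out average minutes, and the toy rows are all assumptions for illustration.

```python
import pandas as pd

DOUBLE_GW_TEAMS = {"Arsenal", "Aston Villa", "Sheffield United", "Manchester City"}

# Made-up players; "games_played" is a hypothetical column for the average.
players = pd.DataFrame({
    "web_name": ["P1", "P2", "P3", "P4"],
    "team": ["Aston Villa", "Arsenal", "Burnley", "Sheffield United"],
    "now_cost": [4.4, 5.6, 4.5, 4.9],   # price in £m (assumed unit)
    "minutes": [2500, 2600, 2400, 1200],
    "games_played": [28, 29, 29, 28],
    "total_points": [100, 120, 110, 60],
})
players["roi"] = players["total_points"] / players["now_cost"]
players["mins_per_game"] = players["minutes"] / players["games_played"]

# Double gameweek team, more than 60 minutes a game, under £5m.
bench = players[
    players["team"].isin(DOUBLE_GW_TEAMS)
    & (players["mins_per_game"] > 60)
    & (players["now_cost"] < 5.0)
].sort_values("roi", ascending=False)
print(bench[["web_name", "team", "now_cost", "roi"]])

# Points floor if all four bench players play 60+ minutes in both games:
# 4 players x 2 games x 2 appearance points.
min_bench_points = 4 * 2 * 2
```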

For the defence I followed an almost identical approach to before: I found the players with the highest ROI and the easiest upcoming fixtures. The results here were strange, with Liverpool’s three star defenders (Van Dijk, Robertson and Trent Alexander-Arnold) coming out on top in terms of ROI. They are, however, all extremely expensive (over £7m), so it would make sense to choose two of them at most, and possibly only one. Referring back to the fixtures graph, Liverpool don’t exactly have the easiest fixtures, and at this point they had effectively won the league. For fear of rotation, I decided to go for only one Liverpool defender, and chose Virgil van Dijk simply because he had the highest ROI. I followed this by choosing Doherty from Wolves, as he was the best of the rest, and Harry Maguire, due to his reasonably high ROI, the fact that he had played every minute of every game, and Manchester United’s easy fixture list.

The only problem with the ROI model is that in FPL you choose one player as captain, who receives double points, and one as vice-captain, who takes the armband if your captain doesn’t play any minutes that week. A team of mid-to-low value players, all with high ROI, might therefore not be optimal, since it would not take full advantage of the double points. So I decided to have at least two high-value, high-return players who would be captain choices each week, sacrificing a little of the budget to do so.

Therefore, I sorted the data to show the ROI and total points for only the 15 top points scoring players.
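A sketch of that two-step sort, taking the top scorers first and then ordering them by ROI, might look like this. The function name `top_scorers_by_roi`, the column names and the toy numbers are illustrative assumptions, not the original notebook code.

```python
import pandas as pd

def top_scorers_by_roi(players: pd.DataFrame, n: int = 15) -> pd.DataFrame:
    """Take the n highest point scorers, then order them by ROI."""
    top = players.nlargest(n, "total_points").copy()
    top["roi"] = top["total_points"] / top["now_cost"]
    return top.sort_values("roi", ascending=False)

# Made-up numbers for illustration only.
players = pd.DataFrame({
    "web_name": ["A", "B", "C", "D"],
    "total_points": [200, 180, 100, 90],
    "now_cost": [12.5, 9.0, 4.0, 5.0],
})

# With n=2 only A and B qualify on points; B then ranks first on ROI.
shortlist = top_scorers_by_roi(players, n=2)
print(shortlist)
```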

Image for post
Image for post

Looking at the top 15 point scorers in terms of ROI, we already have the top three players in the team, which is reassuring. We can also exclude the other defenders, as the defence is already sorted. I cut Aubameyang and Firmino immediately, as their ROI is significantly worse than everyone else’s and their total points aren’t high enough to compensate. I also did not feel that Rashford’s, Richarlison’s or Abraham’s total points were high enough, so they were removed too. This left Vardy, Mane, Salah and De Bruyne. De Bruyne was the first pick and an easy one. We have already seen that City had extremely easy fixtures, and he also has the second-highest ROI and total points. City were also one of the double gameweek teams, so he is an ideal pick for captain in the first week back. I also eliminated Mane at this point, because he had lower total points and ROI than Salah and both play for Liverpool.

I couldn’t make up my mind between Vardy and Salah at this point, so I decided to carry on with the rest of the team and come back to this choice later, when it was clearer how much budget I would have left.

As I’ve already mentioned, this analysis is very straightforward, and I was concerned that it might overlook some players: for example, players who arrived in January (I’m sure all Premier League fans can think of one player who has done particularly well after arriving in January), as well as players with long-standing injuries who were fit again after the break. Therefore, I decided to try to find these players.

I also wanted to look for players who weren’t common picks: players who were performing well but flying just under the radar. If these players succeed and few or no other managers have them, the points they score are gained on almost everyone else. In FPL (especially towards the end of the season) managers’ teams start to look awfully similar to each other. People at the top of the table often mirror the teams of those below them to make it impossible to be caught.

First, I created a new metric, ROM (return on minutes), which is the same as ROI but measures points per minute played rather than points per pound invested. This is useful in conjunction with ROI for the reasons previously mentioned: a player might have played only three or four games, so their total points and resulting ROI will be really low. They may, however, have played extremely well in those three or four games, and ROM will capture this. If a player has a high ROM and is expected to get a lot more game time than before (having recovered from injury, etc.), they may be a good choice. Therefore, I gathered the top 15 players in terms of ROM.
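As a rough sketch, ROM is one extra column. The column names and the toy rows are assumptions for illustration; dropping players with zero minutes is a small guard added here so that the division is always defined.

```python
import pandas as pd

# Made-up numbers for illustration only.
players = pd.DataFrame({
    "web_name": ["Regular", "New signing", "Late sub", "Unused"],
    "total_points": [150, 60, 10, 0],
    "minutes": [2700, 800, 45, 0],
})

# Drop players who have not played, to avoid dividing by zero.
played = players[players["minutes"] > 0].copy()

# ROM: points per minute played.
played["rom"] = played["total_points"] / played["minutes"]

top_rom = played.nlargest(15, "rom")
print(top_rom[["web_name", "minutes", "rom"]])
```

In this toy data the player with a single short appearance tops the list, so the ROM ranking needs reading alongside the minutes column.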

Image for post

As can be seen, there are some of the usual suspects but also some new names that haven’t appeared on many lists so far. A fair few of these are substitutes who are going to continue getting the same limited minutes and are not worth considering (Ighalo, Batshuayi, etc.). One player stands out (you might have guessed from the previous hint): Bruno Fernandes. He was certain to play every game going forward, and his ROM was not inflated by a handful of substitute appearances, as he already had around ten games under his belt.

I also created similar lists for other factors, such as time played. I created a top-ROI list for players who had played almost every minute of the league so far, the so-called indispensable players: those who, when fit, play the full 90 and rarely get injured. These would act as stable players in the squad who would not have to be rotated often, players I could rely on for a good haul of points each week without worrying about whether they would play.
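One way to sketch the "indispensable" list is to keep players above some share of the maximum possible minutes and then rank by ROI. The 90% threshold, the function name `indispensable`, the column names and the toy numbers are all assumptions, since the exact cut-off isn't stated here.

```python
import pandas as pd

def indispensable(players: pd.DataFrame, games_so_far: int,
                  share: float = 0.9) -> pd.DataFrame:
    """Players with at least `share` of all possible minutes, ranked by ROI."""
    max_minutes = 90 * games_so_far
    regulars = players[players["minutes"] >= share * max_minutes].copy()
    regulars["roi"] = regulars["total_points"] / regulars["now_cost"]
    return regulars.sort_values("roi", ascending=False)

# Made-up numbers for illustration only.
players = pd.DataFrame({
    "web_name": ["X", "Y", "Z"],
    "minutes": [2610, 2400, 1500],
    "total_points": [120, 100, 130],
    "now_cost": [6.0, 4.0, 5.0],
})

# 29 games played so far gives 2610 possible minutes; Z falls below 90% of that.
ever_presents = indispensable(players, games_so_far=29)
print(ever_presents)
```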

I created a similar list of players who had been picked by fewer than 10% of managers. As mentioned, this was a good way to find hidden gems. Neal Maupay was an outstanding choice from this list, and Brighton had some easier fixtures early on. Maupay was added to the team alongside Fernandes and Mahrez (high ROI, extremely high ROM and a great fixture list).
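The low-ownership list is another one-line filter. In this sketch the ownership column is assumed to be called `selected_by_percent` and may be stored as text, hence the numeric conversion; the function name `differentials` and the rows are made up for illustration.

```python
import pandas as pd

def differentials(players: pd.DataFrame, max_ownership: float = 10.0) -> pd.DataFrame:
    """Players owned by fewer than `max_ownership` percent, ranked by ROI."""
    # Convert in case the ownership figures were read in as strings.
    ownership = pd.to_numeric(players["selected_by_percent"])
    hidden = players[ownership < max_ownership].copy()
    hidden["roi"] = hidden["total_points"] / hidden["now_cost"]
    return hidden.sort_values("roi", ascending=False)

# Made-up numbers for illustration only.
players = pd.DataFrame({
    "web_name": ["Popular", "Gem", "Fringe"],
    "selected_by_percent": ["45.3", "3.2", "8.0"],
    "total_points": [180, 110, 40],
    "now_cost": [10.0, 5.5, 5.0],
})

hidden_gems = differentials(players)
print(hidden_gems)
```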

At this point I decided that Salah was going to be too expensive, considering who I had just added and the likelihood of rotation at Liverpool (this proved correct, as Salah was rested a few times in the run-in). So I added Vardy to the team (he didn’t last long after a poor run of form). I complemented him with Raúl Jiménez, the forward with the next-highest ROI. I also added John Fleck, as he was a low-cost, high-ROI player.

With only one space left, I created a very simple function that took a price ceiling and a price floor as inputs and returned the players in that bracket with the highest ROI. Luckily, Adama Traore fitted perfectly, with a really high ROI of 1.89. He was therefore the final piece of the jigsaw, and his price used up the remaining budget. The final team can be seen below. It probably didn’t maximise ROI overall, given the compromises mentioned earlier. However, it maximised ROI at every possible point without making clear and obvious (shoutout to VAR) errors.
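A function along those lines might look like the sketch below. The name `best_in_bracket`, its parameters, the column names and the toy rows are illustrative assumptions rather than the original notebook code.

```python
import pandas as pd

def best_in_bracket(players: pd.DataFrame, floor: float, ceiling: float,
                    n: int = 5) -> pd.DataFrame:
    """Return the n highest-ROI players priced between floor and ceiling (inclusive)."""
    in_bracket = players[players["now_cost"].between(floor, ceiling)].copy()
    in_bracket["roi"] = in_bracket["total_points"] / in_bracket["now_cost"]
    return in_bracket.nlargest(n, "roi")

# Made-up numbers for illustration only.
players = pd.DataFrame({
    "web_name": ["Cheap", "Mid A", "Mid B", "Premium"],
    "total_points": [60, 110, 90, 200],
    "now_cost": [4.0, 5.5, 6.0, 12.0],   # price in £m (assumed unit)
})

# Only the two mid-priced players fall in the bracket; the better ROI wins.
picks = best_in_bracket(players, floor=5.0, ceiling=6.5, n=1)
print(picks)
```

Setting the floor close to the ceiling is a simple way to make sure the last pick uses up whatever budget is left.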

Image for post

The final team and their gameweek scores can be seen above. The only major disappointment was Pepe Reina, who didn’t start either of Villa’s two games, as a younger goalkeeper was chosen instead. This couldn’t have been predicted and was just unfortunate. Pope also blanked as City gave Burnley a good old thumping, but he bounced back and proved to be a beast for the remainder of the season.

Image for post

The final points total was more than double the average and put the team in the top 1% (of more than 7 million players) for the week. I feel that I could have scored more in this individual week by making other choices, but not without sacrificing the team going forward. So a result of 118 points, with a team that was also chosen with upcoming fixtures in mind, is, I feel, a strong result.

Over the remaining weeks I totalled 676 points compared to the average of 482, which is just over 40% higher.

The next step is to design a proper selection algorithm centred around ROI and around calculating expected goals and clean sheets for teams each gameweek. I have in mind an algorithm that constantly transfers in the players with the highest ROI (using a moving average to account for form) and that also includes a weighting factor to control for fixture difficulty. I plan to try to build this before the next season starts. I think this will take a little longer than an afternoon this time.

I hope you enjoyed the read and please reach out if you have any questions or would like to follow or see any of the notebooks.

The Sports Scientist

Where Sports and Data Science combine

James Asher

Written by

Probably the best Scottish Economist since Adam Smith. I write about Economics, Psychology, Philosophy, and Data Science.

The Sports Scientist

In depth view of the world of Sports through the lens of Data Science, and Analytics
