Feature Selection for Fantasy Premier League

My crush on data analysis started back in the days when I used to crunch some numbers before setting up my team every game week, then watch my team suck like a pro. Back then I couldn't write even a small piece of code, and all the math I did could have been done by a third grader. Many years have passed since; I think my math and reasoning have improved, but my team still sucks. I am writing this blog as I get ready for another great season of fantasy football, this time with a little more stats, math and faith.

So, I got hold of the full dataset of the Fantasy Premier League 2016–17 season and built a model to predict a player's points in the upcoming game week based on attributes of his performance in the previous game week. The model's accuracy isn't that great, but there are a few good insights. Before the season begins, I would like to share a few things that could help you pick better players and build better teams.

Before moving further, a short intro to ML. Machine learning is a field of data analysis where you train a machine by feeding it data. In supervised learning, you feed the model both the predictor values and the result; the machine learns the patterns in the data and attempts to predict the result when a new case is presented to it. In our case, the predictor variables are the features of a player's performance in the previous game week, and what we try to predict is how much he will score in the upcoming game week.
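To make the idea concrete, here is a minimal supervised-learning sketch. The numbers are made up and the two features (minutes and threat from last week) are just stand-ins for the real feature set; the post does not say which algorithm was used, so a plain linear regression serves as the illustration.

```python
from sklearn.linear_model import LinearRegression

# Toy data: [minutes_LW, threat_LW] for five players (made-up numbers)
X = [[90, 40], [60, 10], [90, 55], [30, 5], [90, 70]]
# Points each player actually scored the following game week
y = [6, 2, 8, 1, 9]

model = LinearRegression()
model.fit(X, y)                    # the "training" step: learn patterns
pred = model.predict([[90, 50]])   # predict for a new, unseen player
print(pred)
```

The model has now "seen" how last week's numbers relate to next week's points and can score any new player you hand it.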

So here is the list of all the features (the predictors) that the model will use to predict a player's score in the upcoming game week:

'yellow_cards_LW',
'tackles_LW',
'goals_conceded_LW',
'winning_goals_LW',
'errors_leading_to_goal_attempt_LW',
'saves_LW',
'influence_LW',
'key_passes_LW',
'transfers_balance_LW',
'goals_scored_LW',
'own_goals_LW',
'creativity_LW',
'bonus_LW',
'big_chances_created_LW',
'ict_index_LW',
'total_points_LW',
'penalties_missed_LW',
'attempted_passes_LW',
'completed_passes_LW',
'target_missed_LW',
'errors_leading_to_goal_LW',
'was_home_LW',
'recoveries_LW',
'clean_sheets_LW',
'assists_LW',
'open_play_crosses_LW',
'clearances_blocks_interceptions_LW',
'penalties_conceded_LW',
'offside_LW',
'ea_index_LW',
'penalties_saved_LW',
'fouls_LW',
'red_cards_LW',
'loaned_out_LW',
'bps_LW',
'dribbles_LW',
'threat_LW',
'tackled_LW',
'big_chances_missed_LW',
'minutes_LW',
'team_points_LW',
'team_result_LW',
'transfers_balance_CW',
'was_home_CW',
'opponent_rank_LW',
'opponent_rank_CW',
'tranfers_balance_cummulative_LW',
'FixtureRank_LW',
'TeamRank'

The features are a mix of variables from the last game week and the current game week, such as opponent, opponent rank (the opponent team's finishing position last season), whether it is a home fixture, current-week transfer balance (transfers in minus transfers out), etc. I made sure that every variable picked from the current game week is something known before the match kicks off; otherwise I would be building a model that is actually a cheat (in ML terms, a model that suffers from data leakage).
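The "_LW" features can be built by shifting each player's stats forward one game week, so a row for game week N only carries information known before N kicks off. A sketch under an assumed schema (player, gameweek, per-week stats; the names and numbers are illustrative):

```python
import pandas as pd

# Assumed raw shape: one row per player per game week
df = pd.DataFrame({
    "player": ["Kane", "Kane", "Kane", "Lloris", "Lloris", "Lloris"],
    "gameweek": [1, 2, 3, 1, 2, 3],
    "total_points": [9, 2, 13, 6, 1, 7],
    "threat": [70, 20, 90, 0, 0, 0],
})

df = df.sort_values(["player", "gameweek"])
for col in ["total_points", "threat"]:
    # shift(1) within each player: game week N gets game week N-1's value
    df[f"{col}_LW"] = df.groupby("player")[col].shift(1)

# Game-week-1 rows have no "last week", so drop them before training
train = df.dropna(subset=["total_points_LW"])
print(train[["player", "gameweek", "total_points_LW", "threat_LW"]])
```

Because the shift happens within each player's own history, no future stat can leak into the row used to predict it.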

When I explored the data, I found that different attributes behave differently, depending on player position, when it comes to predicting the score.

Average threat and average total points are highly correlated for midfielders and forwards, but the correlation is low for defenders and absent for goalkeepers.

Similarly, average creativity is more strongly correlated with average points for midfielders than for any other position. So if we train a single model on the whole dataset, across all positions, we end up adding noise to the model.
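The position-by-position check is a one-liner with pandas. The numbers below are made up purely to show the pattern: a strong threat-to-points relationship for forwards and essentially none for goalkeepers.

```python
import pandas as pd

# Illustrative per-player season averages (not real FPL data)
df = pd.DataFrame({
    "position": ["FWD"] * 4 + ["GK"] * 4,
    "avg_threat": [60, 40, 80, 20, 0, 2, 1, 0],
    "avg_points": [6, 4, 8, 2, 4, 3, 5, 4],
})

# Pearson correlation between threat and points, computed per position
corr = df.groupby("position")[["avg_threat", "avg_points"]].apply(
    lambda g: g["avg_threat"].corr(g["avg_points"])
)
print(corr)
```

A pooled model would average these two very different relationships together, which is exactly the noise described above.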

This can be tackled by building four different models, one per position, each trained only on the data of players belonging to that particular position. Hence I ended up training four models for the four positions: GK, Def, Mid and Fwd.
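The per-position setup looks roughly like this. The post does not name the algorithm, so the random forest here is an assumption, as are the column names and toy numbers:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_per_position(df, features, target="points_CW"):
    """Return a dict mapping position -> model trained only on that slice."""
    models = {}
    for pos, rows in df.groupby("position"):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(rows[features], rows[target])
        models[pos] = model
    return models

# Tiny illustrative dataset: two players per position
df = pd.DataFrame({
    "position": ["GK", "GK", "DEF", "DEF", "MID", "MID", "FWD", "FWD"],
    "threat_LW": [0, 1, 10, 20, 50, 60, 70, 90],
    "total_points_LW": [3, 6, 2, 7, 5, 9, 4, 11],
    "points_CW": [4, 5, 3, 6, 6, 8, 5, 10],
})

models = train_per_position(df, ["threat_LW", "total_points_LW"])
print(sorted(models))
```

At prediction time you simply route each player to the model for his position.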

When you build a model, it not only tries to predict the output but also learns how important each training feature is. For example, consider two goalkeepers: one who is average but has a very easy home game, and another who had a very good last game but has a very tough away game coming up. It is never easy for us to do the math and say who has the better probability of a clean sheet, but for the machine it is a walk in the Goodison park. So let's fetch, position by position, the features the models identified as important for predicting a player's points in the upcoming fixture.
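For tree-based models, "fetching the important features" means reading the fitted model's `feature_importances_` attribute. A sketch with a random forest and made-up data (the real feature set is the full list above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = ["transfers_balance_CW", "opponent_rank_CW", "was_home_CW"]
# Illustrative rows only: [transfer balance, opponent rank, home flag]
X = pd.DataFrame(
    [[120, 18, 1], [-50, 2, 0], [300, 15, 1],
     [-10, 5, 0], [200, 20, 0], [-80, 1, 1]],
    columns=features,
)
y = [7, 2, 8, 3, 6, 1]  # points scored in the upcoming game week

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort to get the ranking
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

Plotting that sorted series per position gives the four importance charts discussed below.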

Goalkeeper

Transfer balance in the current game week is the top feature; it is effectively a relative measure of how many managers have brought the player in this game week. Opponent rank is the next most important current-week feature. We usually give great weight to home fixtures when picking a goalkeeper, but being at home matters less than many metrics from the previous game week. The goalkeeper's own total points last week only lands mid-table on the importance scale.

Defender

Transfer balance in the current game week continues to hold the top place. The rank of the player's team (his club's finishing position last season) is the next most important feature, followed by opponent rank. Here too, whether the upcoming fixture is at home is relatively unimportant, and the points the defender secured last game week don't even make the top 20 important features.

Midfielder

For a midfielder, last week's threat is a better predictor than last week's total points or bonus. So when picking a midfielder, look for one whose threat last game week was high, rather than just looking at the points he scored.

Forward

For forwards too, the total points scored in the last game week don't make it to the top 20. Surprisingly, goals conceded last game week is up there in the list; there must be some spurious correlation between the two.

Each of these four feature-importance charts holds a handful of insights you can uncover if you spend some time looking at them. I have written only about the tip of the iceberg, leaving the rest for you to explore yourself. I shall sign off now by wishing you all a great season of fantasy footballing.

Online course on data analysis - $12,
Book on Python programming - $15,
Keeping Gabbiadini as captain on a double game week and scoring 4 points - priceless.