Picking my FPL team based on data: A how-to guide (Part 1)

Jul 22, 2021

In this brief data exploration tutorial, I will be taking a look at the FPL API to retrieve player stats from last season and try to come up with features that may be useful in picking a team.

You can find the notebook here: https://github.com/rukeshdutta/FPL/blob/main/FPL.ipynb

Importing Libraries & setting up the display to show 500 rows and 500 columns

We will import the requests library to call the Fantasy Premier League (FPL) API, and pandas to convert the response into multiple DataFrames

import requests
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

Base URL

This is the base URL for the data. The response contains several keys, but for now we are mostly interested in three: elements, element_types and teams. At the time of writing the response holds last season's data; once the new season starts it will be populated with the current season's data. The columns of each DataFrame are listed a few lines further down.

base_url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

Base URL for historical data, we will use this later

This is the base URL for each player's historical data. We need to append a player ID, which can be found in the elements DataFrame, to request the data for the players we are interested in. A moving average over a player's past results can then serve as an indication of form.

historical_data_base_url= 'https://fantasy.premierleague.com/api/element-summary/'
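
As a rough sketch of the idea above, this is how the per-player URL could be built and how a moving average could be computed. The helper names player_history_url and rolling_form are hypothetical, the trailing slash on the URL is an assumption, and the points are made-up numbers rather than real API data.

```python
import pandas as pd

HISTORICAL_DATA_BASE_URL = "https://fantasy.premierleague.com/api/element-summary/"


def player_history_url(player_id: int) -> str:
    """Append the player's id to the base URL (trailing slash is an assumption)."""
    return f"{HISTORICAL_DATA_BASE_URL}{player_id}/"


def rolling_form(points: pd.Series, window: int = 3) -> pd.Series:
    """Moving average of points over the last `window` gameweeks."""
    # min_periods=1 lets the first gameweeks use whatever history exists so far.
    return points.rolling(window=window, min_periods=1).mean()


# Made-up gameweek points for one player, purely for illustration.
sample_points = pd.Series([2, 6, 1, 9, 12, 3], name="total_points")
form = rolling_form(sample_points, window=3)

print(player_history_url(1))
print(form.round(2).tolist())
```

A short window reacts quickly to a hot streak, while a longer window smooths out one-off hauls, so the window size is worth experimenting with.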

Getting the data as a JSON and converting it into relevant data frames

r = requests.get(base_url)
json = r.json()

elements_df = pd.DataFrame(json['elements'])
elements_types_df = pd.DataFrame(json['element_types'])
teams_df = pd.DataFrame(json['teams'])

teams_df.columns
Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
       'position', 'short_name', 'strength', 'team_division', 'unavailable',
       'win', 'strength_overall_home', 'strength_overall_away',
       'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
       'strength_defence_away', 'pulse_id'],
      dtype='object')

elements_df.columns
Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
       'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
       'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
       'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
       'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
       'second_name', 'selected_by_percent', 'special', 'squad_number',
       'status', 'team', 'team_code', 'total_points', 'transfers_in',
       'transfers_in_event', 'transfers_out', 'transfers_out_event',
       'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
       'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
       'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
       'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
       'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
       'creativity_rank_type', 'threat_rank', 'threat_rank_type',
       'ict_index_rank', 'ict_index_rank_type',
       'corners_and_indirect_freekicks_order',
       'corners_and_indirect_freekicks_text', 'direct_freekicks_order',
       'direct_freekicks_text', 'penalties_order', 'penalties_text'],
      dtype='object')

elements_types_df.columns
Index(['id', 'plural_name', 'plural_name_short', 'singular_name',
       'singular_name_short', 'squad_select', 'squad_min_play',
       'squad_max_play', 'ui_shirt_specific', 'sub_positions_locked',
       'element_count'],
      dtype='object')

Let’s have a look at one row

elements_df.head(1)

elements_types_df.head(1)

teams_df.head(1)

element_type is a numerical column that needs to be mapped against elements_types_df to retrieve the player's position. Team names come from teams_df in the same way.

elements_df['element_type'] = elements_df.element_type.map(elements_types_df.set_index('id').singular_name)

elements_df['team'] = elements_df.team.map(teams_df.set_index('id').name)
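
To make the lookup pattern above concrete, here is a minimal sketch on toy data. The frames toy_types and toy_players and their values are invented for illustration; only the set_index plus map technique mirrors the lines above.

```python
import pandas as pd

# Toy stand-ins for elements_types_df and elements_df (values invented).
toy_types = pd.DataFrame({"id": [1, 2], "singular_name": ["Goalkeeper", "Defender"]})
toy_players = pd.DataFrame({"web_name": ["A", "B", "C"], "element_type": [2, 1, 3]})

# Build an id -> name lookup Series, then map the numeric ids through it.
lookup = toy_types.set_index("id")["singular_name"]
toy_players["position"] = toy_players["element_type"].map(lookup)

# Ids with no match in the lookup come back as NaN, which is worth checking for.
unmatched = toy_players["position"].isna().sum()

print(toy_players)
print("unmatched rows:", unmatched)
```

Writing the result to a new column, as in this sketch, keeps the cell safe to re-run. Overwriting the original column means a second run maps names that are no longer ids and produces NaN everywhere.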

We will use the ID column from elements to pull historical data

player_name_id = ['id','first_name','second_name','team']
elements_df.head(5)[player_name_id]

After the mapping, element_type holds the player's position and team holds the team name

elements_df.loc[:,'code':].head()

Finding the best players with the provided ICT index rank

The API provides a rank for each player based on the ICT index, which combines Influence, Creativity and Threat. We will take a look at the top players by ICT index, by total points last season, and by value_season, which is total_points divided by the current cost of the player

player_detail_cols = ['id','first_name','second_name','team','ict_index','total_points','now_cost','value_season']

elements_df.sort_values(['ict_index_rank']).head(10)[player_detail_cols]
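
A quick worked example of the value calculation, on invented numbers. It assumes that now_cost is stored in tenths of a million (so 100 means 10.0m) and that value_season is total points per million of cost; both are my reading of the data, not something the API documents here. The column names cost_millions and points_per_million are hypothetical.

```python
import pandas as pd

# Invented players: A costs 10.0m, B costs 6.0m (now_cost assumed to be in tenths).
toy = pd.DataFrame(
    {"web_name": ["A", "B"], "total_points": [200, 150], "now_cost": [100, 60]}
)

toy["cost_millions"] = toy["now_cost"] / 10
toy["points_per_million"] = toy["total_points"] / toy["cost_millions"]

print(toy)
```

Under these assumptions the cheaper player B returns more points per million than A despite scoring fewer points overall, which is exactly the kind of player a budget-constrained squad needs.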

Finding the best players with max points last season

The ranking of the players changes slightly: Harry Kane moves up to number 2 by total points, from number 3 by ICT rank. The API also provides a position-wise ICT rank (ict_index_rank_type), which is the player's rank within their position

elements_df.sort_values(['total_points'],ascending=False).head(10)[player_detail_cols]

Trimming the data frame to keep relevant columns for now

There are a lot of columns that we can ignore for the time being. I think the following columns will be helpful in finding a perfect team

# .copy() gives an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
trimmed_elements_df = elements_df[['first_name', 'second_name',
                                   'id', 'ict_index',
                                   'code',
                                   'team',
                                   'element_type', 'selected_by_percent', 'now_cost',
                                   'minutes', 'transfers_in',
                                   'ict_index_rank', 'ict_index_rank_type',
                                   'value_season', 'total_points', 'points_per_game']].copy()

Creating new features which will be used to create another ranking mechanism

Points per minute is a useful metric: if a player is a non-starter, we want to pick players who have the maximum output in the shortest amount of time played. Since the number of games played has not been provided, we divide minutes by 90 to estimate total matches played. Note this may not be 100% accurate, because substitute appearances count as a fraction of a match

trimmed_elements_df['points_per_minute'] = trimmed_elements_df['total_points']/trimmed_elements_df['minutes']
trimmed_elements_df['matches'] = trimmed_elements_df['minutes']/90
trimmed_elements_df['value_season'] =trimmed_elements_df['value_season'].astype(float)
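
One thing to watch with points per minute is players with zero minutes, where the division gives NaN or inf. Below is a minimal sketch of one way to guard against that, on invented numbers; the toy frame is hypothetical and this is an optional variation, not what the notebook does.

```python
import pandas as pd

# Invented stats: a regular starter, a one-off substitute, and an unused player.
toy = pd.DataFrame({"total_points": [120, 4, 0], "minutes": [2700, 45, 0]})

# Turn zero minutes into NaN first so the ratio is NaN instead of inf or 0/0.
played_minutes = toy["minutes"].where(toy["minutes"] > 0)
toy["points_per_minute"] = toy["total_points"] / played_minutes

# Estimated matches: minutes divided by a full 90-minute game.
toy["matches"] = toy["minutes"] / 90

print(toy)
```

Note how the one-off substitute ends up with the highest points per minute from a tiny sample, which is why a minimum-minutes filter is a sensible next step.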

Filtering down further to keep only players with more than 900 minutes in the season and a value_season, which is basically (total_points/cost), greater than 15

trimmed_elements_filtered = trimmed_elements_df[(trimmed_elements_df['minutes']>900)&(trimmed_elements_df['value_season']>15)].copy()

trimmed_elements_filtered.sort_values('value_season',ascending=False).head(11)[player_detail_cols]

Players with the fewest matches played: these players may be potential non-starters.

trimmed_elements_filtered.sort_values('matches').head()

Correlation between ICT Index & Total Points

The correlation between the ICT index and total points is fairly high (about 0.74). So it would be wise to pick players with both a high ICT index and a high value, to make the most of the total budget

trimmed_elements_filtered['ict_index'].astype(float).corr(trimmed_elements_filtered['total_points'])
0.7391533890286993
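
For anyone new to this step, here is a small sketch of the same calculation on invented data. Series.corr computes the Pearson correlation by default, and ict_index has to be converted from string first. The toy values are made up, and squaring r is an extra I added to show the share of variance explained.

```python
import pandas as pd

# Invented data; ict_index is a string column, as it is in the API response.
toy = pd.DataFrame(
    {"ict_index": ["10.0", "20.0", "30.0", "40.0"], "total_points": [50, 80, 90, 160]}
)

# Convert the strings to numbers before computing the correlation.
ict = pd.to_numeric(toy["ict_index"])
r = ict.corr(toy["total_points"])  # Pearson correlation by default

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```

Squaring the notebook's 0.739 gives roughly 0.55, so the ICT index accounts for a bit over half of the variation in total points among the filtered players. That makes it useful, but not the whole story.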

Making a column called season_played to find out players with a good probability of playing

Some of the columns are in string format; we need to convert them to float to make some calculations

trimmed_elements_filtered['points_per_game'] = trimmed_elements_filtered['points_per_game'].astype(float)

trimmed_elements_filtered['matches'] = trimmed_elements_filtered['matches'].apply(lambda x: 38 if np.ceil(x)>=38 else np.ceil(x))

trimmed_elements_filtered['season_played'] = trimmed_elements_filtered['matches']/38
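
The same rounding and capping can also be written without a lambda. Here is a minimal vectorized sketch on invented minutes; SEASON_GAMES and the toy frame are names I made up for illustration, and the result should match the apply version above.

```python
import numpy as np
import pandas as pd

SEASON_GAMES = 38  # matches per team in a Premier League season

# Invented minutes: full season, over the cap, a rotation player, a single cameo.
toy = pd.DataFrame({"minutes": [3420, 3500, 1000, 45]})

# Round partial matches up, then cap at the number of games in a season.
toy["matches"] = np.ceil(toy["minutes"] / 90).clip(upper=SEASON_GAMES)
toy["season_played"] = toy["matches"] / SEASON_GAMES

print(toy)
```

Because partial matches are rounded up, a 45-minute cameo counts as a whole match, so season_played leans towards overstating involvement for frequent substitutes.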

trimmed_elements_filtered.head()

Checking goalkeepers with high starts in the season

trimmed_elements_filtered[trimmed_elements_filtered['element_type']=='Goalkeeper'].sort_values(by=['season_played'],ascending=False).head(5)

Players who have played the whole season

trimmed_elements_filtered[trimmed_elements_filtered['season_played']==1].sort_values(['ict_index_rank_type'])

Player positions with highest matches played during the season

trimmed_elements_filtered[trimmed_elements_filtered['season_played']==1]['element_type'].value_counts()
Midfielder    6
Defender      3
Goalkeeper    3
Name: element_type, dtype: int64

The next step will be to build a budget optimizer and create a team with certain constraints. Will be back soon, stay tuned!