Evaluating Players in Association Football

Who’s really the best in soccer? For a sport as popular as association football, I’m surprised there aren’t a lot of numbers being used in its analysis. So I’ve gone ahead and tried my best impersonation of Bill James in coming up with one way to evaluate players.

Note: This Medium article is based on a Python notebook that you can check out on Kaggle. Because it comes straight from that notebook with only some formatting changes, the footnotes and such are a bit off compared to my more usual posts.

Photo by Fancy Crave on Unsplash

Sabermetrics is baseball statistics taken to an extreme, leveraging computers to evaluate players and teams in often unconventional ways. You can see this in the Oakland A’s 2002 season, when they won a then-American League record 20 consecutive games. The team and their approach are immortalized in the book Moneyball by Michael Lewis.

Analysis of association football, or soccer as I’ll call it, often leaves these stats out in favor of more traditional methods of evaluating players. I decided to crack open a recent Kaggle dataset and take a deeper look at evaluating these players.

First, we’ll start with goals

Tracking the number of goals scored is an important starting point, but goals don’t happen often enough in a game to give a statistically meaningful sample. Even a top-ranked team in the English Premier League like Manchester City scored only 80 goals in the entire 2016–2017 season. Over a 38-game season, that comes out to ~2.11 goals a game. Given the size of a squad, evaluating players solely on their goal production seems an inefficient way to judge them, especially if they don’t score often.

A better way to evaluate a player is to look at how often he attempts a goal and how often that particular shot-on-goal converts into a real goal. There should be enough data on attempts on goal to make this a meaningful number. It’s similar to the Corsi statistic used in evaluating hockey players.
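To make the idea concrete, here is a minimal sketch of the two numbers described above (attempts and conversion rate) computed per player. The players and outcomes are made up for illustration and do not come from the Kaggle data.

```python
import pandas as pd

# Toy shot log (made-up players and outcomes, not from the Kaggle data)
toy_shots = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B"],
    "is_goal": [1, 0, 0, 0, 1, 1],
})

# Shots taken, goals scored and conversion rate per player
per_player = toy_shots.groupby("player")["is_goal"].agg(shots="count", goals="sum")
per_player["conversion"] = per_player["goals"] / per_player["shots"]
print(per_player)
```

Player A takes more shots but converts fewer of them, which is exactly the distinction that raw goal totals hide.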

So this is a two-fold notebook. The first part looks at how shots convert into goals (and goals into games). The second looks at which types of shots on goal resulted in goals, and what percentage of the time.

From this, I could evaluate players based on their ability to take those types of shot. My intuition was that players who take shots on goal that rarely convert into goals are to be valued less than players who take shots on goal that are likely to convert.

Note that this is far from a perfect analysis. For instance, defensive players are not evaluated here (the data I have does not record defensive maneuvers, so they are more difficult to evaluate). I also made a lot of assumptions in order to make this work, which I’ll note when I can as the analysis progresses.

Thus, don’t take it as an end-all-be-all of soccer offensive player analysis, but instead take it as a starting place to look at player contributions in soccer teams.

# Imports -- get these out of the way
# It's considered good Pythonic practice to put imports at top
from random import randint
import os
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
from IPython.display import display

First Steps

First we need to import all of our data. I’ll assume you grabbed the data from Kaggle; you’ll need an account to do so.

I like Pandas for data manipulations done locally, so I’ll be using that. It’s a library that works similarly to R dataframes.

# Get Data and local file dictionaries
events = pd.read_csv("events.csv")
games = pd.read_csv("ginf.csv")
#--> This is a python-converted set of dictionaries from 'dictionary.txt' that I made
from event_dicts import *

Next we need to pull out the shots from the events. All attempts on goal are recorded in the ‘event_type’ column as the integer 1.

# Now we need to parse out _just_ the shots made from the events dictionary
shots = events[events['event_type']==1 ]

To start, we’ll first look at how many goals per game are needed on average by the winning team. We’ll assume, and then verify*, that this is approximately Gaussian.

* (at least visually, I know this isn’t as good as verifying for real but this is a fun analysis not an official one!)

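goals_winning_team = []
# go row by row (don't think a one liner would work)
# NOTE: the loop bodies were lost from the listing; the appends below are
# reconstructed from context (draws are skipped, the winner's goals are kept)
for i, game in games.iterrows():
    if game['fthg'] > game['ftag']:
        goals_winning_team.append(game['fthg'])
    elif game['fthg'] < game['ftag']:
        goals_winning_team.append(game['ftag'])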

avg_goals = np.mean(goals_winning_team)
std_goals = np.std(goals_winning_team)

print("Average Goals Per Game from Winning Team: %0.3f" % avg_goals)
print("Std. of Goals per game from winning team: %0.3f" % std_goals)
print("%% Deviation: +/- %0.3f%%" % ((std_goals/avg_goals)*100))
Average Goals Per Game from Winning Team: 2.393
Std. of Goals per game from winning team: 1.170
% Deviation: +/- 48.881%
# Plot this all
# ('normed' was removed from newer matplotlib releases; 'density' is the replacement)
plt.hist(goals_winning_team, bins=10, density=True)
Distribution of Goals Scored for a Winning Team
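The row-by-row loop can also be written without iterating. This is only a sketch on a made-up results table that borrows the `fthg` and `ftag` column names from `ginf.csv`; the variable names `toy_games`, `decided` and `goals_winning_team_vec` are mine.

```python
import pandas as pd

# Toy results table using the same column names as ginf.csv; scores are made up
toy_games = pd.DataFrame({"fthg": [2, 1, 0, 3], "ftag": [1, 1, 2, 0]})

# Drop draws, then take the larger of the two scores in each remaining row
decided = toy_games[toy_games["fthg"] != toy_games["ftag"]]
goals_winning_team_vec = decided[["fthg", "ftag"]].max(axis=1).tolist()
print(goals_winning_team_vec)  # [2, 2, 3]
```

On the full dataset this should give the same list as the loop, just faster, since `iterrows()` is slow on large frames.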

As expected, the distribution skews. Soccer is largely a low-scoring game after all. We can say that it takes 2.393 goals on average to win a game, give or take about 48.9% (one standard deviation). Thus, players who contribute that many goals above the average player in a season are to be valued as 1 “win above replacement.” This idea of evaluating above or below the average in the league was a method pioneered by Bill James and you can read more about it here.
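As a rough sketch of that conversion, the function below turns a goal surplus over some baseline player into wins. The function name and the example season are hypothetical; only the 2.393 figure comes from the output above.

```python
# Goals it takes to win a game on average, from the output above
GOALS_PER_WIN = 2.393

def wins_above_baseline(player_goals, baseline_goals, goals_per_win=GOALS_PER_WIN):
    """Convert a season's goal surplus over a baseline player into wins."""
    return (player_goals - baseline_goals) / goals_per_win

# Hypothetical season: 12 goals' worth of contribution against a baseline of 5
print(round(wins_above_baseline(12.0, 5.0), 2))  # 2.93
```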

Next, we’ll look at the share of shots that convert into goals.

# Average Shots that turned into Goals?
print("Shots: %d" % shots.shape[0])
print("Goals: %d" % shots[shots['is_goal']==True].shape[0])
print("Shots that convert: %0.2f%%" % (100*(shots[shots['is_goal']==True].shape[0]/shots[shots['event_type']==1].shape[0])))
Shots: 229135
Goals: 24441
Shots that convert: 10.67%

We could use this to say that each player’s shot on goal contributes 0.1067 goals, but that’s actually not very useful. It makes the big assumption that all shots are to be treated equally, whether long range from the middle of the field or close up for a penalty kick.

Lucky for us, our dataset distinguishes between the different types of shots. For instance, we can pull up the location on the pitch from which the shot was made. By picking apart these different factors, we can start to get a better sense of what each shot on goal is worth and evaluate the shots differently.
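Here is a small sketch of why a single league-wide rate hides a lot. The shots are made up, and the two location labels are only meant to resemble the dataset’s dictionary values.

```python
import pandas as pd

# Toy shots with a location label (made-up outcomes)
toy = pd.DataFrame({
    "location": ["Outside the box"] * 5 + ["Penalty spot"] * 4,
    "is_goal": [0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the conversion rate for that group
rate_by_location = toy.groupby("location")["is_goal"].mean()
print(rate_by_location)
```

The overall rate in this toy set is 4 in 9, yet neither location is anywhere near that number on its own.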

So we’ll start by first removing all shots in which the location was not mentioned, replacing the integer values in the CSV with the dictionary provided (that I converted separately), then concatenating them together.

# remove shots where the location is not recorded
# (.copy() avoids pandas' SettingWithCopyWarning when we modify the slice below)
shots = shots[ shots['location']!=19. ].copy()
# Replace all integer columns with places on the pitch
shots.replace({'location': location_dict,
               'shot_place': shot_place_dict,
               'bodypart': bodypart_dict,
               'assist_method': assist_method_dict,
               'situation': situation_dict}, inplace=True)

# Can one do multi-line commands in iPython notebooks? Not sure...
shots['uc'] = shots['location'] + ', ' + shots['shot_place'] + ', ' + shots['bodypart'] + ', ' + shots['assist_method'] + ', ' + shots['situation']

Next, we’ll get all the shots on goal for the unique combinations of words that come up. This unique combination will be its own column in the dataframe and will be an amalgamation of the columns mentioned above (this was technically done in the last cell). As a guess, we’ll say that we want more than 100 occurrences of that particular unique shot. * We’ll then evaluate all the other low-occurrence shots as being 0 in goal contribution. **

* This isn’t very rigorous as a way to do this, but as a first approximation I’d venture to say it’s good enough.

** This is a really poor first approximation. If I were to do this a second time around, I’d take all other types of unique shots and evaluate them as an “other” category.
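The “other” category from the second footnote could look something like this. It is a sketch on made-up data, not what the notebook actually does: `MIN_SHOTS` and `uc_grouped` are names I invented, and the threshold is tiny so the toy example has something to pool.

```python
import pandas as pd

# Toy shot log: 'uc' is the combined label, as in the article; data is made up
toy = pd.DataFrame({
    "uc": ["a"] * 4 + ["b"] * 3 + ["c", "d"],
    "is_goal": [1, 0, 0, 0, 1, 1, 0, 1, 0],
})
MIN_SHOTS = 3  # hypothetical threshold; the article's cut-off is around 100

# Relabel every combination seen fewer than MIN_SHOTS times as 'Other'
counts = toy["uc"].map(toy["uc"].value_counts())
toy["uc_grouped"] = toy["uc"].where(counts >= MIN_SHOTS, "Other")

# Rare shots now share one pooled conversion rate instead of being worth zero
pooled_rates = toy.groupby("uc_grouped")["is_goal"].mean()
print(pooled_rates)
```

The trade-off is that the pooled rate mixes very different shots together, but that is arguably less wrong than valuing them all at zero.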

unique_combos = shots['uc'].value_counts().to_dict()
unique_combos_that_scored = shots[ shots['is_goal']==True ]['uc'].value_counts().to_dict()
print("Total Unique Combos: %d" % len(unique_combos))
# We filter out keys that did not score
scoring_keys = [k for k in unique_combos.keys() if k in unique_combos_that_scored.keys()]
unique_combos = {key:unique_combos[key] for key in scoring_keys}
print("Total Unique Combos that Scored: %d" % len(unique_combos))

# We'll also filter for just the ones with more than 100 entries
# TO-DO: only return ones that have 100 _and_ at least 5 occurrences of a goal
unique_combos = {key:value for key,value in unique_combos.items() if value > 100}
print("Number of Unique Combos: %d" % len(unique_combos))
items = sorted(unique_combos.items(), key=lambda x: x[1], reverse=True)
# Although this will look ugly, it will give us a sense of how much data we have
plt.bar(range(len(unique_combos)), [i[1] for i in items], align='center')
plt.xticks(range(len(unique_combos)), [i[0] for i in items])
# What's the most frequent combination?
print("%r: %d" % (items[0][0],items[0][1]))
Total Unique Combos: 3488
Total Unique Combos that Scored: 1076
Number of Unique Combos: 120
Shots on Goal according to unique combination
'Outside the box, Centre of the goal, right foot, Pass, Open play': 3209

Lastly, we’ll build out the conversion percentage for each unique combo. We’ll look at how many times that unique combo was a goal, how many times it was a shot on goal, and then divide the number of goals by the number of shots. That number will be our rough estimate of the contribution of that ‘event’ as a fraction of a goal, with some error. *

* Not sure how to find that error at this moment. Anybody have any ideas?
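One common option for that error, offered here as a suggestion rather than something the notebook does, is to treat each shot as a coin flip and put a Wilson score interval around the observed conversion rate. The function name and the example numbers are hypothetical.

```python
import math

def wilson_interval(goals, shots, z=1.96):
    """Approximate 95% Wilson score interval for a conversion rate."""
    if shots == 0:
        raise ValueError("need at least one shot")
    p = goals / shots
    denom = 1 + z**2 / shots
    centre = (p + z**2 / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots + z**2 / (4 * shots**2)) / denom
    return centre - half, centre + half

# Hypothetical combo: 30 goals from 120 shots
low, high = wilson_interval(30, 120)
print("%0.3f to %0.3f" % (low, high))
```

The interval narrows as the shot count grows, which gives a more principled reason for a minimum-occurrence cut-off than a flat guess. It does assume shots within a combo are independent and equally likely to score, which is a simplification.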

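# We then look to see what the conversion percentages are on our unique combinations
shots_per_uc = unique_combos
goals_per_uc = shots[ shots['is_goal']==True ]['uc'].value_counts().to_dict()
l = []
for k in shots_per_uc.keys():
    s = shots_per_uc[k]   # shots taken with this combination
    g = goals_per_uc[k]   # goals scored with this combination
    f = float(g/s)        # conversion rate

    split_uc = k.split(", ")
    location = split_uc[0]
    shot_place = split_uc[1]
    bodypart = split_uc[2]
    assist_method = split_uc[3]
    situation = split_uc[4]
    # NOTE: the append and the full column list were cut off in the listing.
    # This is a reconstruction; the 'shots' and 'goals' column names are guesses.
    l.append([location, shot_place, bodypart, assist_method, situation, s, g, f])

df = pd.DataFrame(l, columns=['location', 'shot_place', 'bodypart',
                              'assist_method', 'situation',
                              'shots', 'goals', 'percentage_conversion'])
df.sort_values(axis=0, by=['percentage_conversion'],
               ascending=False, inplace=True)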

We’ll then display the dataframe and grab the most common values:

Shots on Goals, ordered by % they converted to goals

Note: I am not the world’s biggest soccer fan, so I actually don’t know what much of this means. Is it surprising that headers from corner kicks into the bottom right corner score 100% of the time in 214 shots? Keep in mind this set is a collection of 9,074 games across Europe’s first-division soccer leagues going back to ~2008. *

So what does this mean to a soccer player/manager/coach? It means that the types of events they should try to make happen are the ones that more than likely convert into goals. This might be a matter of tactics, whereby the squad tries its damnedest to create opportunities like this. It might also be a matter of training drills, whereby teams train on these sorts of conversion events so that attempting them becomes second nature when playing.

You could break these down for opposing teams as well to see what types of goals they are likely to perform (or not perform). Players can be broken down this way too. **

* Minus the Russian Premier League I believe. The information on the Kaggle page was a bit insufficient.

** I actually want to look at outlier teams and players, but that will be for another post.

Analyzing Players

Let’s now look at players.

Players, especially on offense, contribute goals that contribute to wins. The key assumption here is that nothing but goals matters. That is inaccurate, but it’s a place to start.

First, we’ll manipulate our dataframes to get a dictionary of values. Each conversion percentage counts as a fraction of a goal. We’ll use that to evaluate every shot made in our dataset as a fraction of a goal. From this, we can tally up each player’s contributions.

Note again that I assume every contribution for which there is insufficient data is scored as a big fat zero (0). This is not the best way to do this, but again as a starting point it’ll be fine.
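One compact way to sketch the “unknown shot types are worth zero” rule is a lookup with a default. The keys and rates below are invented for illustration; the notebook builds its own table from the real data.

```python
import pandas as pd

# Hypothetical conversion table for well-sampled shot types
known_rates = {"close/header": 0.30, "long/right foot": 0.04}

# Toy shots; 'rare/left foot' is not in the table
toy = pd.DataFrame({"key": ["close/header", "rare/left foot", "long/right foot"]})

# map() leaves unknown keys as NaN, and fillna(0.0) scores them as zero
toy["value"] = toy["key"].map(known_rates).fillna(0.0)
print(toy)
```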

# We'll rename our df from before into something more manageable
events = df
# Build that unique combo column back up.
events['key'] = events['location']+'/'+events['shot_place']+'/'+events['bodypart']+'/'+events['assist_method']+'/'+events['situation']
other_conversions = pd.Series(events['percentage_conversion'].values,index=events['key']).to_dict()
# Then we'll rank all the other events as zero as there's 
# insufficient information
shots['key'] = shots['location']+'/'+shots['shot_place']+'/'+shots['bodypart']+'/'+shots['assist_method']+'/'+shots['situation']
conversions = pd.Series([0 for i in range(shots['key'].shape[0])], index=shots['key']).to_dict()
# Then we combine them -- weird naming convention is to make this smooth
# (this update step was missing from the listing; without it every shot is valued at 0)
conversions.update(other_conversions)
# We assign value to every contribution
shots['value'] = shots['key'].apply(lambda x: conversions[x])
# Get all players into a list -- players who are not mentioned (i.e. NaN) are dropped
players = shots['player'].dropna().unique().tolist()
player_contributions_per_game = {}
player_games_number = {}

# Loop through the players and get their total contributions
for player in players:
    A = shots[ shots['player']==player ]
    games_played = A['id_odsp'].unique().tolist()
    contributions = 0
    for game in games_played:
        contributions += A[ A['id_odsp']==game ]['value'].sum()

    # We normalize the contribution such that it's _per-game_
    player_contributions_per_game[player] = float(contributions/len(games_played))
    player_games_number[player] = len(games_played)

You can probably see that this will nix players who do not record a shot on goal. This will eliminate many defensive position players (though not all). As a first proxy for just getting offensive players, I think it’s satisfactory.
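The per-player loop can be slow on a couple hundred thousand shots. A grouped version is sketched below on made-up data, reusing the `player`, `id_odsp` and `value` column names from above; `summary`, `total`, `games` and `per_game` are names I chose.

```python
import pandas as pd

# Toy valued shots using the article's column names; values are made up
toy = pd.DataFrame({
    "player": ["x", "x", "x", "y"],
    "id_odsp": ["g1", "g1", "g2", "g1"],
    "value": [0.10, 0.30, 0.20, 0.05],
})

# Total value and number of distinct games with a recorded shot, per player
summary = toy.groupby("player").agg(total=("value", "sum"),
                                    games=("id_odsp", "nunique"))
summary["per_game"] = summary["total"] / summary["games"]
print(summary)
```

As with the loop, “games” here means games in which the player recorded at least one shot, not games played, so the per-game figure is likely flattering for players who rarely shoot.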

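# Now we'll plot the average player's contribution
anonymous_contributions = list(player_contributions_per_game.values())
# (plot call reconstructed; it was missing from the listing)
plt.hist(anonymous_contributions, bins='auto')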
Histogram of Player Contributions
# So let's look at players who contribute more than 0 to get 
# a better sense of our data
A = [x for x in list(player_contributions_per_game.values()) if x > 0.0]

plt.hist(A, bins='auto')
Player Contributions without zero-contributors

What’s interesting about this data is that the vast majority of players contribute little to nothing per game.

Also of note: almost no players contribute more than 1.0 goals per game.

This indicates to me, a layman, that soccer is about developing and finding superstars. It confirms a suspicion I have that, although a team sport, it’s a supporting team sport: the team exists to support one or a group of star players. It’s similar to cycling and basketball (the latter I can’t quite confirm through numbers, but feel it intuitively).

We can take this two ways now: find either the average (mean) player or the middlemost (median) player. I personally think the median player will be more useful as it is less sensitive to outliers: the top players who score a lot will skew the distribution. You can see that to some degree in the plot above, as there are outliers that live above ~0.4. They will pull the average up higher than it should be.
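A quick toy example shows how one star drags the mean but barely touches the median. The numbers are made up purely for illustration.

```python
from statistics import mean, median

# Made-up per-game contributions: mostly small, with one star far above the rest
contributions = [0.02, 0.04, 0.05, 0.06, 0.07, 0.08, 0.90]

print("mean:   %0.3f" % mean(contributions))
print("median: %0.3f" % median(contributions))
```

The mean lands well above what six of the seven players actually contribute, while the median sits right in the middle of the pack.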

Nonetheless, I’ll pull all numbers and you can get a base of comparison as you like it.

war = np.median(anonymous_contributions)
print("Median Player's Contribution: %f" % war)
print("Mean Player's Contribution: %f" % np.mean(anonymous_contributions))
print("Mean Player's Contribution (zeros dropped): %f" % np.mean(A))
Median Player's Contribution: 0.065267
Mean Player's Contribution: 0.088393
Mean Player's Contribution (zeros dropped): 0.106900

Now for fun, let’s look at the top players by contribution!

We’ll filter out people who have played fewer than 10 games as they likely have insufficient data to properly judge.

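# Let's look at the top players in terms of contributions 
B = [(v,k) for k,v in player_contributions_per_game.items()]
B = sorted(B, key=lambda x: x[0], reverse=True)
l = []
for i in B:
    # Skip players with fewer than 10 games
    if player_games_number[i[1]] < 10:
        continue
    #print("%r: %f, played %d games" % (i[1],i[0],player_games_number[i[1]]));
    l.append([i[1], i[0], player_games_number[i[1]]])

# NOTE: the loop body and column list were truncated in the listing;
# this is a reconstruction and the column names are placeholders
df = pd.DataFrame(l, columns=['player', 'contribution_per_game', 'games_played'])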
Top players by Contribution per Game

As you can see, the top players are in line with whom we’d expect. Cristiano Ronaldo is considered by many to be the best ever so having him at the top makes a lot of sense.

The one outlier is Jorginho Frello, who played only 11 games in this dataset. That makes sense, as there isn’t enough data to properly judge him. I actually find it surprising he’s the only outlier here.

Also not surprising is that the majority of players listed are offensive position players. We wouldn’t expect to find defensive players or goalkeepers here; I don’t believe either is actually in the dataset.

Final Thoughts

So what do you make of this?

I think sports are most fun when you pretend to play manager, at least for someone with a more intellectual bent. Evidently a lot of people agree. So one could use this data to evaluate offensive players, with a little bit of work to eliminate my poor assumptions.

You could also use it to evaluate future rookies. If a particular rookie scores more often from some position that is unexpected, that might be a sign he should be played differently than another player.

Formations could possibly be evaluated this way as well. Certain formations will favor certain goal scoring opportunities, so you could play your formation differently for your team or against another team who favors some particular scoring opportunity. Alternatively, your players might be outliers who score better under some circumstances vs others. I’m not sure if there is necessarily enough data in this set to do that, but I’m sure there is enough data somewhere to do that.