How Important Is Defense?

Eric Schmidt
Analyzing the World Cup using Google Cloud
4 min read · Jul 13, 2018

Authored by: Ramzi BenSaid

Welcome back to our analysis of the World Cup using Google Cloud. We’ve previously written about topics like our data coverage, expected goals models and a whole bunch of games. In this post we’re going to talk about the features we’re using for our player-based model, how we created them and which ones have proven most powerful when it comes to predicting game outcomes.

Before we dive in here — our player related features are built on the foundation of our team ratings, detailed here if you’re in need of a refresher.

We’ll start by summarizing what we have in terms of data across each league we have coverage for:

  • Goals (per game per player)
  • Expected goals (per game per player)
  • Team ratings (offensive and defensive, per game per team)
  • Positional alignment
  • Minutes played

A quick Google search for “adjusted plus minus soccer” will turn up articles highlighting the holes this metric has from a predictive standpoint, so we knew we needed to do more than adjusted plus/minus. We were also wary of going forward with something like adjusted xG plus/minus, so we set out to build a player rating system of our own.

We started with our offensive and defensive team ratings as a baseline and distributed those scores amongst the players on the field.

Offense

From the offensive standpoint, a team’s score was broken down using the following stats:

  • Personal goals scored
  • xG of shots taken
  • Minutes played

In the end, every field player received a portion of their team’s score based on how much of the game they played. The remaining score was distributed to players based on their contribution to the team’s goals and xG. This score was then multiplied by a league coefficient so that a particular score in the Champions League would outweigh the same score from a game in the English Championship.
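The split described above can be sketched as follows. The 50/50 weighting between minutes played and goal/xG contribution, the field names and the helper itself are all assumptions — the post doesn’t give exact weights:

```python
def distribute_offensive_score(team_score, players, league_coef, minutes_share=0.5):
    """Split a team's offensive rating among its field players.

    Each player first receives a slice of `minutes_share` of the team
    score proportional to minutes played; the remainder is split by the
    player's share of the team's goals + xG. Finally everything is
    scaled by a league coefficient.
    """
    total_minutes = sum(p["minutes"] for p in players)
    total_contrib = sum(p["goals"] + p["xg"] for p in players)
    scores = {}
    for p in players:
        time_part = minutes_share * team_score * p["minutes"] / total_minutes
        contrib_part = (1 - minutes_share) * team_score * (
            (p["goals"] + p["xg"]) / total_contrib if total_contrib else 0
        )
        scores[p["name"]] = (time_part + contrib_part) * league_coef
    return scores

# Toy example: two players with equal minutes but unequal contribution.
players = [
    {"name": "A", "minutes": 90, "goals": 1, "xg": 0.8},
    {"name": "B", "minutes": 90, "goals": 0, "xg": 0.2},
]
scores = distribute_offensive_score(2.0, players, league_coef=1.0)
```

Note that the player scores always sum back to the (coefficient-scaled) team score, so no rating is created or destroyed in the split.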

Defense

Defense posed a different challenge, as we don’t have individual defensive stats for the majority of the leagues we cover. Instead, a team’s defensive score was broken down using the following:

  • Minutes played
  • Positional alignment

Again, every field player received a portion of their team’s score as a function of how much time they played. However, the secondary breakdown here was based solely on a player’s position. To do this, we looked at the 23 possible formations as classified by Opta and assigned each position to defense, midfield or forward. This is far from a perfect science, as certain coaches use particular players or formations totally differently. The case could be made that N’Golo Kante is more important for France defensively than any of their back four — but that’s a conversation for a later date.
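A minimal sketch of that two-stage split, assuming hypothetical position-group weights (the post doesn’t publish the actual ones) in which defenders absorb more of the defensive score than midfielders, who absorb more than forwards:

```python
# Hypothetical weights: within the non-minutes share, defenders carry
# more of the team's defensive score than midfielders or forwards.
POSITION_WEIGHTS = {"defense": 3.0, "midfield": 2.0, "forward": 1.0}

def distribute_defensive_score(team_score, players, minutes_share=0.5):
    """Split a team's defensive rating first by minutes played, then by
    position group (the weights above are illustrative assumptions)."""
    total_minutes = sum(p["minutes"] for p in players)
    total_weight = sum(POSITION_WEIGHTS[p["group"]] * p["minutes"] for p in players)
    scores = {}
    for p in players:
        time_part = minutes_share * team_score * p["minutes"] / total_minutes
        pos_part = (1 - minutes_share) * team_score * (
            POSITION_WEIGHTS[p["group"]] * p["minutes"] / total_weight
        )
        scores[p["name"]] = time_part + pos_part
    return scores

# Toy lineup: one player per group, all playing the full match.
lineup = [
    {"name": "CB", "group": "defense", "minutes": 90},
    {"name": "CM", "group": "midfield", "minutes": 90},
    {"name": "ST", "group": "forward", "minutes": 90},
]
scores = distribute_defensive_score(1.2, lineup)
```

With equal minutes, the defender ends up with the largest slice and the forward the smallest, while the slices still sum to the team score.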

Goalkeeping

Our goalkeeping score was built independently of the team scores — team scores are a function of a team’s xG, among other things, but with goalkeepers we are only concerned with the xG that reaches their net. We built an expected goals model specifically to assess goalkeeper performance and used those results here. The goalkeeper score is a function of the xG faced and the goals conceded.
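One natural form for such a score is goals prevented relative to expectation. The exact function the post uses isn’t specified, so the difference below is an assumption — just a simple, common choice:

```python
def goalkeeper_score(xg_faced, goals_conceded):
    """Goals prevented above expectation: positive when the keeper
    concedes fewer goals than the xG of the shots reaching the net
    would predict, negative when they concede more."""
    return xg_faced - goals_conceded

# A keeper who faced 2.5 xG but conceded only once outperformed expectation.
good_night = goalkeeper_score(xg_faced=2.5, goals_conceded=1)
bad_night = goalkeeper_score(xg_faced=1.0, goals_conceded=2)
```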

Our player ratings were calculated with a combination of aggregations in BigQuery and Python. The ratings were then loaded into a new BigQuery table where we could turn them into game prediction features.

Creating Features

First we created various moving averages of each player’s offensive and defensive ratings at the following window sizes:

  • Last 3 games
  • Last 5 games
  • Last 10 games
  • Last 20 games
  • Entire career

Rather than look at each player in the starting XI as a separate group of features, we chose to assess offensive and defensive stats within the same positional groups used to create the defensive ratings: goalkeepers, defense, midfield and forwards. Each position in a team’s starting lineup was looked up and assigned to one of these groups, and the moving average across all players in that group was then calculated. In the end we had over 70 player-centric features to explore, belonging to seven different categories.
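The windowed averages and positional roll-up can be sketched with pandas. Column names, values and the single window size shown are illustrative, not the actual schema:

```python
import pandas as pd

# Toy per-player, per-game ratings (illustrative, not the real data).
df = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B"],
    "group":  ["defense"] * 3 + ["midfield"] * 3,
    "game":   [1, 2, 3, 1, 2, 3],
    "def_rating": [0.4, 0.6, 0.5, 0.2, 0.3, 0.1],
})

# Per-player moving average (window of 3 shown; the post also uses
# 5-, 10-, 20-game and career-to-date windows).
df["def_ma3"] = (
    df.sort_values("game")
      .groupby("player")["def_rating"]
      .transform(lambda s: s.rolling(3, min_periods=1).mean())
)

# Collapse player-level averages into one feature per position group
# per game by averaging within the group.
group_features = df.groupby(["game", "group"])["def_ma3"].mean().unstack()
```

In the real pipeline this would be repeated for each rating, each window size and each position group, which is how the feature count climbs past 70.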

With all of these features assembled, we shuffled our data and split it into training and testing sets. Next we scaled our features (all but the binary is_neutral field).
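That preprocessing step looks roughly like the following with scikit-learn; the synthetic arrays and the 75/25 split are stand-ins for the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real features (shapes/names are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))             # continuous features
is_neutral = rng.integers(0, 2, size=(200, 1)).astype(float)  # binary flag, left unscaled
y = rng.integers(0, 3, size=200)                              # t1_win / t2_win / tie

# Shuffle and split, keeping the binary column aligned with the rest.
X_tr, X_te, n_tr, n_te, y_tr, y_te = train_test_split(
    X, is_neutral, y, test_size=0.25, random_state=42
)

# Fit the scaler on training data only, then re-attach is_neutral unscaled.
scaler = StandardScaler().fit(X_tr)
X_tr_s = np.hstack([scaler.transform(X_tr), n_tr])
X_te_s = np.hstack([scaler.transform(X_te), n_te])
```

Fitting the scaler on the training split only is the standard way to avoid leaking test-set statistics into the model.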

We then iterated through our features to order them in terms of importance. Each subsequent feature was added to the list of previously selected features and used to fit a multi-output logistic regression model with scikit-learn. As we did with our team-related features, we then ran the model on the held-out test set and evaluated performance by taking the highest predicted probability (t1_win, t2_win or tie) and deeming the prediction correct if that result happened.
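A minimal sketch of that forward-selection loop, assuming a plain multinomial LogisticRegression stands in for the post’s multi-output model and test accuracy is the selection criterion:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def greedy_forward_selection(X_tr, y_tr, X_te, y_te, n_features):
    """At each step, add the feature whose inclusion yields the best
    test accuracy, where the predicted class is the one with the
    highest probability (t1_win, t2_win or tie)."""
    selected = []
    remaining = list(range(X_tr.shape[1]))
    order = []
    for _ in range(n_features):
        best_feat, best_acc = None, -1.0
        for f in remaining:
            cols = selected + [f]
            model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
            acc = (model.predict(X_te[:, cols]) == y_te).mean()
            if acc > best_acc:
                best_feat, best_acc = f, acc
        selected.append(best_feat)
        remaining.remove(best_feat)
        order.append((best_feat, best_acc))
    return order

# Synthetic demo: feature 0 is made clearly informative, so it should
# be selected first.
rng = np.random.default_rng(0)
y_all = rng.integers(0, 3, size=400)
X_all = rng.normal(size=(400, 3))
X_all[:, 0] += 2.0 * y_all
order = greedy_forward_selection(X_all[:300], y_all[:300], X_all[300:], y_all[300:], 2)
```

The returned order is exactly the feature ranking the post describes: the first entry is the single most predictive feature, the second is the best addition to it, and so on.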

The most predictive feature was the ten-game moving average of the defense’s defensive score. This metric alone correctly predicted 43.0% of games in our test set. The next few ranked stats were different window sizes of the same category, so instead of listing them all out, we’ll show the ranks of the stat categories:

  1. Defense’s defensive scores
  2. Midfield’s offensive scores
  3. Midfield’s defensive scores
  4. Strikers’ offensive scores
  5. Goalkeeping scores
  6. Defense’s offensive scores
  7. Strikers’ defensive scores

We were a bit surprised that strikers’ offensive scores weren’t in the top three, because an elite goalscorer can go out and win any given game — but if you don’t concede goals, you can’t lose. What’s that old saying about defense winning championships?
