Location, Location, Location

Lyon Van Voorhis
Published in LyonStats
Nov 4, 2018 · 4 min read

Soccer is a game of positioning. Scoring goals is the primary objective and the best way to score goals is to move the ball into as advantageous a position as possible before having a shot. But where are these valuable positions? As I start to dig into the World Cup data I’ve gotten from StatsBomb (found here), I keep returning to this question. Today, I’m going to walk through my first attempt to calculate the discrete values of parts of the pitch, using R for the data manipulation and the ggplot2 package for my visualizations.

Again, the data I’m using comes from StatsBomb and includes a treasure trove of event data from all 64 World Cup games. They recorded things like shots, passes, and tackles, and every event includes time, player, and player location, among many other more arcane details specific to the various play types. One of those details is StatsBomb’s expected goals (xG) calculation for each shot attempt (this explains some of their methodology). This xG number is crucial to how I’ll go about determining locational value.

The first step I took was grouping the events into the more than 11,000 possessions (a new possession starts every time a new team takes control of the ball) and summing the xG for each one. For possessions that didn’t end in a shot (and therefore had no xG value), I set the xG value to 0.
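A minimal sketch of this step, written in Python rather than the R used for the analysis. The field names (`match_id`, `possession`, `xg`) and the sample records are made up for illustration and are not the actual StatsBomb schema.

```python
from collections import defaultdict

# Hypothetical, simplified event records; real StatsBomb events carry many more fields.
events = [
    {"match_id": 1, "possession": 1, "type": "Pass", "xg": None},
    {"match_id": 1, "possession": 1, "type": "Shot", "xg": 0.20},
    {"match_id": 1, "possession": 2, "type": "Pass", "xg": None},
    {"match_id": 1, "possession": 3, "type": "Shot", "xg": 0.05},
    {"match_id": 1, "possession": 3, "type": "Shot", "xg": 0.10},
]

def possession_xg(events):
    """Sum shot xG per (match, possession); possessions with no shot end up at 0."""
    totals = defaultdict(float)
    for event in events:
        key = (event["match_id"], event["possession"])
        # Non-shot events have no xG, so they contribute 0 but still register the possession.
        totals[key] += event["xg"] or 0.0
    return dict(totals)

totals = possession_xg(events)
```

Keying on the match as well as the possession number is an assumption here: it guards against possession counters restarting in every game.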

Now that I had calculated the xG value for each possession, I then attached that possession value to every ball possession event in each possession (I defined a ball possession event as a Pass, Dribble, Shot, or Ball Receipt). StatsBomb gives each event a location on a 120 x 80 grid. This heat map shows the density of the 118,000 ball possession events that will be included in my model.
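As a rough illustration (again in Python, not the original R), attaching the possession value to each ball possession event could look like the sketch below. The event type labels follow the wording above, and the field names and sample data are assumptions.

```python
# Event types treated as "ball possession events", using the labels from the text.
BALL_EVENT_TYPES = {"Pass", "Dribble", "Shot", "Ball Receipt"}

# Hypothetical per-possession xG totals, keyed by (match_id, possession).
possession_totals = {(1, 1): 0.20, (1, 2): 0.0}

# Hypothetical, simplified events.
events = [
    {"match_id": 1, "possession": 1, "type": "Pass"},
    {"match_id": 1, "possession": 1, "type": "Pressure"},  # not a ball possession event
    {"match_id": 1, "possession": 1, "type": "Shot"},
    {"match_id": 1, "possession": 2, "type": "Dribble"},
]

def tag_ball_events(events, possession_totals):
    """Keep only ball possession events and copy the possession's xG onto each one."""
    tagged = []
    for event in events:
        if event["type"] not in BALL_EVENT_TYPES:
            continue
        key = (event["match_id"], event["possession"])
        tagged.append({**event, "possession_xg": possession_totals[key]})
    return tagged

tagged = tag_ball_events(events, possession_totals)
```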

Lighter areas indicate more activity. The right side is the attacking side.

The activity pattern’s horizontal symmetry is striking. Based on this, and in order to increase the density of data points, I “folded” the field in half: I placed all events on a 120x40 grid, assigning the “xlocation” (confusingly enough, placed on the Y axis in the visualization above) a value of 81-x if it is greater than 40.
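The fold itself is a one-line rule. Here is a small Python sketch of it (the analysis was done in R), assuming the coordinate across the width of the pitch runs from 1 to 80; the function name is hypothetical.

```python
def fold(width_coord, pitch_width=80):
    """Mirror the far half of the pitch onto the near half (81 - x for x > 40)."""
    half = pitch_width // 2
    if width_coord > half:
        return pitch_width + 1 - width_coord
    return width_coord

# Points on one half map onto their mirror image on the other half.
folded = [fold(x) for x in (1, 40, 41, 80)]
```

With this rule, 41 lands on 40 and 80 lands on 1, so every point on one half shares a cell with its mirror image.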

With this done, I then calculated the mean xG for possessions that passed through all locations on the “folded” field. For example, if 10 ball possession events occurred at point 36, 58 and those possessions culminated in a total of one shot with an xG of .20, the locational xG of point 36, 58 would be .02. After “unfolding” the field, this is the map that results, with lighter color marking higher average xG:
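The worked example above can be reproduced with a short Python sketch (not the original R code). It assumes each event already carries its folded location and the xG of the possession it belongs to; the field names are invented for illustration.

```python
from collections import defaultdict

def location_values(tagged_events):
    """Average the possession xG over all ball possession events at each location."""
    xg_sum = defaultdict(float)
    event_count = defaultdict(int)
    for event in tagged_events:
        cell = (event["x"], event["y"])
        xg_sum[cell] += event["possession_xg"]
        event_count[cell] += 1
    return {cell: xg_sum[cell] / event_count[cell] for cell in event_count}

# Ten events at point (36, 58): one from a possession worth 0.20 xG,
# nine from possessions that never produced a shot.
sample = [{"x": 36, "y": 58, "possession_xg": 0.20}]
sample += [{"x": 36, "y": 58, "possession_xg": 0.0} for _ in range(9)]

values = location_values(sample)
```

Here `values[(36, 58)]` works out to 0.20 / 10 = 0.02, matching the example.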

Note the high density at the PK spot.

Clearly the highest-value real estate is right by the goal. Until you’re getting close to the box, there is very little change in color across the rest of the field. A note here: the blank spaces indicate that there was not a single ball possession event in that location (on either side of the field, thanks to the “folding”) over the course of the World Cup. The penalty spot also jumps out; it turns out that StatsBomb tags penalty kicks as separate possessions. I’ve since removed PK possessions from my model.

To create a model that covers the entire field, I decided to “smooth” my results, using the values of neighboring points to adjust the value of every point. This was done to dampen the effect of individual outliers, as the dataset is too small for the noise to average out at every location on its own.

For each point, I added its xG to the total xG of the eight (or fewer if on the sideline or in the corner) points surrounding it, then divided that sum by the count of ball possession events that took place at that point or at a neighbor to create a smoothed value. This is the heat map that results:
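One way to express this neighbor smoothing is the Python sketch below (the analysis itself was done in R). It assumes per-location xG sums and event counts are stored in dictionaries keyed by (column, row) with 1-based coordinates; those structures and names are illustrative, not the original ones.

```python
def smooth(xg_sum, event_count, n_cols, n_rows):
    """Pool each cell with its up-to-eight neighbors: total xG / total event count."""
    smoothed = {}
    for col in range(1, n_cols + 1):
        for row in range(1, n_rows + 1):
            total_xg = 0.0
            total_events = 0
            for d_col in (-1, 0, 1):
                for d_row in (-1, 0, 1):
                    cell = (col + d_col, row + d_row)
                    # Cells off the grid or without events are simply missing: they add 0.
                    total_xg += xg_sum.get(cell, 0.0)
                    total_events += event_count.get(cell, 0)
            if total_events > 0:
                smoothed[(col, row)] = total_xg / total_events
    return smoothed

# Tiny 4x4 example: two cells have events, only one saw any xG.
xg_sum = {(2, 2): 0.2}
event_count = {(2, 2): 2, (1, 1): 2}
smoothed = smooth(xg_sum, event_count, n_cols=4, n_rows=4)
```

Because the pooling is weighted by event counts, a cell with no events of its own (such as (3, 3) here) still receives a value from its neighbors, while a cell with no events anywhere nearby stays blank.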

Location Value v1

While definitely a v1, this gives us a good place to start from, and already, there are some clear takeaways. Obviously, possessions directly in front of the net are extremely profitable. The deep corners of the box are not particularly so.

This type of model opens up a lot of questions. Whose passes put their team in the best location to score? Whose dribble attacks advanced into the most dangerous areas? Who turned the ball over in the most dangerous places? I’ll continue to tweak the model, but I want to jump in and start to see what this model tells us about the 2018 WC.

All underlying event data used in this piece came from StatsBomb
