How open is English Premier League football data?

Having discovered Project Red Card, I started wondering who owns performance data from football matches, and how it is captured and processed into something useful. Fairly quickly I found myself plotting pass maps using freely available data from StatsBomb. But how did I get there?

My first pass map, generated with StatsBomb

It’s funny where browsing can lead you. I was Googling Russell Slade, one of our old managers at Brighton and Hove Albion, and came across something he’s involved in called Project Red Card. This is an action group which argues that, under UK and EU data protection laws, players’ performance data is actually personal data, which clubs and leagues need players’ explicit consent to use. The argument is that this consent may not yet have been given, and that ownership by clubs and leagues has just been assumed.

Who owns player data?

Of course, the counterargument is that players’ performance data is not personal. Their on-pitch performance, in terms of position, touches of the ball and so on, is part of their work as clubs’ employees and so falls under the employers’ ownership. It is also in the public domain once viewed, whether live or via broadcast, regardless of what intellectual property (IP) rights and copyright restrictions apply to how that data is processed or distributed.

What’s certainly a greyer area, in the UK at least, is biometric data. Individuals and clubs in many sports already use this, most often captured via wearables, to understand the impact of performance. However, as far as I know, clubs and leagues have not yet universally tried to claim ownership of it with a view to commercialising it. In the US it’s different, as both the National Football League (NFL) and the National Basketball Association (NBA) have a level of ownership over this data, more so in the NFL.

The Premier League’s (PL) IP

Licensing of the PL’s IP is managed by Football Dataco, a company owned by the Premier League and the English Football League. It also manages licensing of the Scottish Football League’s intellectual property.

As far as the Premier League goes, rights to collect, license and distribute data are split two ways: data for betting and data for media. Up until the start of 2019–20, both were managed by the Stats Perform group, who own Opta (media) and RunningBall (betting). However, the betting side is now managed by BetGenius, part of Genius Sports, who won a five-year deal in 2019 for an undisclosed sum. This agreement involves a minimum guarantee plus revenue share.

What game play data is tracked in Premier League matches?

Data currently collected and distributed within the Premier League is, according to its website, largely focussed on who has the ball. Of course, in the modern game, success often derives from what teams do when they don’t have the ball. This underscores one of the most beautiful elements of football: the best, strongest, most talented team doesn’t always win, whereas in other team ball sports it usually does. The example that always springs to mind for me is Celtic vs Barcelona in the 2012 Champions League. With 27% possession, 134 completed passes to Barcelona’s 865, 2 corners to their 7, and 4 shots on target to their 14, Celtic won 2–1.

Creating more meaningful data

Results like this seem to be what drove StatsBomb to do something different. (I think I’ve read that the CEO, Ted Knutson, has said people thought he was mad to do so, given how tightly the data and its distribution are managed, and how widely the existing data is used and accepted.)

“We devised and built a brand new, proprietary dataset because we knew we could do better than what was out there!”

Much of what they are currently developing is focussed on tracking how the team without the ball presses the team in possession. They also do more to make the xG (expected goals) metric more robust, by looking in more detail at the anatomy of shots and the positions of attackers, defenders and goalkeepers.
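To make the idea of shot anatomy a little more concrete, here is a minimal Python sketch of two geometric features that public xG models commonly start from: the distance to goal and the angle the goal mouth presents to the shooter. It is not StatsBomb’s model, which also accounts for where defenders and the goalkeeper are standing. The pitch layout used here (120 x 80 units, with the posts at y = 36 and y = 44) is an assumption based on the coordinate system in StatsBomb’s open-data spec, and shot_features is a hypothetical helper name.

```python
import math

# Assumed pitch model: 120 x 80 units, attacking towards x = 120,
# with the goalposts at (120, 36) and (120, 44).
GOAL_X = 120.0
POST_LOW_Y = 36.0
POST_HIGH_Y = 44.0


def shot_features(x, y):
    """Return (distance to goal centre, goal-mouth angle in degrees) for a shot at (x, y)."""
    goal_centre_y = (POST_LOW_Y + POST_HIGH_Y) / 2
    distance = math.hypot(GOAL_X - x, goal_centre_y - y)
    # Angle between the two lines running from the shot location to each post.
    angle_to_low = math.atan2(POST_LOW_Y - y, GOAL_X - x)
    angle_to_high = math.atan2(POST_HIGH_Y - y, GOAL_X - x)
    opening_angle = abs(angle_to_high - angle_to_low)
    return distance, math.degrees(opening_angle)


# A central shot from 12 units out.
distance, angle = shot_features(108.0, 40.0)
print(f"distance={distance:.1f}, angle={angle:.2f} degrees")
# prints: distance=12.0, angle=36.87 degrees
```

A shot from the same distance but out wide sees a much narrower angle, which is one reason a bare shot count says so little about chance quality.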

What I’d like to learn more about is how StatsBomb harvest this data, as I’m sure they don’t get the same sort of stadium access that Stats Perform or Genius Sports get. Current knowns, unknowns and assumptions are as follows:

  1. Data is harvested from televised footage. The potentially hugely scalable way they do this is described here. What I don’t understand is whether doing so amounts to creating a derivative work, which might then infringe the rights owners’ copyright.
  2. Licensing Opta data. This is implied in the following article, which references an Opta dataset. If this is the case, my guess would be that StatsBomb license this via OptaPro.

Opportunities for StatsBomb within British football leagues

With British (and perhaps other major EU) league partnership options seemingly narrow, the focus looks to be on individual clubs, enabling them to use the data to recruit better, improve performance and analyse opponents more effectively. If the metrics a club builds its analysis on sit on top of a different, richer dataset, then that club potentially gets an advantage over any club that uses the same data as everyone else. StatsBomb also point out that the data alone can’t drive differentiation; people are key too, as is alluded to here and here.

Learning how to work with StatsBomb’s open data using R

I’m a strong believer in companies opening up access to their data and code to help others learn and improve, so I was interested to find that StatsBomb have made some of their data openly available, with comprehensive documentation of the data spec. As such I was keen to jump in and have a play.

There are a few useful resources on how to use this on the StatsBomb site, here, here and here. However, I worked from BiscuitChaserFC’s guide here. The most painful part of the process for me was getting devtools working in RStudio. I don’t run Windows, and on macOS I kept getting an error when I tried to install StatsBombR via devtools. In the end I got it working on Ubuntu 20.04, despite having to work through a whole load of dependency issues. From there on in, running the code was pretty easy.

Take a look at this 9-minute video which takes you through the process of building a pass map with the StatsBomb data.
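For anyone who wants to see the shape of the data before installing anything, here is a minimal sketch of the step that sits underneath a pass map: pulling one team’s passes, with start and end points, out of event records. It is written in Python rather than R so that it runs without StatsBombR, and it is not the code from the guide or the video. The sample events are invented for illustration. The field names (type, team, location, pass.end_location, pass.outcome) follow the open-data event spec as I understand it, and extract_passes is a hypothetical helper name.

```python
import json

# Tiny hand-written sample in the shape of open-data event records.
# The values are invented; a real match file holds thousands of events.
SAMPLE_EVENTS = json.loads("""
[
  {"type": {"name": "Pass"}, "team": {"name": "Team A"},
   "location": [60.0, 40.0],
   "pass": {"end_location": [75.0, 30.0]}},
  {"type": {"name": "Pass"}, "team": {"name": "Team A"},
   "location": [75.0, 30.0],
   "pass": {"end_location": [90.0, 50.0], "outcome": {"name": "Incomplete"}}},
  {"type": {"name": "Shot"}, "team": {"name": "Team A"},
   "location": [105.0, 38.0]},
  {"type": {"name": "Pass"}, "team": {"name": "Team B"},
   "location": [20.0, 10.0],
   "pass": {"end_location": [35.0, 12.0]}}
]
""")


def extract_passes(events, team_name):
    """Return one dict per pass by team_name: start, end and a completion flag."""
    passes = []
    for event in events:
        if event.get("type", {}).get("name") != "Pass":
            continue
        if event.get("team", {}).get("name") != team_name:
            continue
        details = event.get("pass", {})
        passes.append({
            "start": tuple(event["location"]),
            "end": tuple(details["end_location"]),
            # Assumption from the spec: a pass with no "outcome" key was completed.
            "completed": "outcome" not in details,
        })
    return passes


team_a_passes = extract_passes(SAMPLE_EVENTS, "Team A")
for p in team_a_passes:
    status = "completed" if p["completed"] else "incomplete"
    print(p["start"], "->", p["end"], status)
```

Each returned dict is one arrow on a pass map: draw a line from start to end and colour it by the completed flag.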

I’d be interested in comments on this article, especially any thoughts around Project Red Card and players’ rights to performance and biometric data, as well as how IP and copyright of match data for football is managed.

Digital decisions are never a walk in the park, so please get in touch and let me help you find the right way through the technical landscape.
Dad and Husband who loves the great outdoors. Own @miggle, digital product management consultancy.