Data integrity and why sub-effects in football may be overstated by bias in our data.

It’s a footballing truism that a fresh pair of legs can have a big impact on a game, and this was borne out by the data two years ago in Colin Trainor’s work over at StatsBomb. However, this brief exchange with @SteMc74 on Twitter, and then the following post by James Curley, got me thinking: why did my data differ from Ste Mc’s, and what were the potential ramifications of this?

My first port of call was to investigate Aguero’s minutes played in the Premier League this season as my data differed from Ste Mc’s by 11 minutes over a ~810 minute sample.

In Aguero’s first game of the season, he came on as a substitute at The Hawthorns 62 minutes and 40 seconds into the game. Key data providers in the public domain such as WhoScored, Squawka and TransferMarkt all credit Aguero with playing 27 complete minutes of football. But the game did not stop at 90:00; it ended after 93 minutes exactly, as reported by the BBC, who like WhoScored and Squawka use OPTA data but present it differently. This means that Aguero actually played 30 minutes and 20 seconds of football during the match, which is about 12% more football than the big data providers report.

In Aguero’s second game, he was substituted off and everyone agrees he played 83 minutes of football. In his fourth game of the season, against Watford, WhoScored et al agree he played the full 90 minutes, but really it was the full 92 minutes and 5 seconds, as noted by the BBC. This pattern is repeated over and over: some providers stop the clock at 90:00, whereas others acknowledge that the football continues after 90 minutes. The result is that one data provider says Aguero has played 805 minutes of football and another says he has played 816.

The most extreme example of this discrepancy this season was the Bournemouth versus Everton game which finally finished after 98 minutes and 24 seconds. In this game Aaron Lennon came on at 85:12 and saw three goals scored whilst he was on the pitch. Some data providers credit Lennon with having been on the pitch for just four minutes instead of the 13 minutes and 12 seconds he really played. They didn’t even credit him with the 48 seconds between 89:12 and 90:00.
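The arithmetic behind the Aguero and Lennon examples can be sketched in a few lines of Python. The function names here are my own, and the "capped" rule is only an assumption: I am guessing that the providers record a substitution in whole match minutes and stop the clock at 90, which happens to reproduce the 27 and 4 minutes quoted above. It is not a documented provider method.

```python
import math


def to_seconds(clock: str) -> int:
    """Convert a match clock string such as '62:40' into seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def actual_minutes(came_on: str, final_whistle: str) -> float:
    """Minutes genuinely spent on the pitch, including added time."""
    return (to_seconds(final_whistle) - to_seconds(came_on)) / 60


def capped_minutes(came_on: str) -> int:
    """Minutes credited under the ASSUMED provider rule: the substitution
    is recorded in the whole minute in which it happened (62:40 is the
    63rd minute) and the match is treated as ending at 90:00."""
    on_minute = math.ceil(to_seconds(came_on) / 60)
    return max(0, 90 - on_minute)


# Clock times taken from the two examples in the text.
examples = {
    "Aguero (West Brom)": ("62:40", "93:00"),
    "Lennon (Bournemouth)": ("85:12", "98:24"),
}

for player, (came_on, final_whistle) in examples.items():
    real = actual_minutes(came_on, final_whistle)
    credited = capped_minutes(came_on)
    print(f"{player}: credited {credited} min, actually played {real:.2f} min")
```

Under that assumed rule the gap for Lennon is the starkest: he is credited with less than a third of the time he was really on the pitch.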

Now there is a good reason for scaling down like this: each football match is only supposed to last for 90 minutes of uninterrupted football, and the added time at the end of the game is only there to compensate for stoppages in play due to injuries during the game. This is good for homogenising our data set for the vast majority of players who play ‘the full 90 minutes’, but it affects the integrity of the data for both the set of players substituted off the field of play and the set of those substituted on in their place. Those substituted off early are never compensated, through added time at the end of the match, for the stoppages that occurred while they were on the pitch, while those substituted on are playing football for longer than they are credited with.

Cutting Through the Noise

The importance of determining how many minutes a footballer played quickly becomes apparent when analysts summarily examine a player’s performance and turn to the staple ‘per 90’ metrics. If your data is homogenised to reflect each football match having 90 minutes of uninterrupted football, then players who are substituted off early and don’t see their missed game time compensated will have their ‘per 90’ metrics deflated, while those substituted on play more football than they are credited with and have their ‘per 90’ metrics inflated.
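To put a number on that inflation, here is a minimal sketch using Aguero’s substitute appearance from earlier (credited with 27 minutes, actually on the pitch for 30:20). The single goal is a hypothetical figure chosen purely to illustrate the calculation, not something taken from the match; the size of the inflation does not depend on it.

```python
def per90(events: float, minutes: float) -> float:
    """Scale a raw count of events to a rate per 90 minutes played."""
    return events / minutes * 90


credited_minutes = 27.0          # what the capped data says
actual_minutes = 30 + 20 / 60    # 30 minutes 20 seconds really played
goals = 1                        # HYPOTHETICAL count, for illustration only

rate_from_credited = per90(goals, credited_minutes)
rate_from_actual = per90(goals, actual_minutes)

# The ratio of the two rates is just actual / credited minutes,
# so it is the same whatever event count is plugged in.
inflation = rate_from_credited / rate_from_actual

print(f"per 90 from credited minutes: {rate_from_credited:.3f}")
print(f"per 90 from actual minutes:   {rate_from_actual:.3f}")
print(f"inflation factor:             {inflation:.3f}")
```

Because the inflation factor is simply actual minutes divided by credited minutes, short substitute appearances in games with a lot of added time are distorted the most.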

This is clearly an issue, and if analysts can adjust for these effects then they can evaluate players with a small sample size of minutes more confidently. This could be greatly beneficial in a sport that has a limited number of events even when players are playing complete seasons.

In Colin Trainor’s work linked earlier he looked at sub-effects in a variety of leagues and concluded that players substituted on or off the pitch saw their ‘per 90’ metrics boosted as cohorts. It is clear from his tables that his data comes from homogenised 90 minute football matches, whereas in reality the average game in the 2012/13 Premier League season that Colin looked at saw the clock stop at about 94:30. If we attempt to revert his cohorts’ total minutes played back to their true time on the pitch, the significance of the sub-effect changes noticeably.

To do this I have added four and a half minutes to the time played by each player who finished a match, i.e. those who played the ‘full 94:30’ and those who were substituted on and also played the added four minutes and 30 seconds at the end of a game. I have then created in the table below ‘per 94:30’ metrics for each of the ‘full match’, ‘subbed off’ and ‘subbed on’ cohorts to try and get a truer representation of sub-effects on a player’s ‘per 90’ metrics. This doesn’t completely negate the strength of sub-effects (I didn’t expect it to), but it does almost halve(!) the perceived benefit for players substituted on in this preliminary investigation. Furthermore, it does improve the metrics for players substituted off to a degree.
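As a rough sketch of that adjustment, the Python below adds 4.5 minutes per appearance to every cohort that was on the pitch at the final whistle and then rescales to a 94:30 match. All of the cohort totals are invented placeholder numbers, not Colin’s figures or mine, and the class and function names are my own. The formula for ‘per 94:30’ (goals divided by adjusted minutes, multiplied by 94.5) is my reading of the method described above.

```python
from dataclasses import dataclass

ADDED_TIME = 4.5                  # minutes added to a nominal 90 (about 94:30)
MATCH_LENGTH = 90 + ADDED_TIME


@dataclass
class Cohort:
    name: str
    appearances: int
    capped_minutes: float         # minutes as recorded with a 90:00 cut-off
    goals: int
    finished_match: bool          # on the pitch at the final whistle?


def per90(c: Cohort) -> float:
    """Conventional rate using the capped minutes."""
    return c.goals / c.capped_minutes * 90


def adjusted_minutes(c: Cohort) -> float:
    """Add the average added time once per appearance for finishers only."""
    extra = ADDED_TIME * c.appearances if c.finished_match else 0.0
    return c.capped_minutes + extra


def per9430(c: Cohort) -> float:
    """Rate per 94:30 of football, using the adjusted minutes."""
    return c.goals / adjusted_minutes(c) * MATCH_LENGTH


# PLACEHOLDER totals, made up purely to show the mechanics.
cohorts = [
    Cohort("full match", 1000, 90_000, 100, finished_match=True),
    Cohort("subbed off", 500, 35_000, 40, finished_match=False),
    Cohort("subbed on", 500, 10_000, 15, finished_match=True),
]

for c in cohorts:
    print(f"{c.name:<11} per 90: {per90(c):.4f}   per 94:30: {per9430(c):.4f}")
```

With this formula the ‘full match’ cohort is unchanged by construction, which makes it a stable baseline: the ‘subbed on’ rate falls towards it and the ‘subbed off’ rate rises, which is the direction of travel described above.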

Per90 goal statistics versus Per94:30 goal statistics.

For various reasons, such as extra substitutions in stoppage time, I believe this crude data butchery goes too far and understates the importance of sub-effects, but it does well to illustrate the importance of the integrity of your data. Ideally we would be able to pinpoint the exact moments where a football match was stopped and correctly apportion the live minutes to all 28 players featuring in a game of football. But with unaccountable referees, and no data collectors trying to determine the true amount of live football played, we can’t do this. What we can do is be aware of the limitations of our data and try to account for them where possible.