Data Science Applied to Soccer: the xPSG Sports Analytics Challenge

Jérémi DeBlois-Beaucage
The Sports Scientist
8 min read · Jul 31, 2020

This article covers the 2019 xPSG Challenge and opinions on the future of sports analytics. The Python code used is available on GitHub, and this article is also available in French.

Great soccer (or football, sorry non-North American readers!) teams are known to be on the cutting edge of training techniques, often aided by the latest technological advances. Last year, a leading European soccer team, Paris Saint-Germain (PSG), made public its interest in one of these technological innovations: sports analytics, or the application of data science and machine learning to the world of sport.

In March 2019, Paris Saint-Germain joined forces with École Polytechnique and launched the xPSG Challenge, with a grand prize of a €100,000 doctoral grant for research at Polytechnique on the subject. The objective:

“To revolutionize football using your science skills: analyse a data set based on French football championship matches and prove that science can help improve sports performance.”

As a sports and machine learning fan, I eagerly took on this challenge.

The experience exposed me to certain possibilities of big data, artificial intelligence, data science and other buzzwords, applied to sports.

In this article, I present the main aspects of the challenge, and some thoughts on the future of sports analytics.

Contents

  1. The challenge
    - Tasks
    - Data cleaning and presentation
    - Solving the problem: recurrent neural networks
    - Results and conclusion of the challenge
  2. Sports analytics: a world of possibilities
    - Real-time recommendations
    - Comparisons between players

The challenge

Tasks

For this challenge, PSG provided access to moment-by-moment, fully-detailed statistics of 2016–2017 Ligue 1 matches.

Example of the type of data available: detailed actions, with timestamp, position on the field, outcome, and team and players concerned.

Then, with 15 minutes of detailed statistics of a match not in the dataset, we had to predict:

  1. Which player is identified in the data?
    In the example above, the player in question made a long pass and a left shot, both in the opponent’s zone on the right side, so we could predict that he is a winger. Ángel Di María?
  2. What will be the position of the next action?
    The play ends in position (85,35), so we could predict that the next action will take place near this zone.
  3. Which team will be at the heart of this next action?
    The home team successfully made their last pass; we could therefore predict that the next action will be made by the home team.

Several different techniques are relevant for these tasks. For example, some tried to directly predict the player among all Ligue 1 players, while others preferred to predict the player’s team and position first.

I tried my luck on the three tasks. This article, however, focuses on the 3rd: predicting which team will make the first action after the 15 minutes of data. A similar methodology was used for the first two tasks.

First, let’s look at the data provided.

Data cleaning and presentation

The raw data is in XML format, one file per game. These files come from the supplier Opta Sports, world leader in sports data.

Raw data extract for a classic Lyon-Nantes.

The first step was to transform this raw data into .csv files, easier to process in Python.

Cleaned data in .csv format.
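
Here is a minimal sketch of that conversion step, using only the Python standard library. It assumes an Opta-style layout where each action is an `Event` element with optional `Q` (qualifier) children. The element names, the attribute names (`type_id`, `team_id`, `player_id`, `x`, `y`, `outcome`, `qualifier_id`) and the sample values are assumptions for illustration; the actual competition files may be laid out differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-event sample; real files hold well over a thousand events.
SAMPLE_XML = """
<Games>
  <Game id="1" home_team_id="143" away_team_id="430">
    <Event id="10" type_id="1" team_id="143" player_id="7" min="3" sec="12" x="52.1" y="40.3" outcome="1">
      <Q qualifier_id="1"/>
    </Event>
    <Event id="11" type_id="16" team_id="143" player_id="9" min="3" sec="15" x="88.0" y="48.2" outcome="1"/>
  </Game>
</Games>
"""

# Assumed attribute names, used as CSV columns
COLUMNS = ["id", "type_id", "team_id", "player_id", "min", "sec", "x", "y", "outcome"]


def events_to_rows(xml_text):
    """Flatten every Event element into a dict, one dict per CSV row."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.iter("Event"):
        row = {col: event.get(col, "") for col in COLUMNS}
        # Store the qualifier ids of the event as one space-separated field
        row["qualifiers"] = " ".join(q.get("qualifier_id", "") for q in event.findall("Q"))
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text (written to a string here, a file in practice)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS + ["qualifiers"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = events_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In practice the same function would be applied to one XML file per game, writing one .csv file per game.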

The only two programming languages accepted in the competition were R and Python. I am using Python here, with the Keras deep learning library.

xPSG did not accept solutions on Excel. The reason? Excel would only be useful for passive-aggressive texting, a la Kelly Rowland in 2002.

In the end, 190 matches are available, with an average of 1800 events per match. These events can be of 50 different types (passing, shooting, foul, offside, etc.). There is also a list of more than 200 qualifiers for these events: for example a pass can be called a “chipped pass” or “head pass”.

Here is a 10-second sequence of a game between AS Monaco and Paris Saint-Germain, with the corresponding data.

We observe, in order: clearing header by a Monaco player, then central pass between PSG players, chipped pass, head pass and goal with a header, in the lower right part of the net.

Solving the problem: recurrent neural networks

Note: I do not have access to the same computational resources as during the competition; I therefore propose a lite version of the solution, trainable in a few minutes, which does not require more than a basic personal computer.

The architecture is as follows: first, only the last 10 events are kept, i.e. the 10 preceding the event to be predicted. Each event has about fifty features, either numerical (latitude, longitude) or binary (is the action successful? is it a pass? is it with the head? etc.).
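
The windowing step can be sketched as follows with NumPy. The function name `last_events_window`, the zero-padding for targets that have fewer than 10 predecessors, and the toy feature values are assumptions made for this illustration.

```python
import numpy as np

SEQ_LEN = 10  # number of past events kept, as described above


def last_events_window(features, target_index, seq_len=SEQ_LEN):
    """Return the seq_len events that precede features[target_index].

    features: array of shape (n_events, n_features) for one match.
    If fewer than seq_len events precede the target, the window is
    left-padded with rows of zeros (an assumption of this sketch).
    """
    start = max(0, target_index - seq_len)
    window = features[start:target_index]
    if len(window) < seq_len:
        padding = np.zeros((seq_len - len(window), features.shape[1]), dtype=features.dtype)
        window = np.vstack([padding, window])
    return window


# Toy match: 25 events with 4 made-up features each (the real data has about fifty)
rng = np.random.default_rng(0)
match = rng.random((25, 4))

window = last_events_window(match, target_index=20)  # events 10 to 19
short = last_events_window(match, target_index=3)    # 7 padded rows, then events 0 to 2
print(window.shape, short.shape)
```

Each window has shape (10, number of features), which is the input shape a recurrent network expects for one sample.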

This sequence of events with their features is entered into a Long Short-Term Memory (LSTM) type recurrent neural network. The output of this recurrent component is connected to a network with several hidden layers. The network output is a prediction between 0 and 1: a prediction of more than 0.5 indicates the home team for the next event, and a prediction of less than 0.5 indicates the away team.

This is a conventional deep learning architecture for time series.
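
A minimal Keras sketch of such an architecture is shown below. The layer sizes, the optimizer and the helper name `build_model` are assumptions chosen to keep the example small; they are not necessarily the values used in the competition.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(seq_len=10, n_features=50):
    """LSTM over the event sequence, then dense hidden layers, then a sigmoid output."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len, n_features)),
        layers.LSTM(32),                        # recurrent component (size is an assumption)
        layers.Dense(32, activation="relu"),    # hidden layers (sizes are assumptions)
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # prediction between 0 and 1
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_model()

# Untrained model on random inputs, only to show the shapes and the 0.5 threshold
batch = np.random.rand(4, 10, 50).astype("float32")
probs = model.predict(batch, verbose=0)
home_predicted = probs[:, 0] > 0.5  # True: home team, False: away team
```

Training would then be a call to `model.fit` on the windows of past events, with a label of 1 when the home team makes the next event and 0 otherwise.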

190 matches were made available to us. Matches were divided into algorithm training (70% of matches), validation (10%) and test (20%) samples.
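
Splitting by match, rather than by event, keeps all events of a game in the same sample. A minimal sketch of such a split follows; the shuffling, the seed and the function name `split_matches` are assumptions.

```python
import random


def split_matches(match_ids, train=0.7, val=0.1, seed=42):
    """Shuffle match ids and split them into training, validation and test lists."""
    ids = list(match_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train_ids, val_ids, test_ids = split_matches(range(190))
print(len(train_ids), len(val_ids), len(test_ids))  # 133 19 38
```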

Results and conclusion of the challenge

Unfortunately, I did not qualify among the 20 finalists, and it is impossible to know what the top scores were.

Footage of me, not seeing my name in the top 20.

For this task, there were only two possible predictions: home team or away team. If the model predicts purely at random (Random), we can expect an accuracy of 50%.

Another technique would be to look at which team made the last action, and predict that this team will make the next action. This technique (Last Team Next) yields an accuracy of 63%.
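
This baseline fits in a few lines. The sketch below scores it on every pair of consecutive events in a made-up sequence, which is a simplification: in the challenge, only the event following each 15-minute extract had to be predicted.

```python
def last_team_next_accuracy(team_sequence):
    """Share of events made by the same team as the event just before."""
    if len(team_sequence) < 2:
        raise ValueError("need at least two events")
    hits = sum(prev == nxt for prev, nxt in zip(team_sequence, team_sequence[1:]))
    return hits / (len(team_sequence) - 1)


# Made-up sequence of the teams making consecutive events
events = ["home", "home", "away", "away", "away", "home", "home", "home"]
accuracy = last_team_next_accuracy(events)
print(round(accuracy, 3))  # 5 correct guesses out of 7
```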

To make the task more complex, PSG censored a few attributes usually present in the data: the event “success” indicator, and the event qualifiers.

With the proposed recurrent neural network model, trained with the censored information, we reach an accuracy of nearly 77% (Censored Info). This accuracy rises to more than 89% when the censorship is removed, i.e. when the model is given the success indicator and the event qualifiers (Full Info).

The final results.

The performance is interesting: with access to full information, an algorithm can correctly predict the next team to make an action about 9 times out of 10.

Interesting, yes. But concretely, how could this be useful?

xPSG was mostly aimed at “discovering talent in sports analytics”, and I doubt that winning algorithms would really be useful as they are. These algorithms represent a starting point towards much more interesting applications.

Sports analytics: a world of possibilities

From a technical point of view, this competition presented considerable difficulties. However, the use of big data in the world of sport presents very real possibilities.

Indeed, several similar initiatives have already been proposed. In baseball, the leading example is the 2002 season of the Oakland Athletics and general manager Billy Beane, featured in the movie Moneyball.

In basketball, the “bible” of sports analytics was published in 2004 (Basketball on Paper, by Dean Oliver), and advanced analytics are now ubiquitous in the NBA.

“The easy gains from using analytics in the N.B.A. have already been won. Each team now has an analytics staff, and coaching staffs are infused with analytics-speak. ” — The New York Times, 2019

Sports analytics are also gaining popularity in soccer, and PSG is one of the bigger teams exploring the field.

Here are two great possibilities that sports analytics could bring to soccer, in a more or less near future.

Real-time recommendations

In this case, we wanted to predict the team that would make the next event. Suppose that instead of predicting the team at the heart of the next event, we wanted to predict who will have possession of the ball next.

After a sequence of events, the player would be faced with several options: pass, dribble, cross, shoot. Which of the following options maximizes the chances of retaining ball possession in the next event? The machine’s prediction could be calculated in real time. Or, more realistically, one could compare after the game the decisions made by the player and those recommended by the algorithm.
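
As a rough sketch of the idea, a recommender could score each candidate action with a model and rank them. Everything here is hypothetical: the function name `recommend_action`, the stand-in scores in `FAKE_MODEL` and the list of candidate actions are invented for illustration, and no such model was trained for this article.

```python
def recommend_action(candidates, retention_probability):
    """Rank candidate actions by estimated probability of keeping the ball."""
    scored = [(action, retention_probability(action)) for action in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in for a trained model: fixed, made-up probabilities per action
FAKE_MODEL = {"pass": 0.81, "dribble": 0.55, "cross": 0.32, "shoot": 0.10}

ranking = recommend_action(FAKE_MODEL.keys(), FAKE_MODEL.get)
best_action, best_probability = ranking[0]
print(best_action, best_probability)
```

After the game, the same ranking could be compared with the action the player actually chose.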

What if instead of observing the possession of the ball, we rather observed what action optimizes the chances of having a shot on goal in the next 10 events?

Obviously, these predictions would only be based on observable data, and may not all be correct. But the action which, based on the algorithm’s computations, maximizes the chances of scoring a goal might be relevant; perhaps it would reveal new ways of designing attacks or set pieces, comparable to AlphaGo in 2016.

“The Google machine made a move that no human ever would. And it was beautiful.” — Wired, 2016

While algorithms are still far from having Kevin De Bruyne’s vision, they can already be useful, and will be increasingly so.

Comparisons between players

Players are often compared on measurable attributes: number of goals, successful passes, speed, etc.

Using big data, it would be possible to create a latent vector that represents a player’s contributions during a match. All passes, dribbles, shots and crosses, successful or failed, could be input data into a neural network that would create a lower-dimensional vector.

In natural language processing, we use word embeddings: each word is associated with a vector of real values, and two words with similar meanings will have similar vectors. Could this concept be applied for each player? The data presented in this article was only for “events” categorized by Opta Sports, but other data could be added, such as movements without the ball.

This vector could then be used to compare players with each other. If Di María gets injured, who should replace him? We could look at which player is the most “similar” to Di María, using a measure of similarity between their vectors, for example.
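
Cosine similarity is one common choice for comparing such vectors. In the sketch below, the three-dimensional vectors are made up and "Winger A" and "Defender B" are hypothetical players; a learned embedding would come from a trained network and have many more dimensions.

```python
import numpy as np


def most_similar(player, embeddings):
    """Rank the other players by cosine similarity to the given player's vector."""
    target = embeddings[player]
    scores = {}
    for name, vector in embeddings.items():
        if name == player:
            continue
        cosine = np.dot(target, vector) / (np.linalg.norm(target) * np.linalg.norm(vector))
        scores[name] = float(cosine)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up player vectors, for illustration only
embeddings = {
    "Di María": np.array([0.9, 0.1, 0.4]),
    "Winger A": np.array([0.8, 0.2, 0.5]),
    "Defender B": np.array([0.1, 0.9, 0.2]),
}

ranking = most_similar("Di María", embeddings)
print(ranking[0][0])  # Winger A
```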

Conclusion

From one sport to another, from one continent to another, professional teams are investing in sports analytics. This article detailed just one application of data science to soccer, but we may see more and more in the years to come. And congratulations to the winners of the xPSG Challenge!

Questions, comments, ideas? Feel free to reach out to me via LinkedIn!
