Multi-Agent Trajectory Prediction: Bringing deep learning to the beautiful game.

Ambrose Ling
Published in deMISTify · 8 min read · Nov 14, 2023

Picture from https://www.thestandard.com.hk/section-news/fc/4/254535/Saved-by-the-ball

Introduction

Growing up as a fan of football, I almost never said no to a pick-up game with friends after school. I was practically addicted to the game. For me, what makes the game so exhilarating isn’t just the satisfaction you get from dribbling past a defender, but also the ability to advance forward as a unit, making those crucial passes and creating space for your teammates to score as a team. When I was first exposed to the sport, I started to admire how players in top professional football teams, such as Barcelona, were extremely intelligent in the way they manipulated space and formed passing networks to maximize passing efficiency and success rates. Every player was purposeful with their movement and with how they influenced the movement of others. With the perfect symphony of precise passes, clever movement and individual skill, you get a winning team that is to be feared.

However, when it comes to understanding all the intricate variables that could lead to victory or defeat, the task is certainly not an easy one. Many clubs are trying to better understand the patterns that naturally occur in a successful team through various statistical methods. With the rise of artificial intelligence, there have been growing applications of machine learning algorithms that help teams learn strategies, player interactions and movement around the pitch during a game. In this article, I will talk about how the authors of [1] used deep learning, in the form of graph attention networks (GATs) and temporal convolutional networks (TCNs), to predict player positions during a game.

The Problem

What is multi-agent trajectory prediction?

In simple terms, we want to predict how the players and the ball move.

We can treat each entity that we find interesting as an agent. In the context of sports games, each player along with the ball would be an agent. So for instance, there are 11 agents in basketball (5 players per team × 2 teams + 1 ball) or 23 agents in football (11 players per team × 2 teams + 1 ball). For a given time step t, an agent’s location can be represented by a 2-D vector x^{t,k}, where k ≤ K and K is the number of agents. Hence all agents at time t can be represented by:

X^t = {x^{t,1}, x^{t,2}, …, x^{t,K}}

We can represent a series of T snapshots of the agents’ locations as a demonstration D_i:

D_i = {X^1, X^2, …, X^T}

Our full dataset can be represented as a collection of N of these demonstrations:

D = {D_1, D_2, …, D_N}

Our goal is to predict the future states of all the agents M time steps into the future, where M is some constant.
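To make the setup concrete, here is a minimal sketch of these shapes in numpy. The specific values of K, T and M, and the (time, agent, xy) axis order, are assumptions for illustration, not values from the paper:

```python
import numpy as np

K = 23          # agents in a football match (2 x 11 players + 1 ball)
T_obs = 20      # observed time steps (hypothetical)
M = 10          # time steps to predict into the future (hypothetical)

# One demonstration D_i: agent locations over T_obs + M time steps.
demo = np.random.rand(T_obs + M, K, 2)

# The model observes the past states of all agents ...
past = demo[:T_obs]          # shape (T_obs, K, 2)
# ... and must predict the future states of all agents.
future = demo[T_obs:]        # shape (M, K, 2)
```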

Why perform multi-agent trajectory prediction?

The goal of multi-agent trajectory prediction is to forecast the movement of each of these agents some time steps into the future, given the past movement of all agents. For sports, being able to successfully predict agent trajectories could become extremely useful in devising tactics, tailoring attacking strategies to exploit weaknesses of opponents, discovering partnerships between players, and more. These predictions could inspire small improvements in team play, potentially leading to the team securing those 3 points.

There are also other applications of multi-agent trajectory prediction, such as autonomous driving, where the agents are now vehicles. The difficulty there lies in accurately predicting the behaviour of different agents based on complex scene context (what is happening in the environment). Powerful predictive methods can help autonomous vehicles make informed decisions that keep drivers safe. However, we are going to focus on the sports application in this article.

Method

Introduction to proposed architecture

The authors developed a neural network architecture that combines the power of graph attention networks and temporal convolutional networks to simultaneously learn the intricate spatial relationships between players and the evolution of agent coordinates over time. Let’s dive into how they work!

What are graph attention networks?

We can start by representing all the agents as nodes in a graph G with a set of nodes V and a set of edges E. As before, we denote the 2-dimensional location of agent k at time t with a vector x^{t,k}, for t ≤ T (the time step of interest) and k ≤ K (the number of agents).

We can then denote the feature vector of agent i in our graph as h_i. Our graph attention layer receives h, the node feature matrix for all agents, as input and outputs a new learned node embedding matrix h’.

The key idea behind this graph attention layer is to weigh the importance of each node to every other node in the graph by computing pair-wise attention scores.

We can do so by performing a series of linear transformations on the node feature vectors h_i and h_j with learnable weights W and a, giving a score e_ij = LeakyReLU(aᵀ[W h_i ‖ W h_j]) that represents the importance of node j’s features to node i [4]. I personally like visualizing the step-by-step matrix transformations, so here’s a diagram to show it:

Figure showing the attention score computation in matrix form.

We then normalize the scores across the neighbors of node i using the softmax function, giving attention coefficients α_ij. This relies on a principle called masked attention: nodes are only allowed to attend to their neighbors, so distant, unrelated nodes are masked away before the normalization. Below shows the row-wise normalization with softmax:

With the attention scores computed, we can compute a new embedding for agent i by aggregating the weighted node embeddings over the neighborhood N_i of i: h’_i = σ(Σ_{j∈N_i} α_ij W h_j).

Lastly, we perform a pooling/aggregation step to create an overall representation of the entire graph by summing all the final agent embedding vectors into one graph embedding g^t at time t.
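The steps above can be sketched in numpy as a single-head graph attention layer in the style of [4]. This is a minimal illustration, not the authors’ implementation; the function and variable names are my own:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gat_layer(h, adj, W, a):
    """One single-head graph attention layer (sketch).

    h:   (K, F)   node features, one row per agent
    adj: (K, K)   1 where j is a neighbor of i (incl. self), else 0
    W:   (F, F')  shared linear transform
    a:   (2*F',)  attention vector
    """
    Wh = h @ W                                # (K, F')
    K_, Fp = Wh.shape
    # Pairwise scores e_ij = LeakyReLU(a^T [Wh_i || Wh_j]),
    # computed by splitting a into its "left" and "right" halves
    left = Wh @ a[:Fp]                        # (K,)
    right = Wh @ a[Fp:]                       # (K,)
    e = left[:, None] + right[None, :]        # (K, K)
    e = np.where(e > 0, e, 0.2 * e)           # LeakyReLU
    # Masked attention: non-neighbors get -inf before the softmax
    e = np.where(adj > 0, e, -1e9)
    alpha = softmax(e, axis=1)                # rows sum to 1
    # Aggregate the weighted, transformed neighbor features
    return alpha @ Wh                         # (K, F')

# Graph embedding at time t: sum-pool the agent embeddings
# g_t = gat_layer(h, adj, W, a).sum(axis=0)
```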

What are Temporal Convolutional Networks?

Now, since we want to predict how the agents’ movement evolves over time, we need a way to aggregate past information about each agent’s movement. The authors of this paper used a sequence model called the Temporal Convolutional Network, which is essentially a stack of 1-D dilated causal convolutional layers.

The model takes in the sequence of graph embeddings generated by the attention block / spatial module:

{g^1, g^2, …, g^T}

and predicts a sequence of future graph embeddings:

{g^{T+1}, g^{T+2}, …, g^{T+M}}

To refresh your memory on 1-D causal convolution, I made the following diagram to illustrate the idea. Every entry in the output sequence results from convolving a kernel filter with the input sequence (i.e. the last element of the output sequence is the convolution of the last 3 elements of the input sequence with a filter of size 3) [3].

Figure showing 1D convolution; every 3 lines represents a convolution with a kernel of size 3

For a convolution to be causal, every entry in the output sequence must depend only on entries in the input sequence at the same time step or earlier (i.e. within my sequence, information at time t+1 should not influence what happens at time t or any time before it). To keep the output the same length as the input, zero-padding is applied, where zero-valued elements are added to the beginning of the sequence (boxes in grey).
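Here is a minimal numpy sketch of a causal 1-D convolution with start-of-sequence zero-padding. The kernel is ordered oldest-to-newest, so y[t] only mixes x[t] and earlier entries:

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] depends only on x[t] and earlier entries.

    w is ordered oldest-to-newest:
    y[t] = w[0]*x[t-k+1] + ... + w[-1]*x[t]
    """
    k = len(w)
    # Zero-pad the *beginning* so the output keeps the input's length
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ w for t in range(len(x))])
```

With w = [0, 0, 1] (only the newest tap active), the output reproduces the input exactly, which confirms that nothing from the future leaks backwards.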

But there is still a limitation, because ideally an element of the output should be influenced by all the elements that come before it in the input. (A player’s movement could result from space created by another player many time steps ago, or a decision made by a teammate at the beginning of the game, not just 3 time steps ago.) This is where dilation comes in: it expands the receptive field of the kernel by inserting gaps between the input positions the kernel reads.

Figure showing 2 layers of 1D convolution; the first layer has dilation factor 1, the second has dilation factor 2 (1 gap between consecutive taps of the kernel filter)

If you stack 1-D convolutional layers with exponentially increasing dilation factors (the why is covered in [3]), you form a TCN with full history coverage, meaning each element of the output is influenced by all the elements that come before it in the input.

Figure showing overall TCN architecture; each block represents a series of 1D convolution layers with exponentially increasing number of dilations (d) with base 2.
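The dilation idea can be sketched by spacing the kernel taps `dilation` steps apart; the helper below also computes how fast the receptive field grows when dilations double at each layer. Both functions are illustrative sketches, not the paper’s code:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal 1-D convolution whose kernel taps are `dilation` steps apart.

    w is ordered oldest-to-newest; y[t] mixes x[t], x[t-d], x[t-2d], ...
    """
    k = len(w)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[i] * padded[t + i * dilation] for i in range(k))
        for t in range(len(x))
    ])

def receptive_field(k, num_layers):
    """Receptive field of `num_layers` stacked causal conv layers with
    kernel size k and dilations 1, 2, 4, ..., 2^(num_layers-1)."""
    return 1 + (k - 1) * (2**num_layers - 1)
```

With kernel size 3, four layers already cover 31 past time steps, which is why exponentially growing dilations reach the full history with only a few layers.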

The authors also used non-linear activation functions to introduce non-linearities into the outputs for learning more complex representations. Residual connections were used to ease training, and dropout was used as a regularization technique to combat overfitting.

The Final Model

Putting everything together, you get a graph attention encoder that encodes the spatial relationships between all agents into a global graph embedding at each time step, and a sequence of these embeddings gets passed into the TCN decoder to learn the temporal dependencies and predict a sequence of future graph embeddings. These predicted embeddings are then passed through a fully connected layer and converted back to (x, y) coordinates.

Figure showing overall architecture of the multi-agent trajectory prediction model
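A shape walk-through of that pipeline, with random stand-ins for the trained modules (all sizes here are hypothetical, chosen only to make the tensor shapes concrete):

```python
import numpy as np

K, F = 23, 32        # agents, graph-embedding size (hypothetical)
T_obs, M = 20, 10    # observed / predicted time steps (hypothetical)

# 1) Spatial module: one graph embedding per observed time step
graph_embeddings = np.random.rand(T_obs, F)        # g^1 ... g^T

# 2) Temporal module (stand-in for the TCN): map the T_obs past
#    embeddings to M future ones
W_tcn = np.random.rand(T_obs, M)
future_embeddings = W_tcn.T @ graph_embeddings     # (M, F)

# 3) Fully connected head: each future embedding -> (x, y) per agent
W_fc = np.random.rand(F, K * 2)
coords = (future_embeddings @ W_fc).reshape(M, K, 2)
```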

Results

The authors used L2 error, max error and miss rate as their evaluation metrics, and observed the performance of their model on a football dataset as well as a basketball dataset where each demonstration was 10 seconds long. Their results show that for a majority of the metrics, the model outperformed existing approaches in predicting the agents’ trajectories by a significant margin for mixed teams.

Table showing the result comparisons between SOTAs and the author’s model tested on both basketball and football datasets.

An interesting result they reported was that attention-based techniques performed better on predictions for a specific team than on predictions for mixed teams. This could be because attention is able to pick up on event-specific patterns of a particular team.

Next Steps

As machine learning methods continue to advance, there are going to be more and more opportunities to improve sports analytics. Although the authors did not explore intentions beyond trajectory prediction, I personally think the possibility of deep learning models learning abstract representations of tactics is getting closer and closer, perhaps even generative AI models for creating new attacking plans. I can certainly see these models being extended beyond basketball and football to inspire new, artificial-intelligence-driven strategies for other sports.

Sources

  1. https://arxiv.org/abs/2012.10531
  2. https://arxiv.org/abs/1611.05267
  3. https://unit8.com/resources/temporal-convolutional-networks-and-forecasting/
  4. https://arxiv.org/abs/1710.10903
