Graph Neural Networks for Multiple Object Tracking

Rishikesh Dhayarkar
Sep 2, 2020


Multiple object tracking (MOT) is the task of analyzing object appearance and movement over time to recover object trajectories. Given an input video, the algorithm must determine which portions of the image represent the same object across different frames. Algorithms like these can be used to solve some exciting problems: analyzing a particular soccer player’s movements during a game, predicting whether a person is going to cross the street, or tracking the movement of microscopic organisms in time-lapse microscopy images.

figure 1 source

Multiple object tracking solutions fall into two categories:

  1. Online tracking — these algorithms process two frames at a time. They are quite fast, which makes them well suited for real-time tracking. However, they cannot recover from occlusions or errors; a common workaround is to simply drop the current trajectory and start a new one.
  2. Offline tracking — these algorithms process a batch of frames at once. Because they see a whole batch, they can recover from occlusions and errors. This comes with significant computational overhead, making them unsuitable for real-time applications, but they are widely used for video analysis.

In this article, we will go through a state-of-the-art offline tracking framework for MOT. The approach we are about to discuss was published in a paper by researchers at the Dynamic Vision and Learning Group at TUM. Their proposed algorithm achieved SOTA on the MOT15, MOT16, and MOT17 challenges.

The authors of the paper propose a network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs).

Since MPNs are an integral component of this paper, I’m going to take a few minutes to walk you through what MPNs are and what they do. MPNs come under the realm of deep learning on graphs. Most computer vision happens in the image domain, where the order of the pixels is important for the representation of an image; this property is exploited by CNNs to extract the most useful information from a patch of the input image. When we move to a different domain like 3D point clouds [figure 2], we cannot employ our favorite CNN-based architectures to capture the most important features and relationships in the input data. This is because point clouds are irregular: there is no sense of order, beginning, or ending in them.

figure 2 source

In a typical fully connected network (FCN) or CNN, we supply the input to the very first layer as a flattened tensor, and it gets processed layer by layer until it reaches the loss function. In graphical models, by contrast, we have nodes and edges, which carry embeddings of the information we are interested in processing.

figure 3 Source

In the case of graph neural networks (GNNs), there is no concept of a layer, so layer-wise processing of information is absent. Instead, we have an “information propagation” step in which a node gets updated using the information from all neighboring nodes and edges. Once all the nodes in the graph are updated, the equivalent of one hidden layer of processing is completed [figure 3]. The graph then goes through a ReLU and the process repeats until the output. If you look at the figure above, you will notice that in the second layer the nodes receive information from nodes that are two hops away; in the third layer, from nodes that are three hops away, and so on. So, by adding more hidden layers, each node receives information from nodes that are further and further away. Remember that all the units of information being moved around are embeddings. The output is a graph containing context-aware node and edge embeddings.

figure 4 source

Notice that after L iterations, every node contains information from all the nodes that are within a distance L in the graph. This is analogous to the concept of receptive fields in CNNs, allowing embeddings to capture contextual information. The animation in [figure 4] depicts this well.

Let’s make this a little more concrete with some equations.

figure 5: Initial notation

G is a graph with nodes V and edges E. h denotes an embedding, which can be obtained from a CNN, an RNN, or any other type of NN. The superscript on h is the message-passing step (also called the information propagation step), denoted by the letter l. h(i, j) is the embedding of the edge (i, j) between nodes i and j, and h(i) is the node embedding of node i.

figure 6

This is the initial organization of the graph [figure 6]. Every node gets features (embeddings) from its neighbors. Yellow box — node embedding; green box — edge embedding.

figure 7

These embeddings are then combined using some order-invariant operation like summation or averaging; the term “message passing” can be inferred from this figure. An order-invariant operation is used because of the orderless nature of graphs [figure 7].

Below is the equation for the message aggregation (“m”) combined with a learnable function [figure 8]. We provide a learnable function so that the model can learn which of the neighboring nodes are important or relevant, and by how much. The order-invariant operation here is a summation.

figure 8 source

“m” is the message aggregated from the neighbors of “v” in order to update the node “v”. “u” is a neighbor of “v”, and the set of neighbors of “v” is denoted “Nᵥ”. “M¹” is a learnable function, like a neural network or an MLP (multi-layer perceptron), that takes three inputs: the node embedding of the neighbor, the node embedding of the node being updated, and the embedding of the edge connecting “u” and “v”. All node update messages follow this equation, and the weights of “M¹” are shared across all nodes.

figure 9 source

The aggregated message that we just calculated is used to compute the new embedding of node “v” at step (l+1) [figure 9]. This completes the update of one node. The same procedure is repeated for all nodes at every information propagation step. “U¹” is once again a learnable function like an NN or MLP, whose weights are shared across all node updates.
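To make these two equations concrete, here is a minimal PyTorch sketch of one message passing step. The embedding sizes and the names msg_fn (standing in for M¹) and update_fn (standing in for U¹) are illustrative assumptions of mine, not the paper’s actual architecture.

```python
import torch
import torch.nn as nn

# Toy dimensions; real models would choose these per task.
node_dim, edge_dim = 32, 16
msg_fn = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU())   # M1
update_fn = nn.Sequential(nn.Linear(2 * node_dim, node_dim), nn.ReLU())           # U1

def mpn_step(h_node, h_edge, neighbors):
    """h_node: {v: tensor}, h_edge: {(u, v): tensor} with u < v,
    neighbors: {v: [adjacent nodes]}."""
    new_h = {}
    for v, h_v in h_node.items():
        # One message per neighbor, aggregated with a sum (order invariant).
        msgs = [msg_fn(torch.cat([h_node[u], h_v, h_edge[(min(u, v), max(u, v))]]))
                for u in neighbors[v]]
        m_v = torch.stack(msgs).sum(dim=0)
        # Combine the old embedding with the aggregated message (figure 9).
        new_h[v] = update_fn(torch.cat([h_v, m_v]))
    return new_h
```

The same msg_fn and update_fn are applied at every node, which is exactly the weight sharing described above.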

The above two equations [figure 8 and 9] represent a general formulation for a message-passing network. Most message passing networks are an instance of this formulation. Let’s take an example.

Ex 1:

figure 10 source

MLP₁ is an NN that determines how much of the newly aggregated message is important, and MLP₂ determines how important the embedding of the current node is. Note that the learnable function can be an MLP, CNN, RNN, etc.

Ex 2: Graph convolutional networks

figure 10 source

Note the difference in the summation [figure 10]: in this example, we consider a node to be a neighbor of itself (a self-loop). There is also a change in the normalization term (the denominator) of the message aggregation equation to account for this self-loop. “W” is a learnable weight matrix.
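As a sketch, the GCN update can be written with a dense adjacency matrix for clarity (real implementations use sparse operations; the function name and shapes are my own choices):

```python
import torch
import torch.nn as nn

def gcn_layer(H, A, W):
    """H: (N, d) node embeddings, A: (N, N) binary adjacency, W: nn.Linear."""
    A_hat = A + torch.eye(A.size(0))        # add self-loops
    deg = A_hat.sum(dim=1)                  # degrees including the self-loop
    D_inv_sqrt = torch.diag(deg.pow(-0.5))  # the normalization term
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ W(H))

# Usage: W = nn.Linear(d, d_out, bias=False); H1 = gcn_layer(H0, A, W)
```

The symmetric normalization D⁻¹ᐟ² Â D⁻¹ᐟ² is what the changed denominator in the equation corresponds to.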

The network that we have discussed so far accounts only for node updates. To incorporate edge updates we need to add some more equations. This is done in two steps: a node-to-edge update and an edge-to-node update. This can be seen in the figure below [figure 11].

figure 11 source

First, we perform the “node-to-edge update”, where we generate a new edge embedding. This involves a learnable function “Nₑ” that takes three inputs: the embedding of node “i” from the previous step, the embedding of node “j” from the previous step, and the embedding of the edge (i, j) from the previous step. The square brackets indicate that these three embeddings are concatenated before being fed to the learnable function [figure 12].

figure 12 source

Next, we perform the “edge-to-node update”, where we generate a new node embedding. First we perform an aggregation to generate a message, then we compute the new node embedding from the aggregated message. Here “Φ” is an order-invariant operation like mean, sum, or max [figure 13].

figure 13 source
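A minimal sketch of the two updates, with illustrative dimensions; pairing the node’s own embedding with each updated incident edge inside the message is one common instantiation of the aggregation in figure 13, so treat the details as assumptions:

```python
import torch
import torch.nn as nn

node_dim, edge_dim = 32, 16
N_e = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, edge_dim), nn.ReLU())
N_v = nn.Sequential(nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())

def node_to_edge(h_i, h_j, h_ij):
    # Concatenate both node embeddings with the old edge embedding (figure 12).
    return N_e(torch.cat([h_i, h_j, h_ij]))

def edge_to_node(h_i, incident_edges):
    # One message per updated incident edge, then an order-invariant sum.
    msgs = [N_v(torch.cat([h_i, h_ij])) for h_ij in incident_edges]
    return torch.stack(msgs).sum(dim=0)
```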

This MPN formulation forms the heart of our MOT algorithm. With that brief intro to MPNs, let’s move on to the main focus of this article: multiple object tracking.

The authors of this paper tackle the MOT problem using the tracking-by-detection paradigm. This two-step approach consists of first generating object detections using any of the popular object detection algorithms (Faster R-CNN, Mask R-CNN, etc.), and then using these detections to perform data association with a graphical model. Every node in this graph represents an object detection, and an edge between two nodes represents the relationship between the two detections. It is the job of our graphical model to capture higher-order features between nodes (object detections) by performing message passing. Once message passing is complete and the edges have converged to their final embeddings, a binary classifier is applied to the edges. Edges belong to one of two classes: an active edge indicates that the two detections it connects belong to the same trajectory, while an inactive edge does not.

A pictorial representation should make things clearer.

figure 14 Source

The above picture [figure 14] shows the output from a detector, frame by frame, at time steps t, t+1, t+2, and so on. These detections form the nodes of our graph.

figure 15 source

Note that the performance of the proposed algorithm depends on the quality of the detector. All the nodes at time step “t” are connected to all nodes at time step “t+1”. The job of our graphical model is to find which of these edges belong to the same trajectory.

figure 16

That’s how our graph would look after the algorithm has converged [figure 16]. The red, yellow, and green edges belong to particular trajectories, and the blue edges are the ones that do not contribute to any trajectory. The task of dividing a set of detections into trajectories thus amounts to grouping the nodes of the original graph into disconnected components.

Now that the objective of this algorithm is clear, let’s pose the problem statement more formally [figure 17]. “O” is the set of input detections, and “n” is the total number of objects across all frames of the video. Each detection in “O” is represented as oᵢ = (aᵢ, pᵢ, tᵢ), where aᵢ is the raw pixels of the bounding box, pᵢ is its 2D image coordinates, and tᵢ is its timestamp. A trajectory “Tᵢ” is defined as a time-ordered set of object detections. The goal of MOT is to find the set of trajectories, T*, that best explains the detections in “O”. Let G = (V, E) be an undirected graph with nodes V := {1, …, n} and edges E.

figure 17 source

To perform the graph partition, we use a binary variable yᵢⱼ for every edge between nodes i and j: yᵢⱼ is 1 if the edge (i, j) belongs to a trajectory Tₖ ∈ T*, and 0 otherwise [figure 18].

figure 18 source
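Written out, the definition in figure 18 is (following the description in the paragraph above):

```latex
y_{ij} =
\begin{cases}
1 & \text{if } (i, j) \in T_k \text{ for some } T_k \in T^* \\
0 & \text{otherwise}
\end{cases}
```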

An edge is active when yᵢⱼ = 1. The authors of this paper assume that a node cannot belong to more than one trajectory, so yᵢⱼ is subject to some constraints [figure 19].

figure 19 source

These constraints make sure that every node i ∈ V at time step “t” gets linked via an active edge to at most one node at time step t-1 and at most one node at time step t+1. Every model based on network flows has some sort of constraint that guides the optimization procedure; this flow conservation principle is the one used in this approach. A simplified version of the network flow framework is used in this paper. We will not be discussing network flows for MOT in this article, but if you are interested I recommend you go through this paper.
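One way to write out the constraints from figure 19 (stated generally over past and future frames rather than just t-1 and t+1; treat the exact form as my reading) is:

```latex
\sum_{(j,i) \in E,\; t_j < t_i} y_{ji} \le 1
\quad \text{and} \quad
\sum_{(i,k) \in E,\; t_i < t_k} y_{ik} \le 1
\qquad \forall\, i \in V
```

Each node may have at most one active incoming edge from the past and at most one active outgoing edge to the future.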

Note: I’ve mentioned that the inputs to our graph are detections from the object detector, but this is not the full story. Each detection needs to be processed into two components: appearance features and motion/geometric features. Our model needs to learn what the objects look like (appearance), and it needs to know the motion and position of the objects so that it can track their trajectories. The point is that we need to provide both appearance information and motion/geometry information to our graphical model: appearance information goes into the nodes, and geometric information goes into the edges.

Now it’s time to bring in the content from the MPN section. Given a set of detections, the model is supposed to predict the value of the binary variable “y” for every edge of the graph. To achieve this, MPNs are used. There are four stages in the pipeline proposed by the authors of this paper.

  1. Graph construction — the given detections form the nodes of the graph, and the edges are the connections between the nodes.
  2. Feature encoding — appearance feature embeddings (node embeddings) are obtained from a CNN applied to the bounding box image. Geometric feature embeddings (edge embeddings) are obtained for each pair of detections in different frames; these are computed by feeding the relative bounding box size, position, and timestamp to an MLP.
  3. Neural message passing — nodes share appearance information and edges share geometric information in each round of message passing, producing updated embeddings at the end of every round.
  4. Training — the edge embeddings generated in the last round of message passing are used to perform binary classification into active and inactive edges. Like most binary classification problems, the model is trained with a cross-entropy loss.

The entire pipeline can be depicted in the following manner [figure 20].

figure 20 source

At test time, we take the model’s prediction for each edge (a value between 0 and 1) and round it to obtain the final trajectories.

Remember that the goal of every message passing step is to learn the weights of the learnable functions in the MPN equations so that relevant information gets passed between the nodes and edges. The message passing procedure is the same as the one we discussed, except for a few modifications; let’s talk about those modifications now.

As discussed, this is the equation for the node-to-edge update.

And this is the edge-to-node update:

figure 21 source

Here ‘Nₑ’ and ‘Nᵥ’ are learnable functions like MLPs, ‘Nᵢ’ is the set of nodes adjacent to node ‘i’, and ‘Φ’ is an order-invariant operation like sum, mean, or max [figure 21].

This is what we have seen before; now let’s build on top of it. One of the most important contributions of this paper is time-aware message passing, a minor but essential tweak to the above equations.

This tweak explicitly encodes the temporal structure of the graph by separating the aggregation step into two parts: the first over nodes in the past, and the second over nodes in the future. The figure below demonstrates how this separation is done [figure 22].

figure 22 source
figure 23 source

That’s the whole time-aware update. Notice that only the equation for the edge-to-node update changes. Also notice that ‘hᵢ⁰’ (the initial node embedding) is now included in the message computation step.
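A sketch of what the split aggregation might look like in code; the names N_past and N_fut and all dimensions are my own assumptions:

```python
import torch
import torch.nn as nn

node_dim, edge_dim = 32, 16
# Separate learnable functions for past and future neighbors.
N_past = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU())
N_fut = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU())

def time_aware_update(h_i, h_i0, past_edges, future_edges):
    """h_i0 is the initial node embedding, included in every message;
    past_edges / future_edges hold updated embeddings of the edges
    toward earlier / later frames."""
    m_past = torch.stack([N_past(torch.cat([h_i, h_i0, e])) for e in past_edges]).sum(0)
    m_fut = torch.stack([N_fut(torch.cat([h_i, h_i0, e])) for e in future_edges]).sum(0)
    # Concatenation keeps the past and future aggregates separate.
    return torch.cat([m_past, m_fut])
```

The concatenated message would then be fed to the node-update function, so the model can treat information arriving from the past differently from information arriving from the future.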

We are near the end of this article now, and there are only two more topics left to discuss: 1) How are the initial appearance and geometric embeddings generated? 2) How are training and inference performed?

  1. How are the initial appearance and geometric embeddings generated?

Appearance embeddings — remember that the global input to this MOT pipeline is the set of object detections from the detector. These RGB patches of the video frames are fed to a CNN to generate a feature embedding for every detection oᵢ ∈ O.

figure 24 source

That’s how the initial embedding ‘hᵢ⁰’ is calculated: “Nᵥ enc” is the CNN and “aᵢ” is the image patch.

Geometric embeddings — these are computed for edges, which means they are computed from a pair of detections in different frames. Here we make use of the relative position, size, and distance in time for every pair of detections “oᵢ” and “oⱼ” with tᵢ ≠ tⱼ. We use the bounding box coordinates, heights, and widths, (xᵢ, yᵢ, hᵢ, wᵢ) and (xⱼ, yⱼ, hⱼ, wⱼ), to calculate the relative distance and size.

figure 25 source

Once these relative measures are obtained, we concatenate them with the time difference (tᵢ−tⱼ) and the relative appearance. Below is the equation for the relative appearance [figure 26].

figure 26 source

Once we have this concatenated vector with all the relative measures in it, we feed it to an MLP to obtain the initial edge embedding, “hᵢ,ⱼ⁰”.
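Putting the pieces together, here is a hedged sketch of the initial edge features. The exact normalization of the relative terms and the MLP size are my reading of figures 25 and 26 and should be treated as assumptions:

```python
import math
import torch
import torch.nn as nn

edge_enc = nn.Sequential(nn.Linear(6, 16), nn.ReLU())  # illustrative MLP

def initial_edge_embedding(box_i, box_j, t_i, t_j, app_i, app_j):
    """box: (x, y, h, w) of a detection; app: its CNN appearance embedding."""
    x_i, y_i, h_i, w_i = box_i
    x_j, y_j, h_j, w_j = box_j
    feats = [2 * (x_j - x_i) / (h_i + h_j),    # relative x-position
             2 * (y_j - y_i) / (h_i + h_j),    # relative y-position
             math.log(h_i / h_j),              # relative height
             math.log(w_i / w_j),              # relative width
             t_i - t_j,                        # time difference
             torch.dist(app_i, app_j).item()]  # relative appearance (L2 distance)
    return edge_enc(torch.tensor(feats))
```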

2. How are training and inference performed?

An MLP with a sigmoid output is used for the task of edge classification. For every edge (i, j) ∈ E, we compute its predicted label by feeding the classifier the edge embedding from the MPN at a given message passing step “l”. Binary cross-entropy is used as the objective function.

figure 27 source

‘ŷ’ — predicted label; ‘y’ — true label.

At inference time, we use the edge embeddings from the final message passing step to obtain predictions between 0 and 1, which are rounded to the nearest integer to get the predicted labels.
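A sketch of the classifier head, loss, and rounding step; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(16, 8), nn.ReLU(),
                           nn.Linear(8, 1), nn.Sigmoid())
bce = nn.BCELoss()

def edge_loss(edge_embeddings, labels):
    """edge_embeddings: (E, 16) embeddings from a message passing step;
    labels: (E,) ground-truth values of y in {0, 1}."""
    y_hat = classifier(edge_embeddings).squeeze(-1)  # predictions in (0, 1)
    return bce(y_hat, labels.float())

# Inference: round the sigmoid outputs to get active (1) / inactive (0) edges.
def predict(edge_embeddings):
    return (classifier(edge_embeddings).squeeze(-1) > 0.5).long()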

Paper results

This table shows how the time-aware message passing update improves tracking performance w.r.t. the vanilla MPN [figure 28]. (An upward arrow next to a score means higher is better; a downward arrow means lower is better.)

figure 28 source

Remember that elaborate formulation we discussed for getting the edge embeddings (the concatenation of the time difference, relative appearance, and relative position)? It turns out that this is the best-performing form of embedding for the MOT task [figure 29].

figure 29 source

Based on what we have discussed, what’s your take on the relationship between the MOTA score and the number of message passing steps? The more message passing steps, the more contextual information gets captured, and hence the higher the MOTA, right?

Well, that’s not entirely true: the MOTA score saturates after a certain point, in fact after just 4 message-passing steps! [figure 30]

figure 30 source

A comparison with previous MOT algorithms is shown below [figure 31].

figure 31 source

Phew! That was a pretty long article. If you are with me till here, then I’m going to assume that you are willing to go a little further, so I highly recommend reading the implementation details given in the paper. Thanks for reading :)

References and recommended reading

  1. Original paper
  2. MPN and graphical models
  3. Network flows paper
  4. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker
  5. CV3DST course at TUM
  6. Neural message passing for quantum chemistry
  7. Semi-supervised classification with graph convolutional networks
  8. Relational inductive biases, deep learning, and graph networks
