Track a High Speed Moving Object with TrackNet

Samuel Sung · Published in Geek Culture · Oct 13, 2021

Image Source: Hawk-Eye

It is amazing how technology these days can change the flow of a game. Goal-line technology, VAR, and Hawk-Eye are all examples of how modern technology is used in professional sports. The ultimate goal of these technologies is to prevent human error and promote unbiased calls from referees.

Computer vision has come a long way, and it offers various promising tools for players, coaches, and fans to further improve their games. For example, in tennis, the trait that sets great players apart from good players is their ability to handle different trajectories of incoming balls.

Thus, it would be valuable to watch an opponent's footage with the ball automatically tracked and its trajectory projected. This gives you a better understanding of how your opponent handles different balls in different circumstances and, hopefully, helps you spot a weakness to attack during the match.

This article introduces TrackNet, a CNN for tracking tiny, high-speed objects.

Personally, I find TrackNet a great example of integrating computer vision into sports, and the method the authors present is clever and straightforward.

As a preview, the video below shows the result of TrackNet [1] on a short clip of me playing tennis. As you can see, the detection is fairly accurate, tracking the ball even when it is occluded.

Prediction of TrackNet.

TrackNet

The paper was presented at the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) by researchers from National Chiao Tung University.

TrackNet is a heatmap-based deep learning network that not only recognizes the tennis ball in a single frame but also learns flying patterns from consecutive frames, which enables the model to estimate the location of the ball precisely even when it is occluded by players or other objects.

The precision, recall, and F1-measure of TrackNet are 99.7%, 97.3%, and 98.5%, respectively, significantly higher than those of a conventional image-processing approach, Archana's algorithm [2].

Dataset

The main dataset comes from the broadcast video of the tennis men's singles final at the 2017 Summer Universiade. The videos were edited down to 20,844 game-related frames, covering each point from the serve until the score.

Kudos to the authors, who managed to label more than 20,000 frames by hand to train the model from scratch.

Each frame was labeled with the following:

  1. Frame Name: the name of the frame file.
  2. Visibility Class: the visibility of the ball in each frame. 0) indicates the ball is outside the frame, 1) indicates the ball is visible within the frame, and 2) indicates the ball is in the frame but cannot be located by the human eye.
  3. (X, Y): the ball's position in pixel coordinates. Because of the ball's high speed, it may appear blurred, as seen in Figure 1. In such cases, the latest position along the ball's trace is taken as the ball's position.
  4. Trajectory Pattern: the type of ball movement, classified into 0) flying, 1) hitting, and 2) bouncing.
Figure 1. The coordinate of the ball, which is at the latest position of the ball's trace, is marked in red.
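To make the labeling scheme concrete, a few label rows might look like the following (a hypothetical layout for illustration only; the actual file format in the TrackNet repository may differ):

```python
# Columns: frame name, visibility class, (x, y) in pixels, trajectory pattern
labels = [
    ("0001.png", 1, (512, 288), 0),    # ball visible in the frame, flying
    ("0002.png", 1, (518, 281), 1),    # ball visible in the frame, hitting
    ("0003.png", 0, (None, None), 0),  # ball outside the frame, no coordinate
]
```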

To prevent overfitting, 9 different tennis court settings (clay, grass, hard, etc.) were recorded to generate an additional 16,118 frames, labeled in the same way. This helps the model generalize and perform better across different settings.

The ground truth is a heatmap: an amplified 2D Gaussian distribution centered on the tennis ball. The variance of the Gaussian corresponds to the diameter of the ball in the images, which the authors assume to be about 10 pixels. The Gaussian heatmap can be written as

G(i, j) = \exp\left(-\frac{(i - x_0)^2 + (j - y_0)^2}{2\sigma^2}\right) \times 255

  • The exponential term is a Gaussian distribution centered at (x_0, y_0) with variance \sigma^2
  • The factor of 255 scales the heatmap to the range [0, 255]

Figure 3 shows an example of the Gaussian heatmap.

Figure 3. Gaussian distribution at the center of the ball.
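In code, generating such a ground-truth heatmap could look like the minimal NumPy sketch below (the frame size, ball position, and exact value of sigma are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def gaussian_heatmap(width, height, x0, y0, sigma=5.0):
    """Ground-truth heatmap: a 2D Gaussian centered on the ball at (x0, y0),
    scaled to grayscale values in [0, 255]."""
    j, i = np.mgrid[0:height, 0:width]  # per-pixel row (j) and column (i) indices
    g = np.exp(-((i - x0) ** 2 + (j - y0) ** 2) / (2 * sigma ** 2))
    return (g * 255).astype(np.uint8)   # scale and truncate to integers

heatmap = gaussian_heatmap(640, 360, x0=320, y0=180)  # ball at the frame center
```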

Model Architecture

TrackNet is composed of a convolutional neural network (CNN) followed by a deconvolutional network, as shown in Figure 2. The input is a stack of consecutive frames; the authors propose that using more than one input frame improves the model's ability to spot moving objects by learning their trajectory patterns.

Figure 2. TrackNet Architecture.

Feature extraction is performed with VGG16 and spatial localization is done with DeconvNet.

The network's output has the same height and width as the input image but 256 channels, one for each grayscale value in [0, 255]. A softmax activation turns each pixel's 256 channel values into a probability distribution over the possible grayscale values, and the channel with the highest probability is selected as that pixel's value in the predicted heatmap of the tennis ball.
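As a rough PyTorch sketch of this encoder-decoder idea (the layer counts and channel sizes here are simplified assumptions; the authors' actual implementation follows VGG16 and DeconvNet and differs in detail):

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_layers):
    # VGG-style block: n_layers of 3x3 conv + batch norm + ReLU
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class TrackNetSketch(nn.Module):
    """Simplified TrackNet-like network: VGG16-style encoder, DeconvNet-style
    decoder, and a 256-channel output (one channel per grayscale value)."""
    def __init__(self, in_frames=3):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(in_frames * 3, 64, 2), nn.MaxPool2d(2),  # RGB frames stacked
            conv_block(64, 128, 2), nn.MaxPool2d(2),
            conv_block(128, 256, 3), nn.MaxPool2d(2),
            conv_block(256, 512, 3))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), conv_block(512, 256, 3),
            nn.Upsample(scale_factor=2), conv_block(256, 128, 2),
            nn.Upsample(scale_factor=2), conv_block(128, 64, 2),
            nn.Conv2d(64, 256, 3, padding=1))                   # 256 grayscale channels

    def forward(self, x):                # x: (batch, in_frames * 3, H, W)
        logits = self.decoder(self.encoder(x))
        return torch.softmax(logits, 1)  # per-pixel distribution over values 0-255
```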

Then, a threshold of 128 is applied to transform the grayscale prediction into a binary heatmap.

Finally, the Hough gradient method (circle detection) is used to filter valid frames by checking whether there is exactly one tennis ball in the frame. If a single ball is detected, its center point is returned for that frame.

This center point is saved and later used to track the location of the ball.
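Put together, the post-processing might look like the OpenCV sketch below (the Hough parameters are illustrative guesses, not the values used in the paper):

```python
import cv2
import numpy as np

def heatmap_to_center(prob):
    """prob: (256, H, W) softmax output. Returns the ball center, or None if
    the frame does not contain exactly one detected ball."""
    gray = prob.argmax(axis=0).astype(np.uint8)  # most likely grayscale value per pixel
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)  # threshold at 128
    circles = cv2.HoughCircles(binary, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                               param1=50, param2=5, minRadius=2, maxRadius=10)
    if circles is not None and circles.shape[1] == 1:  # keep only single-ball frames
        x, y, _ = circles[0, 0]
        return int(x), int(y)
    return None
```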

Loss

H_Q(P) = -\sum_{i,j,k} \Big[ Q(i,j,k) \log P(i,j,k) + \big(1 - Q(i,j,k)\big) \log\big(1 - P(i,j,k)\big) \Big]

Binary cross-entropy, H_Q(P), is used to train the model, where

  • G(i, j) is the Gaussian heatmap
  • Q(i, j, k) is the binary (one-hot) form of the Gaussian heatmap across the 256 depth channels
  • P(i, j, k) is the output right after the softmax layer
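A minimal PyTorch sketch of this loss, assuming Q is the ground-truth heatmap one-hot encoded over the 256 grayscale channels, so both tensors have shape (batch, 256, H, W):

```python
import torch

def tracknet_loss(P, Q, eps=1e-7):
    """Binary cross-entropy between the softmax output P and the
    one-hot ground-truth heatmap Q."""
    P = P.clamp(eps, 1 - eps)  # avoid log(0)
    return -(Q * P.log() + (1 - Q) * (1 - P).log()).sum()
```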

Results and Discussion

The authors tested a total of four models to evaluate the performance of TrackNet:

  1. Archana’s: Single frame input + Image processing technique
  2. TrackNet Model I: Single frame input + CNN
  3. TrackNet Model II: Three consecutive frames input + CNN
  4. TrackNet Model II’: Three consecutive frames input + CNN + pre-trained weights

TrackNet clearly outperforms Archana's algorithm in precision, recall, and F1-measure; even the single-frame Model I achieves 95.7%, 89.6%, and 92.5%, respectively. It is also evident that using three consecutive frames yields better results than using a single frame, which further validates the authors' point that multiple frames give the model more trainable insight into objects moving at high speed.

Positioning Error

Positioning error (PE) is another metric the authors define in order to specify an acceptable prediction error. They describe it as follows:

The mean diameter of a tennis ball in the frames is 5 pixels, and a prediction error within one ball diameter does not mislead the trajectory identification. Thus, the authors set the PE specification at 5 pixels to indicate whether a ball is accurately detected: detections with a PE larger than 5 pixels count as false predictions.

Positioning error is calculated as the Euclidean distance between the model's predicted position and the ground truth.
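In code, the check is a one-liner (the coordinates below are made-up values for illustration):

```python
import math

def positioning_error(pred, truth):
    """Euclidean distance, in pixels, between predicted and true ball centers."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])

# A detection counts as correct only if PE is within the 5-pixel specification.
detected = positioning_error((412, 230), (410, 228)) <= 5.0
```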

As the distribution of positioning errors shows, TrackNet Model II (multiple frames) has a higher probability of making predictions with low positioning error than TrackNet Model I (single frame).

Conclusion

To sum up, TrackNet is a convincing way to use computer vision to track a high-speed moving object. One of its biggest advantages is that it overcomes the problems of blurry and remnant images, and it can even detect occluded balls by learning the ball's trajectory patterns.

Resources

The code and paper are available online for you to explore the tracker and run it on your own videos.

TrackNet code: https://nol.cs.nctu.edu.tw:234/open-source/TrackNet

Paper: https://arxiv.org/abs/1907.03698

References

[1] Yu-Chuan Huang, “TrackNet: Tennis Ball Tracking from Broadcast Video by Deep Learning Networks,” Master Thesis, advised by Tsì-Uí İk and Guan-Hua Huang, National Chiao Tung University, Taiwan, April 2018.

[2] Archana, M. & Geetha, M. (2015). Object Detection and Tracking Based on Trajectory in Broadcast Tennis Video. Procedia Computer Science, 58, 225–232. doi:10.1016/j.procs.2015.08.060.
