Single Object Tracking using the Siamese Family of Trackers - Part 1 - SiamFC

Published in Alegion, Aug 26, 2021

Introduction

Object tracking is an inherently compelling computer vision task in which the challenge is to localize target objects and associate them across consecutive video frames. An initial “seed”, usually a bounding box, is generally provided at the start of the video sequence, and the tracker is expected to follow the regions of interest (ROIs) with a certain level of accuracy.

With the advent of deep learning, especially Convolutional Neural Networks, there has been rapid progress in image classification and, subsequently, object detection. Research in object tracking is also advancing quickly, albeit not at the same pace as object detection, as it benefits from the data-driven approaches and the understanding of localization developed for object detection tasks. There are 2 major tracking challenges:

Single Object Tracking: comprises short-term and long-term tracking challenges, hosted yearly by the VOT challenge
Multi Object Tracking: hosted yearly by the MOT challenge

This first article in the series focuses on Single Object Tracking using the SiamFC tracker, the first in the Siamese family of trackers. Our next posts will take a deep dive into its successors, SiamRPN and SiamRPN++.

Siamese Networks

A Siamese Neural Network is a class of neural network architectures that contain 2 identical subnetworks running in tandem. These parallel subnetworks share the same weights and parameters. Siamese Networks get their name from Siamese (conjoined) twins and generally have a Y-shaped architecture, reflecting their comparative approach.

The unifying theme in all Siamese architectures is this: we have 2 inputs that we wish to compare, so we pass both of them through the same subnetwork to obtain a multi-dimensional embedding of each. The network is then trained with a loss function (such as an L2 loss or a triplet loss) so that the embeddings capture the semantic similarity between the inputs.
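
To make the weight-sharing idea concrete, here is a minimal sketch in plain Python. The names `embed`, `SHARED_WEIGHTS` and `l2_distance` are hypothetical, and the tiny fixed linear map is only a stand-in for a trained CNN branch. The point is that both inputs go through exactly the same function before being compared.

```python
import math

# Hypothetical fixed weights standing in for a trained subnetwork.
# Both branches of the Siamese network use this same matrix.
SHARED_WEIGHTS = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]

def embed(x):
    """Map a 4-dimensional input to a 2-dimensional embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in SHARED_WEIGHTS]

def l2_distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

anchor = [0.9, 0.1, 0.4, 0.7]
similar = [0.88, 0.12, 0.41, 0.69]   # small perturbation of the anchor
different = [-0.9, 0.8, -0.4, 0.1]   # unrelated input

# Same embedding function for every input: that is the "Siamese" part.
d_similar = l2_distance(embed(anchor), embed(similar))
d_different = l2_distance(embed(anchor), embed(different))
print(d_similar < d_different)
```

In a real network the weights would be learned so that this ordering of distances holds for semantically similar and dissimilar inputs, not just for nearby vectors.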

SiamFC- Architecture

Traditionally, object tracking was dominated by kernel-based trackers (such as KCF and mean-shift) and contour-based trackers such as Conditional Density Propagation (Condensation). These algorithms learned their features online and, since most of them were proposed in the pre-deep-learning era, lacked the data to do any offline learning. This compromised localization accuracy, especially under occlusion and changes of camera angle or illumination, and made it hard to predict the state of the ROI even when combined with filtering approaches such as Kalman and particle filters.

The beauty of the Siamese-style tracking approach is that its design directly supports the side-by-side comparison needed to localize an ROI, which is the fundamental idea of tracking:

Given an object and its location in the current frame, find the location of the same object in the next frame.

SiamFC uses 2 identical CNNs to solve a similarity learning problem offline; the learned function is then evaluated online during the inference phase. As shown in the above figure, we are essentially training the network to locate an exemplar image, denoted by z, within the larger search image x. In this example, the red and blue pixels in the score map contain the similarities for the corresponding sub-windows. The FC in SiamFC stands for fully convolutional: the architecture is fully convolutional with respect to the search image, so there is no restriction on the size of the test image.

Mathematically, the central idea is to learn a function 𝑓(z, x) that compares the exemplar image z with a candidate image x and returns a high score if they are similar and a low score otherwise. In other words, we are building a class-agnostic similarity scoring function between 2 image patches. 𝑓(z, x) can be viewed as a composite function 𝕘(𝝋(z), 𝝋(x)), where 𝝋 is the embedding function applied identically to both the exemplar and the candidate, and 𝕘 is a similarity (or distance) metric on the resulting embeddings.
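
As a rough illustration of 𝕘, the sketch below assumes that 𝕘 is a cross-correlation, which is the choice made in the SiamFC paper: the exemplar embedding is slid over the search embedding and a dot product is taken at every position. The feature maps here are small hand-written single-channel grids rather than real CNN outputs, and the function names `cross_correlate` and `argmax_2d` are hypothetical.

```python
def cross_correlate(search, exemplar):
    """Slide `exemplar` over `search` (valid positions only) and return the score map."""
    eh, ew = len(exemplar), len(exemplar[0])
    sh, sw = len(search), len(search[0])
    score_map = []
    for i in range(sh - eh + 1):
        row = []
        for j in range(sw - ew + 1):
            # Dot product between the exemplar and the sub-window at (i, j)
            row.append(sum(
                exemplar[di][dj] * search[i + di][j + dj]
                for di in range(eh)
                for dj in range(ew)
            ))
        score_map.append(row)
    return score_map

def argmax_2d(score_map):
    """Return the (row, col) of the highest score."""
    best = max(
        (value, i, j)
        for i, row in enumerate(score_map)
        for j, value in enumerate(row)
    )
    return best[1], best[2]

# Toy stand-ins for the embeddings of z (exemplar) and x (search image)
exemplar_features = [
    [1, -1],
    [-1, 1],
]
search_features = [
    [0, 0, 0, 0],
    [0, 0, 1, -1],
    [0, 0, -1, 1],
    [0, 0, 0, 0],
]

score_map = cross_correlate(search_features, exemplar_features)
peak = argmax_2d(score_map)  # position of the best-matching sub-window
```

Because the same sliding operation works for a search grid of any size, the score map simply grows with the search image, which is what being fully convolutional with respect to x buys us.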

Training and Testing

The ImageNet 2015 Video dataset was used to train the model. It covers 30 different classes of animals and vehicles, with almost 4,500 videos and more than 1 million annotated frames. During training, the exemplar image size is fixed to 127 x 127 pixels while the candidate (search) image is 255 x 255 pixels. Positive and negative training examples are obtained from carefully curated image pairs (i.e. both images are extracted from the same video and are at most 𝕋 frames apart), and a logistic loss is used as the cost function.
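
The pair-selection constraint can be sketched as follows. This is only an illustration of the "same video, at most 𝕋 frames apart" rule: the function name `sample_training_pair`, the parameter `max_gap` (playing the role of 𝕋) and the uniform sampling strategy are assumptions, not the authors' data pipeline.

```python
import random

def sample_training_pair(num_frames, max_gap, rng):
    """Pick exemplar and search frame indices from one video,
    at most `max_gap` frames apart."""
    exemplar_idx = rng.randrange(num_frames)
    # Clamp the allowed window to the bounds of the video
    low = max(0, exemplar_idx - max_gap)
    high = min(num_frames - 1, exemplar_idx + max_gap)
    search_idx = rng.randint(low, high)  # inclusive on both ends
    return exemplar_idx, search_idx

# Example: a 300-frame video with an assumed gap limit of 100 frames
rng = random.Random(0)
pairs = [sample_training_pair(num_frames=300, max_gap=100, rng=rng) for _ in range(5)]
```

The exemplar crop (127 x 127) would then be taken from the first frame of each pair and the search crop (255 x 255) from the second.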

ℓ(y, v) = log(1 + exp(−yv))

Here v is the score of an exemplar-candidate pair and y is its ground-truth label, either +1 or -1. Stochastic Gradient Descent is then used to minimize this loss and find the best parameters for the model.
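
For readers who prefer code, the logistic loss above can be written in a few lines of Python. The numerically stable rewrite and the helper `score_map_loss` are additions for illustration; averaging the per-position losses over the score map follows the SiamFC paper, but the function names are hypothetical.

```python
import math

def logistic_loss(y, v):
    """log(1 + exp(-y * v)) for a label y in {+1, -1} and a real-valued score v."""
    margin = -y * v
    # log(1 + exp(m)) = max(m, 0) + log(1 + exp(-|m|)) avoids overflow for large |m|
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def score_map_loss(labels, scores):
    """Mean of the per-position logistic losses over a whole score map."""
    losses = [
        logistic_loss(y, v)
        for label_row, score_row in zip(labels, scores)
        for y, v in zip(label_row, score_row)
    ]
    return sum(losses) / len(losses)

# A confident correct score is cheap, a confident wrong score is expensive
loss_correct = logistic_loss(+1, 5.0)
loss_wrong = logistic_loss(-1, 5.0)

# Toy 2 x 2 score map with one positive position
labels = [[+1, -1], [-1, -1]]
scores = [[3.0, -2.0], [-1.5, 0.5]]
mean_loss = score_map_loss(labels, scores)
```

The loss shrinks toward zero when the sign of v agrees with y and grows roughly linearly when it does not, which is what pushes the network toward high scores on the target and low scores elsewhere.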

Conclusion

SiamFC was released in 2016, and its improved version, the SiamFC conv5 baseline, won the VOT-17 real-time challenge; it was the state of the art in 2016–2017. Even so, there were still some fundamental flaws in its design that kept it from being a really stellar object tracking method. This quickly changed, as we will see in the next post, where we discuss SiamRPN.