How Microsoft Does Video Object Detection: Unifying the Best Techniques in Video Object Detection Architectures in a Single Model

James Lee · Nurture.AI · Dec 8, 2017

In the field of deep learning and AI, significant advances have been made towards detecting objects in still images. Video object detection, however, has received much less love, even though it's much more applicable in practical scenarios.

Earlier this year, the research team at Microsoft Research Asia released a paper that unifies the current state-of-the-art techniques in video object detection within a single model. Each of these techniques attempts to overcome a different challenge faced in video object detection.

Object Detection from Images to Video

Using the power of deep Convolutional Neural Networks, object detection in images has consistently achieved high accuracy across various state-of-the-art implementations. While things on the image side of object detection are all fine and dandy, the same cannot be said for object detection in videos.

Conceptually, Video Object Detection can be broken down into 2 steps:

  • Extracting a set of convolutional feature maps over the input via a fully convolutional backbone network. This network is usually pre-trained on the ImageNet classification task and fine-tuned later.
  • Generating detection results on the feature maps by performing region classification and bounding box regression over either sparse object proposals or dense sliding windows, via a multi-branched sub-network (called the detection network). A minimal sketch of this two-step structure is shown below.
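To make the two-step structure concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code): a fully convolutional backbone produces feature maps, and a toy 1x1 convolution stands in for the multi-branched detection head (the real model uses an RPN plus an R-FCN head):

```python
import torch
import torch.nn as nn
import torchvision

# Step 1: a fully convolutional backbone extracts feature maps.
# In practice this is ImageNet-pretrained and fine-tuned later.
resnet = torchvision.models.resnet101()
backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc

# Step 2: a detection sub-network on top of the features. A 1x1 conv
# stands in for the real RPN + R-FCN head: per-location class scores
# and bounding-box deltas.
num_classes = 31  # e.g. ImageNet VID's 30 object classes + background
det_head = nn.Conv2d(2048, num_classes + 4, kernel_size=1)

frame = torch.randn(1, 3, 600, 1000)  # one video frame
feature_maps = backbone(frame)        # (1, 2048, H/32, W/32)
detections = det_head(feature_maps)   # dense per-location predictions
```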

Fundamentally, object detection in videos is much more challenging. A model has to deal with challenges not present in still images, such as the deteriorated or obscured appearance of objects (e.g., motion blur, occlusion). Furthermore, if you were to directly apply an image-based model to every frame of a video, you'd need huge amounts of computational power to process every single frame.

Still image Object Detectors are gonna have a hard time identifying the obscured objects like this ninja cat

Recent works on video object detection have tried different ways of dealing with these challenges. Notably, sparse feature aggregation (paper) exploits the redundancy between consecutive frames to reduce computation and run time. The idea is that consecutive frames usually contain similar information about an object: if we run the expensive feature network only on sparse key frames and propagate those features to the frames in between, we reduce the number of frames that need full processing. Sparse feature aggregation reduces run time, but the propagated features are prone to errors, resulting in a lower mean average precision (mAP). A sketch of this key-frame idea follows below.
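A minimal sketch of the key-frame idea, assuming illustrative `feature_net`, `flow_net`, and `warp` helpers (a flow-guided `warp` is sketched in the Network Architectures section below):

```python
# Sparse feature aggregation sketch: the expensive feature network runs only
# on every k-th (key) frame; in-between frames reuse the key frame's
# features, warped along the estimated optical flow.
KEY_INTERVAL = 10  # illustrative; e.g. every 10th frame

def sparse_features(frames, feature_net, flow_net, warp):
    key_frame, key_feat = None, None
    for i, frame in enumerate(frames):
        if i % KEY_INTERVAL == 0:              # key frame: full computation
            key_frame, key_feat = frame, feature_net(frame)
            yield key_feat
        else:                                   # non-key frame: cheap propagation
            flow = flow_net(key_frame, frame)   # estimated per-pixel motion
            yield warp(key_feat, flow)          # approximate, propagated features
```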

Dense feature aggregation (paper), on the other hand, aims to improve per-frame feature quality and hence detection accuracy. The underlying intuition is that deep features are impaired by deteriorated appearance in individual frames. During inference, the feature network is evaluated on all frames. For any given frame, the feature maps of its neighbouring frames (within a small temporal window) are warped onto it via the estimated flow and aggregated. Dense feature aggregation increases the model's mean average precision; it does, however, slow it down considerably, since the feature network runs on every frame.
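Here is a hedged sketch of that aggregation step; cosine similarity is a simple stand-in for the paper's learned adaptive weights, and the helper names are illustrative as before:

```python
import torch

def dense_features(frames, t, feature_net, flow_net, warp, radius=2):
    # Aggregate warped neighbour features onto frame t (illustrative sketch).
    ref = feature_net(frames[t])
    warped, weights = [], []
    for j in range(max(0, t - radius), min(len(frames), t + radius + 1)):
        flow = flow_net(frames[j], frames[t])
        feat = warp(feature_net(frames[j]), flow)  # feature net on EVERY frame
        # Weight each neighbour by how well its warped features match frame t
        # (the paper learns these weights; cosine similarity is a stand-in).
        sim = torch.cosine_similarity(feat, ref, dim=1).unsqueeze(1)
        warped.append(feat)
        weights.append(sim)
    weights = torch.softmax(torch.stack(weights), dim=0)  # normalize over frames
    return (torch.stack(warped) * weights).sum(dim=0)     # weighted average
```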

While both methods focus on improving different aspects, they share the same principles: motion estimates are incorporated into the network architecture, and end-to-end learning of all modules is performed over multiple frames. They are thus complementary in nature. The key difference between the 2 techniques is that the former evaluates the feature network only on sparse key frames (e.g., every 10th frame), whereas the latter treats all frames as key frames.

Bringing it all together

Visualization of the techniques applied to the model. Image taken from Towards High Performance Video Object Detection

Taking into account the strengths and drawbacks of the 2 baseline techniques above, the model extends them with 3 additional methods:

  • Sparsely Recursive Feature Aggregation
    Evaluates the feature network only on sparse key frames, and recursively aggregates each new key frame's feature maps with the aggregated feature maps carried over from the previous key frame.
  • Spatially-adaptive Partial Feature Updating
    Applies the idea of feature propagation to non-key frames: the flow network additionally estimates how reliable a propagated feature is at each location, and the feature maps are recomputed only where propagation is judged to be poor.
  • Temporally-adaptive Key Frame Scheduling
    Rather than naively picking key frames at a fixed interval, careful consideration is given to which frames should become key frames, so that the selected key frames yield the best features for object detection. The paper uses a heuristic that flags large changes in appearance and motion, and compares it against an oracle scheduling policy that exploits ground-truth information to select the best key frames. A minimal sketch of such a heuristic follows below.
Large movements and motion are not good feature indicators for objects
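Under the assumption (as in the partial-updating idea above) that the flow network also outputs a per-location consistency map, a minimal sketch of such a scheduling heuristic might look like this; the threshold values are made up for illustration:

```python
def is_key_frame(consistency, tau=0.2, gamma=0.25):
    # consistency: per-location scores from the flow network estimating how
    # well features propagated from the last key frame match this frame.
    # If too large a fraction of locations is unreliable (score < tau),
    # the appearance has changed enough to warrant a fresh key frame.
    bad_fraction = (consistency < tau).float().mean()
    return bad_fraction > gamma
```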

Network Architectures

Alongside the aforementioned techniques, 3 types of sub-networks are used:

  • Flow network
    Based on the work in this paper, a flow network estimates motion, i.e. optical flow. In this model it is used to anticipate where features will appear by predicting the movement of pixels across frames. A flow network is a Convolutional Neural Network that solves optical flow estimation as a supervised learning task. A sketch of the flow-guided feature warping it enables follows after this list.
  • Feature network
    A slightly modified ResNet-101. Removing the final average-pooling and fully connected layers leaves a fully convolutional network that outputs feature maps, and the network is fine-tuned on the target objects. This network does the bulk of the feature processing.
  • Detection network
    State-of-the-art Region Proposal Networks and Region-based Fully Convolutional Networks detect objects in frames.
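The glue between the flow network and the feature network is bilinear feature warping: key-frame features are resampled along the estimated flow field. Here is a minimal sketch using PyTorch's grid_sample (my own illustration of the mechanism, not the paper's code), which also serves as the `warp` helper assumed in the earlier sketches:

```python
import torch
import torch.nn.functional as F

def warp(features, flow):
    # features: (N, C, H, W) feature maps from the key frame.
    # flow:     (N, 2, H, W) per-pixel (dx, dy) displacements, in pixels,
    #           as predicted by the flow network (resized to feature scale).
    n, _, h, w = features.shape
    # Base sampling grid: the identity mapping in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize to [-1, 1], the coordinate range grid_sample expects.
    grid[:, 0] = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    # Bilinearly resample the key-frame features at the displaced locations.
    return F.grid_sample(features, grid.permute(0, 2, 3, 1), align_corners=True)
```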

To summarize, Microsoft's approach is to:

  • Figure out the key frames using temporally-adaptive key frame scheduling
  • Recursively aggregate features across the sparse key frames
  • Approximate the quality of each propagated feature as an object indicator and tune the amount of partial feature updating done on each frame, controlling how much feature propagation takes place. A sketch of the full inference loop follows below.
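Putting those pieces together, here is a hedged sketch of the per-frame inference loop, reusing the illustrative helpers from the earlier sketches. Here the flow network is assumed to return both the flow field and the consistency map, and both the aggregation weights and the partial-updating step are simplified (the paper computes them adaptively):

```python
def detect_video(frames, feature_net, flow_net, warp, is_key_frame, detect):
    # Unified inference loop (illustrative sketch, not the paper's code).
    key_frame, key_feat = None, None
    for frame in frames:
        if key_frame is None:                  # the first frame is a key frame
            key_frame, key_feat = frame, feature_net(frame)
            feat = key_feat
        else:
            flow, consistency = flow_net(key_frame, frame)
            feat = warp(key_feat, flow)        # propagate key-frame features
            if is_key_frame(consistency):      # temporally-adaptive scheduling
                fresh = feature_net(frame)
                # Sparsely recursive aggregation (fixed 50/50 blend here;
                # the paper uses learned, position-wise weights).
                feat = 0.5 * feat + 0.5 * fresh
                key_frame, key_feat = frame, feat
            # else: spatially-adaptive partial updating would recompute
            # features only where consistency is low (omitted for brevity).
        yield detect(feat)
```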

The paper reports that these methods manage computational cost efficiently while maintaining accuracy, achieving an mAP of 77.8% at 15.2 fps on an Nvidia Tesla K40 GPU. Compared to the winning entry of the ImageNet VID challenge 2017, Microsoft's model achieves a 1% increase in mAP while only sacrificing 0.2 fps.

Scrutinizing the approaches used by the Microsoft Research Asia team provides good insight into building your own video object detection model. Compared to image object detection, this is definitely an area that could see more practical usage once it achieves a consistent level of performance.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Find this useful? Feel free to smash that clap and check out my other works. 😄

James Lee is an AI Research Fellow at Nurture.Ai. A recent graduate from Monash University in Computer Science, he writes about interesting papers on Artificial Intelligence and Deep Learning. Find him on Twitter at @jamswamsa.
