A Meta-analysis of the DAVIS-2017 Video Object Segmentation Challenge

DAVIS-2017: a multi-instance Video Object Segmentation challenge

In the previous post, Video Object Segmentation — The Basics, we went through the problem definition of video object segmentation, its metrics and its nuances. We then covered the two main approaches that emerged to deal with the DAVIS-2016 video object segmentation dataset: MaskTrack and OSVOS. In this post we’ll see how these algorithms evolved to handle the more challenging DAVIS-2017 dataset.

In terms of accuracy, there was a significant leap in performance in 2017. For reference: OSVOS, the 2016 state of the art, scored a Region Similarity of ~46 on the 2017 challenge (with our best implementation), while this year’s winner achieved an impressive 67.9!

Breaking down the top 9 published works in the DAVIS-2017 challenge

Out of 22 participating teams, here are the top 9 with published results:


Looking at the table, we can already see some trends forming in 2017:

  • All of the leading works are based on either MaskTrack or OSVOS.
  • A MaskTrack-based method won the DAVIS-2017 challenge.
  • Lucid Data Dreaming augmentations are becoming popular (read more below).
  • About half of the works upgraded their base network to ResNet.
  • Almost everyone used some form of temporal component, leveraging the tendency of consecutive video frames to be similar.
  • About half of the works made use of a semantic component, employing semantic segmentation or detection (bounding box) networks in their solution.

Before jumping to any further conclusions, let’s take a closer look at some of the leading works in the following sections:

Lucid Data Dreaming for Object Tracking

An important work from the authors of the original MaskTrack, Lucid Data Dreaming aims to “change the mindset regarding how many training samples and general “objectness” knowledge is required to approach this problem”.

  • The authors generate “in-domain” training data from the DAVIS-2017 dataset and the first-frame annotation of each video. For the per-video fine-tuning, they synthesize 2,500 augmentations from the single annotated frame, representing plausible future video frames.
  • They achieve these extreme augmentations by “cutting-out the foreground object, in-painting the background, perturbing both foreground and background, and finally recomposing the scene. This process is applied twice with randomly sampled transform parameters, resulting in a pair of frames (Iτ−1, Iτ ) with ground-truth pixel-level mask annotations (Mτ−1, Mτ ), optical flow Fτ , and occlusion regions, as the undergoing transformations are known.”
  • On DAVIS-2016, they show that forgoing pretraining on ImageNet hurts their results by a mere 2-5%, so using ONLY the first-frame annotation as training data actually achieves competitive results! I for one was surprised by this result.
  • They use the DeepLabV2 architecture, tweaked to process a larger number of input channels: 3 (RGB) + n (instance masks) + 1 (optical flow) + 1 (semantic segmentation). A separate model is trained for each video.
  • Another benefit of Lucid Data Dreaming is the ability to fine-tune the post-processing CRF per video.
Lucid Data Dreaming pipeline
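The recomposition described above can be sketched in a few lines of numpy. This is a deliberately crude stand-in, and every name and parameter here is illustrative: a mean-colour fill replaces real in-painting, a single random translation replaces the paper’s non-rigid deformations and illumination changes, and the flow/occlusion outputs are omitted entirely.

```python
import numpy as np

def lucid_augment(frame, mask, rng, max_shift=10):
    """Toy Lucid-style augmentation: cut out the foreground,
    'in-paint' the hole (mean background colour as a crude stand-in),
    translate the foreground, and recompose the scene."""
    bg = frame.copy()
    bg[mask] = frame[~mask].mean(axis=0)          # toy in-painting of the hole
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    new_mask = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    shifted = np.roll(np.roll(frame, dy, axis=0), dx, axis=1)
    out = bg.copy()
    out[new_mask] = shifted[new_mask]             # paste foreground back in
    return out, new_mask
```

Running this twice with independently sampled shifts would yield the (Iτ−1, Iτ) pair the paper trains on, with the flow known from the sampled transforms.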

Note: since some of the other DAVIS contestants, including the #1 winners, cite Lucid Data Dreaming, in my opinion this work deserves the honorary first place.

Video Object Segmentation with Re-identification

The winning work of the DAVIS-2017 challenge based itself on MaskTrack and Lucid Data Dreaming for short-term object tracking, while adding a re-identification process that recovers lost instances over the long term.

  • They’ve made several improvements to the MaskTrack method:
    - The base net is now ResNet.
    - Instead of feeding the entire image to the mask-propagation network, they feed it only a cropped, size-normalized patch corresponding to the object’s bounding box, improving tracking performance for small objects.
The improved MaskTrack architecture
  • Re-identification process. For every frame:
    - Detect candidate bounding boxes with Faster R-CNN, then compare them to the known instances to check whether any should be “recovered”, using [Joint detection and identification feature learning for person search] retrained for the general object case.
    - The recovered instance is then propagated forward and backward into the future and past video frames.
The re-identification process, with forward and backward propagation
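The cropping trick for small objects can be illustrated with a minimal numpy sketch. Nearest-neighbour sampling stands in for proper interpolation, and the `margin` context padding, function name and default sizes are my own assumptions, not the paper’s exact procedure:

```python
import numpy as np

def crop_and_normalize(frame, box, out_size=64, margin=0.25):
    """Crop the object's bounding box (plus a context margin) and
    resize it to a fixed resolution with nearest-neighbour sampling.
    `box` is (y0, x0, y1, x1) in pixel coordinates."""
    h, w = frame.shape[:2]
    y0, x0, y1, x1 = box
    my, mx = int((y1 - y0) * margin), int((x1 - x0) * margin)
    y0, x0 = max(0, y0 - my), max(0, x0 - mx)       # expand box, clamped
    y1, x1 = min(h, y1 + my), min(w, x1 + mx)
    patch = frame[y0:y1, x0:x1]
    ys = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    return patch[np.ix_(ys, xs)]                    # size-normalized patch
```

The mask-propagation network then sees the object at a roughly constant scale regardless of how small it is in the full frame.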

Multiple-Instance Video Segmentation with Sequence-Specific Object Proposals

This was the highest-ranked team based on the OSVOS method. Their idea was to generate “object proposals” from various sources and then combine them in a post-processing step.

  • An improved OSVOS is used to find “object proposals”.
    - Added a small improvement to the balanced cross-entropy loss function to better handle convergence for small objects (which appear in the DAVIS-2017 dataset).
    - The online training is performed on Lucid-Dreaming-style augmentations of the first annotated frame.
  • Additional (semantic) proposals are acquired from an instance-aware image segmentation algorithm.
  • “Combinatorial grouping” is used to find connected components among the proposals suggested by OSVOS and to decide which ones to keep or discard according to distance-based criteria.
  • Performs tracking and deals with occlusions using a Semantic Proposal Tracker which is based on these two papers.
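For reference, the balanced cross-entropy that this work starts from (introduced in OSVOS) weights each class by the opposite class’s frequency, so a tiny foreground object is not drowned out by the vastly more numerous background pixels. A numpy sketch of the baseline loss; the team’s exact small-object tweak is not reproduced here:

```python
import numpy as np

def balanced_bce(prob, target, eps=1e-7):
    """Class-balanced cross-entropy: the foreground term is weighted
    by the background fraction (and vice versa). `prob` holds per-pixel
    foreground probabilities, `target` is a binary ground-truth mask."""
    prob = np.clip(prob, eps, 1 - eps)              # numerical safety
    fg = target.astype(bool)
    bg = ~fg
    beta = bg.sum() / target.size                   # fraction of background pixels
    loss_fg = -beta * np.log(prob[fg]).sum()        # up-weighted foreground term
    loss_bg = -(1 - beta) * np.log(1 - prob[bg]).sum()
    return (loss_fg + loss_bg) / target.size
```

With a plain (unbalanced) cross-entropy, a net can score well on a small-object frame by predicting “all background”; the β weighting removes that shortcut.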

Online Adaptation of Convolutional Neural Networks for the 2017 DAVIS Challenge

The only team (so far) to release their code, OnAVOS took the online fine-tuning idea to the next level by segmenting future frames with a model fine-tuned on previously segmented frames.

onAVOS training pipeline
  • They switch the base net to ResNet (based on the “Wider or Deeper” architecture), pre-trained on PASCAL VOC.
  • For the online training, they generate augmentations of the first frame in the Lucid Data Dreaming style.
  • As each frame is segmented, foreground pixels with high confidence predictions are taken as further positive training examples, while pixels far away from the last assumed object position are taken as negative examples. Then an additional round of fine-tuning is performed on the newly acquired data.
  • In other words, the second frame is segmented by fine-tuning on the first frame’s annotation; the third frame is then segmented by training on the second frame’s prediction as well; and the segmentation model keeps being updated with every new frame.
  • The final transfer-learning chain is long and looks like this: 
    base net (ImageNet) → objectness net (PASCAL VOC) → domain-specific net (DAVIS) → test net (video) → online-adapted test net (rolling fine-tune on high-confidence pixels)
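The online-adaptation step boils down to selecting per-pixel training examples from each newly segmented frame. A minimal numpy sketch, where the confidence threshold, the distance, and the crude 4-neighbour dilation are illustrative stand-ins for the paper’s actual hyperparameters and distance computation:

```python
import numpy as np

def select_adaptation_pixels(prob, last_mask, pos_thresh=0.97, dist=5):
    """Pick further training examples OnAVOS-style: pixels the net is
    very confident about become positives, pixels far from the last
    assumed object position become negatives, the rest are ignored.
    `prob` is a foreground-probability map, `last_mask` a boolean mask."""
    positives = prob > pos_thresh
    near = last_mask.copy()
    for _ in range(dist):                 # crude binary dilation, `dist` steps
        near = (near
                | np.roll(near, 1, 0) | np.roll(near, -1, 0)
                | np.roll(near, 1, 1) | np.roll(near, -1, 1))
    negatives = ~near                     # far from the last object position
    return positives, negatives
```

An extra round of fine-tuning on these pixels then updates the model before the next frame is segmented.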

Looking at the code, the authors seem to have paid a lot of attention to implementation details such as supporting low memory GPUs and providing multiple running configurations.


Going over the works of the 2017 video segmentation challenge, we can see some common strategies that have emerged:

  • Combine MaskTrack, which is good at tracking objects in the short term but can sometimes lose the object of interest, with a re-identification mechanism that restarts the tracking.
  • Combine OSVOS, which is good at segmenting the object of interest but has difficulty handling extreme appearance changes, with a rolling fine-tune mechanism that updates the underlying model as the video progresses.
  • Upgrade OSVOS with the addition of an instance separation mechanism, such as a semantic instance segmentation network.
  • Upgrade to a modern base network (e.g. ResNet) and compensate for the small amount of data with extreme in-domain augmentations (e.g. Lucid Data Dreaming).

What’s next?

This year we have seen two main approaches being expanded upon and improved. Next year, will we see completely new architectures?

One promising direction could be the use of memory modules: unlike other video tasks, segmentation is a dense prediction problem, so the common method of using a CNN to extract sparse features from each frame and then feeding them into a separate RNN does not work well. We have seen some interesting papers suggesting convolutional memory modules for unsupervised video segmentation. Will we see them applied to the semi-supervised case as well?

While video object segmentation has come a long way in the last couple of years, it is still far from “solved”. For the technology to become useful in real-life scenarios, great improvements need to be made in accuracy, generality and most importantly — in run-time speed. Several of the leading solutions of the 2017 competition can take hours to segment a single 5 second video, and none claim to do it fast.

We hope that our release of the simpler and more focused GyGO dataset will encourage attempts at more efficient, perhaps even real-time, solutions.

See you at next year’s challenge!

Go to part 1: Video Object Segmentation — The Basics…


The main papers described and analyzed in this post are cited below.

  1. Lucid data dreaming for object tracking.
    A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele.
    arXiv preprint arXiv:1703.09554, 2017.
  2. Video Object Segmentation with Re-identification
    X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu, J. Shi, P. Luo, C. Change Loy, X. Tang, The 2017 DAVIS Challenge on Video Object Segmentation — CVPR Workshops, 2017
  3. Instance Re-Identification Flow for Video Object Segmentation
    T.-N. Le, K.-T. Nguyen, M.-H. Nguyen-Phan, T.-V. Ton, T.-A. Nguyen (2), X.-S. Trinh, Q.-H. Dinh, V.-T. Nguyen, A.-D. Duong, A. Sugimoto, T. V. Nguyen, M.-T. Tran, The 2017 DAVIS Challenge on Video Object Segmentation — CVPR Workshops, 2017
  4. Online Adaptation of Convolutional Neural Networks for Video Object Segmentation
    Paul Voigtlaender and Bastian Leibe: BMVC 2017
  5. The rest of the DAVIS-2017 contestant papers can be found here: