The Basics of Video Object Segmentation

Eddie Smolyansky
7 min read · Sep 12, 2017


Annotated ground truth frames from the DAVIS-2016 video object segmentation dataset

[Update: the article has also been translated to Chinese.]

This is the first in a two-part series about the state of the art in algorithms for Video Object Segmentation. The first part is an introduction to the problem and its “classic” solutions. We will briefly cover:

  1. The problem, the datasets, the challenge
  2. A new dataset that we’re announcing today!
  3. The two main approaches from 2016: MaskTrack and OSVOS. These are the algorithms upon which all other works are based.

In the second part, which is more advanced, I will present a comparison table of all the published approaches to the DAVIS-2017 Video Object Segmentation challenge, summarize and highlight selected works, and point to some emerging trends and directions.

The posts assume familiarity with some concepts in computer vision and deep learning, but are quite accessible. I hope they serve as a good introduction to this computer vision challenge and bring newcomers quickly up to speed.

Introduction

There are three classic tasks related to objects in computer vision: classification, detection and segmentation. While classification aims to answer the “what?”, the goal of the latter two is to also answer the “where?”, and segmentation specifically aims to do it at the pixel level.

Classical computer vision tasks (image from Stanford’s cs231n course slides)

In 2016 we saw semantic segmentation mature and perhaps even begin to saturate existing datasets. Meanwhile, 2017 has been somewhat of a breakout year for video-related tasks: action classification, action (temporal) segmentation, semantic segmentation, etc. In these posts we will focus on Video Object Segmentation.

The problem, the datasets, the challenge

Assuming reader familiarity with semantic segmentation, the task of video object segmentation introduces two key differences:

  • We are segmenting general, NON-semantic objects.
  • A temporal component has been added: our task is to find the pixels corresponding to the object(s) of interest in each consecutive frame of a video.

This can also be thought of as a pixel-level object tracking problem.

Segmentation: sub-divisions of the space. An example dataset is given for each leaf in the graph.

In the video formulation, we can split the problem into two subcategories:

  • Unsupervised (aka video saliency detection): The task is to find and segment the main object in the video. This means the algorithm should decide by itself what the “main” object is.
  • Semi-supervised, which we’ll cover in these posts: given the ground truth segmentation mask of (only) the first frame as input, segment the annotated object in every consecutive frame.

The semi-supervised case can be extended to multi-object segmentation, as can be seen in the DAVIS-2017 challenge.

The main difference between annotations of DAVIS-2016 (left) and DAVIS-2017 (right): multi-instance segmentation


As you can see, DAVIS is a dataset with pixel-perfect ground truth annotations. It aims to recreate real-life video scenarios such as camera shake, background clutter, occlusions and other complexities.

DAVIS-2016 complexity attributes

There are two primary metrics to measure segmentation success:

  • Region Similarity (J): the intersection-over-union between the estimated segmentation mask M and the ground truth mask G.
  • Contour Accuracy (F): interprets the masks as sets of closed contours and computes the contour-based F-measure, a function of contour precision and recall.

Intuitively, Region Similarity measures the number of mislabeled pixels, while Contour Accuracy measures the precision of the segmentation boundaries.
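
To make the two metrics concrete, here is a minimal NumPy/SciPy sketch (my own illustration, not the official DAVIS evaluation code). The official benchmark matches contour points up to a small distance tolerance; this simplified version compares contour pixels exactly, so treat it as intuition rather than a drop-in replacement.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def region_similarity(mask, gt):
    # J: intersection-over-union between the predicted mask and the ground truth.
    mask, gt = mask.astype(bool), gt.astype(bool)
    union = np.logical_or(mask, gt).sum()
    return 1.0 if union == 0 else np.logical_and(mask, gt).sum() / union

def boundary(mask):
    # One-pixel-wide contour: the mask minus its morphological erosion.
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def contour_accuracy(mask, gt):
    # F: F-measure over contour precision and recall. Simplified here to exact
    # pixel overlap of the contours, without the benchmark's distance tolerance.
    bm, bg = boundary(mask), boundary(gt)
    precision = (bm & bg).sum() / max(bm.sum(), 1)
    recall = (bm & bg).sum() / max(bg.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```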

Announcing a new dataset! GyGO: E-commerce Video Object Segmentation by Visualead

GyGO, which we will release in parts over the following weeks, is a dataset focused on a specific, simple use case for video object segmentation: e-commerce. It will consist of about 150 short videos.

https://github.com/ilchemla/gygo-dataset

On the one hand, the sequences are quite simple: they are virtually devoid of occlusions, fast motion and many of the other complexity-inducing attributes mentioned above. On the other hand, the objects in these videos span more varied categories than those in the DAVIS-2016 dataset, in which many of the sequences contain known semantic classes (humans, cars, etc.). The GyGO dataset specializes in smartphone-captured videos, and its frames are relatively sparse (the annotated video runs at ~5 fps).

GyGO E-commerce Video Object Segmentation Dataset: Teaser

We release it publicly with two goals in mind:

  1. There is a severe lack of data in the space of video object segmentation at the moment. With only hundreds of annotated videos available, we believe every contribution has the potential to increase performance. Our analysis shows that joint training on the GyGO and DAVIS datasets yields improved inference results.
  2. To promote a more open, sharing culture and to encourage other researchers to join our efforts :) The DAVIS dataset and the research ecosystem that grew around it have been massively useful for us. We hope the community will find our datasets useful as well.

The two main approaches to DAVIS-2016

With the release of the DAVIS-2016 dataset for single-object video segmentation, two leading approaches emerged: MaskTrack and OSVOS. Looking at the contestants of the DAVIS-2017 challenge, it seems that every single team decided to build their solution on top of one of these two approaches, giving them the status of instant classics. Let's see how they work:

One-Shot Video Object Segmentation (OSVOS)

The concept behind OSVOS is simple yet powerful:

OSVOS training pipeline
  1. Take a network (say, VGG-16) pre-trained for classification, for example on ImageNet.
  2. Convert it to a fully convolutional network, à la FCN, thus preserving spatial information:
    - Remove the fully connected layers at the end.
    - Insert a new loss: pixel-wise sigmoid balanced cross entropy (previously used by HED). Now each pixel is separately classified into foreground or background.
  3. Train the new fully convolutional network on the DAVIS-2016 training set.
  4. One-shot training:
    At inference time, given a new input video to segment and the ground-truth annotation of its first frame (remember, this is a semi-supervised problem), create a new model initialized with the weights trained in step 3 and fine-tune it on that first frame.

The result of this process is a unique, one-time-use model for every new video, overfitted to that specific video based on its first-frame annotation. Because the appearance of the object and the background does not change drastically in most videos, this model produces good results. Naturally, it would work less well on some other, unrelated video sequence.

Note: the OSVOS method segments each frame independently; it makes no use of the temporal information in the video.
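
To make the pipeline above more tangible, here is a minimal PyTorch-style sketch of the one-shot step, assuming the parent network has already been trained on DAVIS-2016. The names (`one_shot_finetune`, the 1x1 prediction head) are my own, and a plain BCE loss stands in for the balanced cross-entropy of the paper, so this is only an illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of an OSVOS-style network (illustrative only): VGG-16 without its FC
# layers is already fully convolutional; a 1x1 conv turns the 512 feature maps
# into a per-pixel foreground/background logit.
backbone = models.vgg16(weights="IMAGENET1K_V1").features  # drop the FC layers
head = nn.Conv2d(512, 1, kernel_size=1)
model = nn.Sequential(backbone, head)

def one_shot_finetune(model, first_frame, first_mask, steps=200, lr=1e-5):
    """Fine-tune the (DAVIS-pretrained) parent network on the first frame only."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()   # unweighted stand-in for the balanced BCE
    for _ in range(steps):
        logits = model(first_frame)    # (1, 1, H/32, W/32)
        logits = nn.functional.interpolate(
            logits, size=first_mask.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss_fn(logits, first_mask)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                       # one-time-use model for this video
```

The fine-tuned model is then simply run on every remaining frame of the video, one frame at a time.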

MaskTrack (Learning Video Object Segmentation from Static Images)

While OSVOS works on each frame of the video independently, MaskTrack also takes the temporal information in the video into consideration (a simplified sketch of the propagation loop follows after the list):

MaskTrack Mask propagation module
  1. For each frame, feed the predicted mask of the previous frame as additional input to the network: the input is now 4 channels (RGB+previous mask). Initialize this process with the ground truth annotation given for the first frame.
  2. The net, originally based on DeepLab VGG-16 (but modular), is trained from scratch on a combination of semantic segmentation and image saliency datasets. The input for the previous mask channel is artificially synthesized by small transformations of the ground truth annotation of each still image.
  3. Add an identical second stream network, based on optical flow input. The model weights are the same as in the RGB stream. Fuse the outputs of both streams by averaging the results.
  4. Online training: use the first-frame ground truth annotation to synthesize additional, video-specific training data.
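
Here is a hedged PyTorch sketch of the mask-propagation idea for the RGB stream only (no optical-flow stream, no online data synthesis). The helpers `expand_input_channels` and `propagate_masks` are hypothetical names of my own, and `seg_net` is assumed to return a single-channel logit map at input resolution.

```python
import torch
import torch.nn as nn

def expand_input_channels(first_conv):
    """Widen a pretrained 3-channel first conv to 4 channels (RGB + previous mask),
    initializing the extra mask channel to zero."""
    new = nn.Conv2d(4, first_conv.out_channels, first_conv.kernel_size,
                    first_conv.stride, first_conv.padding)
    with torch.no_grad():
        new.weight[:, :3] = first_conv.weight
        new.weight[:, 3:] = 0.0
        if first_conv.bias is not None:
            new.bias.copy_(first_conv.bias)
    return new

@torch.no_grad()
def propagate_masks(seg_net, frames, first_mask):
    """Frame t is segmented from (frame t, predicted mask of frame t-1);
    the chain starts from the ground-truth annotation of frame 0."""
    prev_mask, outputs = first_mask, [first_mask]
    for frame in frames[1:]:
        x = torch.cat([frame, prev_mask], dim=1)        # (1, 4, H, W)
        logits = seg_net(x)                             # assumed (1, 1, H, W)
        prev_mask = (torch.sigmoid(logits) > 0.5).float()
        outputs.append(prev_mask)
    return outputs
```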

Note: both of these methods rely on training from still images (as opposed to video, for which annotated datasets are scarce and small).

To summarize, in this introductory post we've established the problem of Video Object Segmentation and the leading methods used to solve it in 2016. Armed with this knowledge, we are ready to tackle the improved, state-of-the-art algorithms proposed in 2017.

Go on to part 2: A Meta-Analysis of DAVIS-2017 Video Object Segmentation Challenge…

P.S. I’d like to say thank you to the wonderful team behind the DAVIS datasets and challenge for all their hard work. Without you none of this would exist.

References

The main papers described and analyzed in this post are cited below.

  1. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation
    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, Computer Vision and Pattern Recognition (CVPR), 2016
  2. The 2017 DAVIS Challenge on Video Object Segmentation
    J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, arXiv:1704.00675, 2017
  3. Learning Video Object Segmentation from Static Images
    F. Perazzi*, A. Khoreva*, R. Benenson, B. Schiele, and A. Sorkine-Hornung, Computer Vision and Pattern Recognition (CVPR), 2017
  4. One-Shot Video Object Segmentation
    S. Caelles*, K.K. Maninis*, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, Computer Vision and Pattern Recognition (CVPR), 2017
