Interactive Segmentation with Convolutional Neural
Networks

Arnaud Bénard
Gifs.com Artificial Intelligence Blog
Dec 19, 2017

Recently, animated stickers have surged in popularity thanks to their widespread use in messaging applications and memes. Still, with existing tools, generating animated stickers is extremely challenging and time-consuming, making the task practically infeasible for non-experts. Removing the background of an arbitrary video (no green screen) is a tedious task that involves manually segmenting the object in every frame of the video.

Example of an animated sticker from a video

At gifs.com, we decided to tackle this problem and help people create animated stickers easily by using AI.

Challenge

Automated animated sticker generation is a challenging problem to solve because of the complex nature of videos: they are subject to motion blur, bad composition, and occlusion. An object can be hard to segment due to its complex structure, small size (very little information) or large similarity between background and foreground. Also, a video clip can contain multiple objects, and we need to make sure the user extracts the object they are interested in.

Preview of user generated stickers

Our solution

First, the user marks the object of interest in the first frame of the video with our interactive object segmentation tool. The result is then propagated to the other frames and rendered as an animated sticker. For segmenting the object, i.e. instance segmentation, we use Computer Vision techniques that can infer the full segmentation from minimal user input.

Example of using the interactive tool to annotate the first frame

Both segmentation steps (first frame and full video) rely on Convolutional Neural Networks, a type of deep learning model. Deep learning is a good fit for our problem because of its recent advances in Computer Vision: Convolutional Neural Networks have shown exceptional performance for image and video recognition. These algorithms are capable of “understanding” the visual concept of an object (animal, car, …) in an image.

Next, we present the two steps of our method in more detail.

Interactive segmentation

A quick way to implement interactive segmentation is to use the GrabCut algorithm. It builds a model of the pixel distribution (colors) and performs well when the background and foreground are distinct, but outputs sub-optimal results when both are similar.
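For reference, a GrabCut baseline only takes a few lines of OpenCV. The sketch below is illustrative, not our production code: the file names and the initial rectangle are placeholders standing in for the user's annotation.

```python
# Minimal GrabCut sketch with OpenCV. Paths and the rectangle are placeholders;
# in practice the initialization would come from the user's annotation.
import cv2
import numpy as np

img = cv2.imread("frame_0.png")                  # hypothetical first frame
mask = np.zeros(img.shape[:2], np.uint8)         # GrabCut label mask
bgd_model = np.zeros((1, 65), np.float64)        # internal GMM state (background)
fgd_model = np.zeros((1, 65), np.float64)        # internal GMM state (foreground)
rect = (50, 50, 400, 300)                        # rough box around the object

cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as definite or probable foreground form the binary mask.
binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
cv2.imwrite("grabcut_mask.png", binary)
```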

Even with multiple user annotations (left), the GrabCut result (right) is not satisfactory because the bear fur is similar to the ground.

To get a beautiful sticker, we need a high-precision segmentation of the first frame. Because we were not satisfied with the GrabCut results, we decided to develop our method based on the latest research in deep learning. Inspired by recent work on interactive object segmentation with deep neural networks, we built a model that takes the image, the current segmentation result, and the user corrections as input and outputs a binary mask of the object.
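As a rough illustration of how such a network can be wired up, here is a small PyTorch sketch. The architecture and the exact channel layout (RGB, current segmentation, positive and negative corrections) are assumptions for the example, not our production model.

```python
# Sketch of an interactive segmentation network (illustrative, not the real model).
# Input: 3 RGB channels + 1 current-segmentation channel + 2 correction channels.
import torch
import torch.nn as nn

class InteractiveSegNet(nn.Module):
    def __init__(self, in_channels=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, 1)  # per-pixel foreground logit

    def forward(self, image, current_mask, pos_corrections, neg_corrections):
        x = torch.cat([image, current_mask, pos_corrections, neg_corrections], dim=1)
        return torch.sigmoid(self.head(self.encoder(x)))

# Example forward pass on a single 256x256 frame.
net = InteractiveSegNet()
image = torch.rand(1, 3, 256, 256)
current_mask = torch.zeros(1, 1, 256, 256)      # empty segmentation on the first pass
pos_corrections = torch.zeros(1, 1, 256, 256)   # user "object" annotations
neg_corrections = torch.zeros(1, 1, 256, 256)   # user "background" annotations
prob = net(image, current_mask, pos_corrections, neg_corrections)  # (1, 1, 256, 256)
```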

We provide a brush tool to the user for correcting the first image of the video. Based on our production data, we have found that users tend to draw with a variety of patterns, such as clicks, strokes or highlighting the whole object. Our algorithm therefore needs to handle this diversity of annotations, so we include simulated clicks and strokes during the training phase to get the best results and give the user a great experience.
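To give a rough idea of what simulating annotations from a ground-truth mask can look like, here is a small sketch; the number of clicks, stroke count and brush radius are illustrative guesses, not the values we actually use.

```python
# Rough sketch of simulating user annotations from a ground-truth mask for training.
import cv2
import numpy as np

def simulate_annotation(gt_mask, kind="clicks", rng=np.random):
    """Return a binary map of simulated foreground annotations."""
    anno = np.zeros_like(gt_mask, dtype=np.uint8)
    ys, xs = np.nonzero(gt_mask)
    if len(ys) == 0:
        return anno
    if kind == "clicks":                      # a few isolated dots inside the object
        for _ in range(rng.randint(1, 6)):
            i = rng.randint(len(ys))
            cv2.circle(anno, (int(xs[i]), int(ys[i])), 5, 1, -1)
    elif kind == "strokes":                   # short segments between object pixels
        for _ in range(rng.randint(1, 4)):
            i, j = rng.randint(len(ys), size=2)
            cv2.line(anno, (int(xs[i]), int(ys[i])), (int(xs[j]), int(ys[j])), 1, 5)
    else:                                     # "highlight": cover most of the object
        anno = cv2.erode(gt_mask.astype(np.uint8), np.ones((15, 15), np.uint8))
    return anno
```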

Examples of the 3 annotation types: clicks, strokes and highlights

Video segmentation

After annotating one frame and successfully segmenting the object, we use a deep learning model based on the OSVOS paper to generate the segmentation in the other frames. OSVOS (One-Shot Video Object Segmentation) is a convolutional neural network (based on VGG) that uses generic semantic information to segment objects. For each sticker, the model is fine-tuned on frame/mask pairs. We then infer the masks for all the frames in the video and combine the results to output an animated sticker with a transparent background.
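To make the fine-tune-then-propagate idea concrete, here is a simplified PyTorch-style sketch. `base_net` stands in for a pretrained VGG-based segmentation network; the optimizer, learning rate, loss and number of steps are illustrative and not the settings from the OSVOS paper.

```python
# Simplified sketch of OSVOS-style one-shot fine-tuning and propagation.
# Frames are (3, H, W) tensors, masks are (1, H, W) tensors in [0, 1].
import torch
import torch.nn.functional as F

def finetune_and_segment(base_net, annotated_pairs, all_frames, steps=200):
    """annotated_pairs: list of (frame, mask) tensors; all_frames: list of frames."""
    net = base_net
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-5)
    net.train()
    for _ in range(steps):                          # fine-tune on the annotated frame(s)
        for frame, mask in annotated_pairs:
            logits = net(frame.unsqueeze(0))
            loss = F.binary_cross_entropy_with_logits(logits, mask.unsqueeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    net.eval()
    with torch.no_grad():                           # propagate to every frame of the clip
        return [torch.sigmoid(net(f.unsqueeze(0))) > 0.5 for f in all_frames]
```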

If the object is fast-moving or changes a lot towards the end of the video, the results can vary. Thus, we allow the user to refine more frames in the video to improve the quality of the sticker.

Adding user corrections helps the model improve the segmentation
If the result from annotating one frame is unsatisfactory, the sticker can be improved by providing corrections to the worst frame(s). Sticker generated with one annotated frame (left) and with multiple annotated frames (right).
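Reusing the hypothetical `finetune_and_segment` sketch from above, refining more frames simply means extending the set of frame/mask pairs used for fine-tuning; the variable names here are illustrative.

```python
# Hypothetical usage: the first frame plus a corrected later frame both go into
# the fine-tuning set before re-segmenting the whole clip.
annotated_pairs = [(frames[0], first_mask), (frames[42], corrected_mask)]
masks = finetune_and_segment(base_net, annotated_pairs, frames)
```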

Conclusion

We have shown how to make animated sticker creation practical by using deep learning techniques. First, the user annotates the first frame of their video; then the segmentation gets propagated to all frames using OSVOS; and finally, we let the user further refine their animated sticker if needed. By significantly simplifying the creation process, our editor makes sticker creation easy and allows even non-experts to create awesome stickers.

Hundreds of stickers are created every week; go try it out!

Overview of our solution

Acknowledgments

Building this project was a great experience, and I would like to thank everyone on the gifs team, Michael Gygli for his work on deep interactive segmentation and editing, Montana Flynn for making the editor easy to use, and the folks at the ETH CV lab for their work on OSVOS.

Links

https://beta.gifs.com/sticker-creator/

arXiv paper
