Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural
Networks

Michael Gygli
Gifs.com Artificial Intelligence Blog
Sep 19, 2017

Shot boundary detection (SBD) is an important component of video analysis: it is used in many video applications, such as automatic highlight detection, action recognition and assisted manual video editing. As such, it’s something our team at gifs.com cares deeply about.

The goal of SBD is to split an edited video into shots, i.e. sequences of consecutive frames that each show a continuous progression of video, as shown in this illustration:

Shot boundary detection makes it possible to split a video into a set of shots

Unfortunately, existing shot detection methods have limited speed and accuracy. They often fail to detect slow or subtle transitions and are misled by strong visual changes, which they falsely label as shot changes. We show some typical failure cases below.

A typical failure case of existing methods: missing a dissolve transition
A challenging example with a camera flash: there are strong visual changes, but no shot change. Previous methods often falsely label this as a shot change.

As shot detection is a core component of many of our products, we set out to improve SBD in both accuracy and speed.

Today, I am happy to present the results of our efforts: our new method for ridiculously fast shot detection with fully convolutional neural networks. It is significantly more accurate than previous methods, while running at 120x real-time speed. Thus, we can analyze a full-length movie in less than a minute. Pretty nice, isn’t it?

Let’s see how it works.

Approach

Given that existing methods are not accurate enough, we turn to deep learning for salvation. Deep learning is a field of artificial intelligence that has shown strong performance in analyzing visual data, and is therefore well suited to our problem of understanding videos. However, deep neural networks require large amounts of training data to work well, and they are computationally expensive when applied to video data.

Two core ideas allowed us to make deep learning work for our task:
1. All shot boundaries are created manually, by taking raw video, shortening it and combining video shots with some type of transition.
Thus, the same process can be automated: take raw video, split it (randomly) into parts and recombine these parts with various transitions such as cuts, dissolves or fades. This makes it cheap to generate a large-scale dataset.

2. Understanding whether a shot transition occurs at a particular frame requires looking at the context around it. In our work, we look at a context of 10 frames to decide if a shot change happens. To do this efficiently, we design a spatio-temporal deep network that is fully convolutional in time. This allows us to take context into account without repeatedly analyzing the same frames. The idea of fully convolutional nets has previously been used for other applications, such as semantic segmentation.
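
To make the first idea more concrete, here is a minimal sketch of how labelled transitions could be synthesized from raw clips. It is an illustration only, not our actual data pipeline: the function names `make_cut` and `make_dissolve`, the linear blend, and the choice to label every blended frame as part of the transition are all assumptions made for this example. Clips are assumed to be NumPy arrays of shape (frames, height, width, channels).

```python
import numpy as np


def make_cut(clip_a, clip_b):
    """Join two clips with a hard cut; the label marks the first frame of clip_b."""
    frames = np.concatenate([clip_a, clip_b], axis=0)
    labels = np.zeros(len(frames), dtype=bool)
    labels[len(clip_a)] = True
    return frames, labels


def make_dissolve(clip_a, clip_b, length):
    """Join two clips with a linear cross-dissolve spanning `length` frames.

    Assumes 0 < length <= len(clip_a) and length <= len(clip_b).
    """
    # Blend weights rise from 0 (only clip_a) to 1 (only clip_b), endpoints excluded.
    alpha = np.linspace(0.0, 1.0, length + 2)[1:-1].reshape(-1, 1, 1, 1)
    tail, head = clip_a[-length:], clip_b[:length]
    blended = (1.0 - alpha) * tail + alpha * head
    frames = np.concatenate([clip_a[:-length], blended, clip_b[length:]], axis=0)
    # Hypothetical labelling choice: every blended frame counts as "in transition".
    labels = np.zeros(len(frames), dtype=bool)
    start = len(clip_a) - length
    labels[start:start + length] = True
    return frames, labels


# Toy stand-ins for two pieces of raw footage: 12 dark and 12 bright frames.
clip_a = np.zeros((12, 4, 4, 3))
clip_b = np.ones((12, 4, 4, 3))

cut_frames, cut_labels = make_cut(clip_a, clip_b)
dis_frames, dis_labels = make_dissolve(clip_a, clip_b, length=4)
```

Because the transition is generated by the code, its position is known exactly, so the labels come for free. Other transitions, such as fades, could be added in the same way.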

With these two ideas combined, we can train a deep neural network that can do fast and accurate shot detection. We tested our network on a standard shot detection dataset and obtained 88% accuracy, while the previously best method obtained 84%. The error thus drops from 16% to 12%, which is a 25% error reduction. Also, our method runs at 120x real-time (on GPU), while the baseline method only achieves 7.7x.
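
Part of the speed comes from being convolutional in time, and a toy example may help to see why. The sketch below is not our network: `frame_feature` is a hypothetical stand-in for an expensive per-frame computation, and a single linear filter over 10 frames stands in for the temporal part of the model. It compares scoring every 10-frame window separately with processing each frame once and then sliding the filter over the result.

```python
import numpy as np

CONTEXT = 10  # frames of temporal context, as described above


def frame_feature(frame):
    """Hypothetical stand-in for an expensive per-frame computation."""
    frame_feature.calls += 1
    return frame.mean()


frame_feature.calls = 0


def score_sliding_window(video, weights):
    """Naive approach: re-process all CONTEXT frames for every window position."""
    scores = []
    for t in range(len(video) - CONTEXT + 1):
        feats = np.array([frame_feature(f) for f in video[t:t + CONTEXT]])
        scores.append(feats @ weights)
    return np.array(scores)


def score_convolutional(video, weights):
    """Convolutional in time: process each frame once, then slide the filter."""
    feats = np.array([frame_feature(f) for f in video])
    # np.correlate slides `weights` over `feats` without flipping the filter.
    return np.correlate(feats, weights, mode="valid")


rng = np.random.default_rng(0)
video = rng.random((100, 8, 8, 3))       # 100 random toy frames
weights = rng.standard_normal(CONTEXT)   # toy temporal filter

frame_feature.calls = 0
naive = score_sliding_window(video, weights)
naive_calls = frame_feature.calls

frame_feature.calls = 0
shared = score_convolutional(video, weights)
shared_calls = frame_feature.calls

print(np.allclose(naive, shared), naive_calls, shared_calls)
```

Both functions produce the same scores, but for 100 frames the naive version calls `frame_feature` 910 times and the convolutional version only 100 times. In a real network the shared work is intermediate feature maps rather than a single number per frame, but the principle is the same.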

Remember the surfing video, where the dissolve was missed? Our new method detects it, which allows us to correctly split the video into its parts.

Our method correctly detects the dissolve transition, which makes it possible to extract the individual shots from the example above

Failure cases

Partial scene change, which is not considered a shot change, as the change is in the background

While we have made significant improvements, our method is not perfect yet. For example, when only a part of the scene changes, the method often fails, as in the example above. We are still improving our method to handle cases like this as well.

Conclusion

We have presented an accurate and extremely fast method for shot boundary detection. The method works particularly well at detecting standard transitions such as cuts, dissolves, fades and wipes.

If you are interested in reading more technical details, please check our arXiv paper.
Our new shot detection is already powering several products at gifs.com, such as our AI-based Gif creator, now giving you more relevant results that focus on what is most interesting. We are also providing a public API via Algorithmia, a machine-intelligence API marketplace. You can find it here.

Michael Gygli is Head of AI @ gifs.com. Previously: PhD student @ ETH Zurich, Intern at Google Brain and Yahoo Labs