Leveraging AI and Deep Learning for Video Summarization

Divya Jain
May 16 · 5 min read

The global video market is taking center stage. According to Forbes, more than 500 million hours of video are watched on YouTube every day, and Google adds that almost 50 percent of internet users look for videos related to a product or service before visiting a store.

Many such statistics show how video content is growing and will remain a mainstream means of sharing information. We are already seeing a shift from copy and text to snapshot stories and visual posts (e.g. Instagram). Artificial intelligence (AI) is playing a large role in this shift to video: we can use AI to improve video quality through stabilization, to understand and classify content for editing purposes, or to better deliver and target content.

AI is also playing a key role in video summarization, the process of shortening a video by selecting the keyframes or segments that capture its main points. Summarization has many use cases, one of the most significant being the ability to gauge interest in content. A flashcard summary can influence how many people will actually watch an entire video, and even a single thumbnail plays a crucial role in determining how many people will click on a video to play it. Beyond driving clicks, video summarization also enables efficient viewing of the material and adaptation of video length for different platforms, like Instagram and Facebook.

Recently, there have been many advances in using deep learning for image processing; AI's ability to understand an image's context has rapidly improved in accuracy. Similar techniques can be used to understand videos too, but this is a much more complex process. A video is not just a large collection of frames or images; it is multi-dimensional, including audio, motion, and a time-series dimension. Each of these dimensions is key to understanding a video, and depending on what the summarization is targeting, different dimensions can be crucial.

The anatomy of AI video summarization

Video summarization can be categorized into two broad areas of machine learning: supervised and unsupervised. Supervised summarization entails learning patterns from previously annotated videos and examples. This works very well for videos where a pattern exists, like sporting events: we can annotate some sequences and learn from them. However, the biggest challenge with supervised learning is the labeled data itself. Creating these well-defined datasets is costly, labeling requires domain knowledge, and the approach does not scale well to the wide variety of content present on the web.
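To make the supervised setup concrete, here is a minimal sketch of frame-importance regression, assuming each video has already been preprocessed into per-frame feature vectors (e.g. from a CNN) and paired with human-annotated importance scores. The class name, dimensions, and training data are all illustrative, not a production recipe:

```python
# A minimal sketch of supervised frame scoring. All names, shapes,
# and data here are illustrative stand-ins.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Predicts an importance score in [0, 1] for each frame feature."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, frames):          # frames: (num_frames, feat_dim)
        return self.net(frames).squeeze(-1)

model = FrameScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

features = torch.randn(300, 1024)       # stand-in for real CNN features
labels = torch.rand(300)                # stand-in for human annotations
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```

The expensive part is not this model but the labels: every training video needs frame-level annotations, which is exactly the cost discussed above.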

The other form of summarization is unsupervised, where a smaller number of frames is selected from the original video through change detection. Low-level features such as color, motion, and texture are commonly used to build histograms and clusters that group similar frames within a video; a few frames are then selected for the summary based on how much of the original video's information they convey. These techniques work best when the video has distinct visual content, for example a video shot across the different days of a vacation. However, the resulting summaries often lack context and come across as disjointed images.
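A minimal sketch of this classic approach, assuming OpenCV and scikit-learn are available: compute a color histogram per sampled frame, cluster the histograms, and keep the frame nearest each cluster center as a keyframe. The file path, sampling step, and cluster count are illustrative:

```python
# Unsupervised keyframe selection via color histograms + k-means.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def color_histogram(frame, bins=16):
    """Normalized HSV color histogram as a feature vector."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [bins] * 3,
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def select_keyframes(video_path, num_keyframes=5, step=10):
    cap = cv2.VideoCapture(video_path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:             # subsample to keep clustering cheap
            feats.append(color_histogram(frame))
        idx += 1
    cap.release()
    feats = np.array(feats)

    # Cluster similar frames; the frame nearest each centroid becomes a keyframe.
    km = KMeans(n_clusters=num_keyframes, n_init=10).fit(feats)
    keyframes = []
    for c in range(num_keyframes):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[dists.argmin()]) * step)
    return sorted(keyframes)            # frame indices in the original video
```

Note what this sketch cannot see: it only compares pixel statistics, so two visually different but narratively connected frames look unrelated to it. That is the "disjointed images" problem mentioned above.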

Recent deep learning techniques look very promising in addressing the above-mentioned challenges; they lend themselves to much more effective creation of video summaries. While supervised deep learning techniques popularized the process, unsupervised techniques such as generative adversarial networks (GANs) and reinforcement learning are showing great promise, offering advantages that are making them frontrunners in video summarization.

The power of emerging unsupervised deep learning techniques in video summarization

For videos that don’t adhere to any pattern and are completely different from each other, GANs work very nicely. GANs consist of two neural nets: a generator that tries to mimic the real data and a discriminator that tries to tell whether the generated data is real or fake. This adversarial setup helps GANs learn the data distribution very effectively and produce data that is very difficult to distinguish from the original dataset. In this case, each video can be treated as a dataset, with the GAN selecting a subset of frames that is most representative of the given video. This generates unique summaries while preserving the context and meaning of the videos themselves. Marketers can use this technique to create shorter versions of full-length ads or campaigns tailored to different devices and audiences, and creative artists can use it to give a preview of their upcoming releases.
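A heavily simplified sketch of the adversarial idea, in the spirit of published work such as SUM-GAN (which uses recurrent encoders and decoders), assuming pre-extracted per-frame features. The pooling, shapes, and training data here are illustrative simplifications, not any specific system:

```python
# Adversarial frame scoring, radically simplified for illustration.
import torch
import torch.nn as nn

feat_dim = 1024

# Generator: scores frames, then builds a score-weighted "summary"
# representation; good scores keep it close to the original video.
scorer = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                       nn.Linear(256, 1), nn.Sigmoid())
# Discriminator: judges whether a pooled representation came from the
# full video or from a summary-based reconstruction.
discriminator = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCELoss()

frames = torch.randn(300, feat_dim)     # stand-in for real frame features
for _ in range(100):
    scores = scorer(frames)                            # (300, 1) weights
    summary = (scores * frames).mean(0, keepdim=True)  # pooled summary
    original = frames.mean(0, keepdim=True)            # pooled original

    # Discriminator step: original -> real (1), summary -> fake (0).
    d_opt.zero_grad()
    d_loss = bce(discriminator(original), torch.ones(1, 1)) + \
             bce(discriminator(summary.detach()), torch.zeros(1, 1))
    d_loss.backward()
    d_opt.step()

    # Generator step: make the summary indistinguishable from the original.
    g_opt.zero_grad()
    g_loss = bce(discriminator(summary), torch.ones(1, 1))
    g_loss.backward()
    g_opt.step()
```

The key property is that no labels are needed: the discriminator's inability to tell summary from original is itself the training signal, which is why this approach suits videos with no shared structure.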

For videos that have a common structure, like sporting events, reinforcement learning is more effective than supervised learning because it does not require labeled data. Here, the neural nets learn which frames to choose based on a reward function; for example, they can learn from previous summaries whether certain frames were watched or skipped. Reward functions can also be defined in ways that require no prior information, such as rewarding frame diversity and representativeness or frame category classification. Campaign managers can employ such techniques to create more watchable and memorable summaries from past experience and engage their customers more effectively.
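As a concrete illustration of rewards that need no prior information, here is a minimal sketch of the diversity and representativeness rewards used in reward-driven summarization work such as DR-DSN; the feature shapes and selection are illustrative stand-ins:

```python
# Label-free rewards for a frame-selection policy (illustrative sketch).
import numpy as np

def diversity_reward(feats, selected):
    """Mean pairwise cosine dissimilarity among the selected frames."""
    sel = feats[selected]
    sel = sel / np.linalg.norm(sel, axis=1, keepdims=True)
    n = len(selected)
    if n < 2:
        return 0.0
    dissim = 1.0 - sel @ sel.T          # diagonal is zero, so the sum is safe
    return float(dissim.sum() / (n * (n - 1)))

def representativeness_reward(feats, selected):
    """exp(-mean distance from every frame to its nearest selected frame):
    higher when the selection covers the whole video well."""
    dists = np.linalg.norm(
        feats[:, None, :] - feats[selected][None, :, :], axis=2)
    return float(np.exp(-dists.min(axis=1).mean()))

# Usage with stand-in data: the policy is rewarded for selections that
# are both varied and representative of the full video.
feats = np.random.rand(300, 1024)
selected = np.random.choice(300, 15, replace=False)
reward = diversity_reward(feats, selected) + representativeness_reward(feats, selected)
```

Because both rewards are computed from the frames themselves, the policy can be trained on unlabeled video, and engagement signals (watched vs. skipped frames) can be layered on top where they exist.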

At Adobe MAX 2018, we previewed a proof-of-concept demo of unsupervised deep learning technology in action, through our Video Ad AI presentation during Adobe Sneaks. We took the example of a 60-second TV spot that needed to be rolled out across social media platforms. The video is first auto-tagged so it can be benchmarked against similar content, and Adobe Sensei, our AI and machine learning technology, then assesses how that content performed in the past. This leads to a set of recommendations to improve the ad's effectiveness score (e.g. changing the length and flow of the video), which are sent to video advertisers in Adobe Premiere Pro, along with the original ad, for editing. Advertisers can adjust the ad to retain creative elements while aligning with marketing objectives to deliver high-performing content. Check it out in the video below.

These new unsupervised techniques are just the start of a new era in deep learning technology when it comes to video summarization. Many advances will be made in the near future to create and optimize the best summaries based on the audience, delivery medium, and intent of summarization. Our Adobe Sensei team is working with Adobe Research to bring these techniques to the Sensei framework. Together with efforts across the industry, we’ll make video summarization highly scalable, reliable, and incredibly efficient.

Portions of this article were originally published on ClickZ.

Follow the Adobe Tech Blog for more developer stories and resources, and check out Adobe Developers on Twitter for the latest news and developer products.
