Developing ML to Automate Thematic Video Editing 🎞

By Govin Vatsan

Aug 14

Imagine you wanted to automate thematic video editing: take a bunch of diverse video clips, select a theme (like “music video”), and get an output video that stylistically looks like a music video. During my summer internship at TRASH, I researched machine learning techniques that would let you do exactly this!

Let’s consider a simpler version of the problem. Given a set of input video clips and some target editing style (e.g. music video, action movie, horror movie, etc.), can we pick and choose some of those clips to make the resulting video look as close to the editing style as possible? And if we had those clips, what order would we put them in? Would we repeat any? Which ones would we discard?

So for example, a target editing style could be “action movie trailers”, the input could be a set of video clips from various action movies, and the goal would be to produce a new video that a human would classify as an action movie trailer. In order to get this working, the model has to first learn to recognize important characteristics of the target editing style and then must select and order video clips matching that style 🤖🧠
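To make the problem setup concrete, here is a minimal sketch. It assumes each clip has already been reduced to a small feature vector and that the target style is given as a hand-written template of per-slot features; the clip names, the two features, and the `assemble_edit` helper are all hypothetical. It uses plain nearest-neighbor matching rather than a learned model, so it only illustrates the selecting, ordering, repeating, and discarding of clips.

```python
from math import dist


def assemble_edit(clips, style_template):
    """Pick one clip per slot of the style template.

    clips: dict mapping a clip name to a feature vector
        (hypothetical features, here [motion, brightness]).
    style_template: list of target feature vectors, one per output slot.
    Returns the ordered list of chosen clip names; clips may repeat.
    """
    edit = []
    for target in style_template:
        # Choose the clip whose features are closest to this slot's target.
        best = min(clips, key=lambda name: dist(clips[name], target))
        edit.append(best)
    return edit


# Hypothetical input clips and their made-up features.
clips = {
    "explosion": [0.9, 0.8],
    "car_chase": [0.8, 0.5],
    "dialogue": [0.1, 0.4],
    "sunset": [0.2, 0.9],
}

# Hypothetical "action movie trailer" template with four output slots.
action_template = [[0.8, 0.5], [0.9, 0.8], [0.8, 0.5], [0.9, 0.9]]

edit = assemble_edit(clips, action_template)
unused = sorted(set(clips) - set(edit))
print(edit)    # ordered selection, with repeats
print(unused)  # clips that were discarded
```

In this toy run, two clips are reused and two are never picked. A learned model would have to discover both the template and the features from example videos, which is the hard part.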

At TRASH, I researched two lines of attack on this problem: supervised and unsupervised learning.

Left: supervised learning, where we want to learn how to separate objects from different classes. Right: unsupervised learning, where we need to learn to group similar objects together. Source: Deep Learning for Image Recognition: why it’s challenging, where we’ve been, and what’s next

Supervised 🎯

In supervised learning, we have a target output for each input to our machine learning model. For example, each set of input video clips comes paired with the target output sequence we want the model to produce. By training a deep neural network to reproduce these targets, we hope it will learn the distinct properties of each target editing style and generalize to new, unseen video sequences.
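One common way to score how well a model reproduces a target sequence is cross-entropy. The sketch below assumes the model outputs, for every slot of the output video, a probability distribution over the input clips; that output format, the `sequence_cross_entropy` helper, and the numbers are illustrative assumptions, not details of the network described here.

```python
from math import log


def sequence_cross_entropy(predicted, target):
    """Average cross-entropy between predicted clip choices and a target edit.

    predicted: one probability distribution over input clips per output slot.
    target: index of the clip that the target sequence places in each slot.
    """
    total = 0.0
    for probs, clip_index in zip(predicted, target):
        # Penalize low probability on the clip the target sequence uses.
        total += -log(probs[clip_index])
    return total / len(target)


# Three input clips, two output slots.
# Hypothetical target edit: clip 2 first, then clip 0.
target_edit = [2, 0]

# A model that mostly agrees with the target, and one that guesses uniformly.
confident = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

loss_confident = sequence_cross_entropy(confident, target_edit)
loss_uniform = sequence_cross_entropy(uniform, target_edit)
print(loss_confident, loss_uniform)
```

The loss is lower for the model that agrees with the target edit. Training adjusts the network's weights to push this number down across many example videos.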

Unsupervised 🔍

In unsupervised learning, we have no target output for our input data. To get around this limitation, I used a GAN (Generative Adversarial Network). A GAN is a pair of competing neural networks that together allow for the creation of synthetic data. The first network, the Generator, is given input video clips and must produce an output video sequence matching a target editing style. The second network, the Discriminator, must determine which video sequences are real and which are fake. For example, if the target class is music videos, then the Discriminator would get both the Generator’s music video and a clip from an actual music video to compare. The better the Generator gets, the harder it becomes to tell real and fake videos apart!
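The competition between the two networks can be written as a pair of loss functions. The sketch below uses the standard GAN losses with the commonly used non-saturating Generator loss. It assumes the Discriminator outputs a probability between 0 and 1 that a video is real; the scores are made up for illustration and the exact losses used in this project are not stated here.

```python
from math import log


def discriminator_loss(real_scores, fake_scores):
    """Binary cross-entropy: real videos should score near 1, fakes near 0."""
    real_term = sum(-log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(-log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Non-saturating loss: low when the Discriminator rates fakes as real."""
    return sum(-log(s) for s in fake_scores) / len(fake_scores)


# Hypothetical Discriminator scores (probability that a video is real).
real_scores = [0.9, 0.8]   # actual music videos
early_fakes = [0.1, 0.2]   # Generator output early in training
later_fakes = [0.5, 0.6]   # Generator output after it has improved

print(generator_loss(early_fakes), generator_loss(later_fakes))
print(discriminator_loss(real_scores, early_fakes),
      discriminator_loss(real_scores, later_fakes))
```

As the fakes score higher, the Generator's loss falls and the Discriminator's loss rises, which is the tug-of-war described above.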

Model overview: input videos and an editing style are passed through two neural networks, an encoder and a decoder. The decoder produces a new video output, conditioned on the editing style.

Why Is This Important? 🧐

At TRASH, we want to automate video editing. Developing machine learning algorithms that can recognize and piece together different types of videos is just one step towards this ultimate goal. We want a user to be able to input a bunch of different videos they took and get a polished video as an output, but this is a challenging problem: it requires understanding the components of both videos and video editing. My work focuses on the editing part. If we can use machine learning to make sense of the narrative structure of video editing, then hopefully that goes a long way toward supporting our overall goals. It can also allow us to make funny videos like this one 🐐

Highlight reel of goat screams made by an Unsupervised Model

– Govin, Gen (& Team TRASH)



