Just Add Trunk Shots? The Subtle Art of Editing with Machines (& How To Look Like Tarantino)
Tarantino’s trunk shot, John Woo’s doves, Wes Anderson’s symmetry. Some directors have such signature motifs, they’re almost cliché. If you want to pay homage to one of these directors, TRASH can help in an instant.
But what about all the other things that go into making a Tarantino film? The long takes, the 360° shots, the god’s eye perspective, the bold primary colors, and black and white sequences? One of the most laborious things about making a movie is all the tiny decisions that add up to something greater than the sum of their parts.
Trying to make something in the style of another artist is one of the best ways to learn a craft. From art school to meme culture, this is step one.
At TRASH, we’ve been teaching our AI about all these little moments that add up to make something bigger — like your own video in the style of Tarantino and other iconic directors. (Quentin, if you’re reading this, we’re just trying to learn from one of the best!)
There are three key steps in this process:
- Training: teach our AI the concepts of film
- Reading: analyze your footage to find the right moments
- Writing: generate a rough cut based on that analysis
Step one: finding the important bits in footage and teaching a neural network to recognize those things. We have a large repository of film footage and user-generated content. We use a process called active learning to label our internal dataset with all of the cinematic concepts we think are useful for video summarization and editing.
Then we use our enormous labeled dataset to train a super-fast network that does three things: attends to the most interesting parts of footage, uses the video’s audio to inform that decision, and predicts the content of the video.
Step two: your video has three key streams of information — we look for the content (concepts) and the story (voxels), and we listen to the audio. We use the snappy video/audio attention network we trained in the previous step to identify which voxels (regions of the video) are regions of interest. Then we decide what’s in those voxels of interest — for example, whether they contain a ‘wide shot’ or a ‘close-up’, a ‘selfie’ or a ‘rainy, urban’ landscape.
Step three: assembling the set of interesting cuts into a story sequence. In our app, we use multiple techniques for deciding what order to put the best cuts in (let us know if you’d like another blog on our set-to-sequence networks!).
In this post, we are trying to mimic the story architecture of a trailer for Quentin Tarantino’s Pulp Fiction. To do this, we use our voxel-region-proposal and content prediction network from the last step to characterize how this trailer was put together. Then we use a greedy approach to satisfying the constraints the original trailer laid out for us.
Keep reading to learn about the training, reading, and writing steps to video creation. We also show a shot-by-shot comparison of the original trailer and four new trailers we made from the footage of three other famous directors and our community! 📽
Step one: Understand the content
Here are some simple examples of the cinema concepts we train our neural nets on so that we can identify this type of content in your videos.
There are no off-the-shelf detectors for the types of concepts we need for video editing, so we created our own datasets and neural networks. To get our own cinema concepts dataset — one that included concepts and footage shot on mobile — we compiled a large set of unlabeled videos from academic datasets and our early alpha testers. This was just a pile of unlabeled footage, no way of knowing what was in it!
Next, we recruited some film experts who could tell us what to look for when editing a cinematic sequence. They helped us make an ontology of concepts, such as ‘wide shot’, ‘two shot’, ‘symmetry’, etc. Combined with ontologies for describing places and people previously compiled by our CTO Geneviève, we ended up with a vocabulary for describing visual stories that contained hundreds of concepts.
Once we knew what we were looking for, we set to work finding those things in our big pile o’ film. With the help of a few film student interns, we used the process in the figure below to bootstrap our labeled dataset.
First, one of our experts would find a few examples of a concept (here: ‘trunk shot’). Then through an iterative process called active learning, we built a progressively more confident classifier for that concept. At each step of the iteration, our film students would confirm or reject the hypothesis of the bootstrapped classifier. Finally, when the classifier passed a test on some held out footage, it was applied to our whole unlabeled dataset to give us our final labels.
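The loop above can be sketched in a few lines. This is a minimal illustration of uncertainty-based active learning, not our real pipeline — the clip embeddings, the ‘trunk shot’ labels, and the labeling oracle below are all synthetic stand-ins:

```python
# Minimal active-learning loop for bootstrapping a single-concept classifier
# (e.g. 'trunk shot'). Synthetic data stands in for real footage; ground-truth
# labels stand in for the film students confirming/rejecting hypotheses.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "clip embeddings": positives cluster around +1, negatives around -1.
X = np.concatenate([rng.normal(+1, 1.0, (500, 8)), rng.normal(-1, 1.0, (500, 8))])
y_true = np.array([1] * 500 + [0] * 500)  # hidden ground truth (the oracle)

# An expert seeds the process with a few examples of each class.
seed_pos = rng.choice(500, 5, replace=False)
seed_neg = 500 + rng.choice(500, 5, replace=False)
labeled = list(seed_pos) + list(seed_neg)

for round_ in range(5):
    clf = LogisticRegression().fit(X[labeled], y_true[labeled])
    # Uncertainty sampling: query the clips the model is least sure about
    # (predicted probability closest to 0.5).
    probs = clf.predict_proba(X)[:, 1]
    uncertainty = np.abs(probs - 0.5)
    uncertainty[labeled] = np.inf           # never re-query labeled clips
    queries = np.argsort(uncertainty)[:20]  # 20 clips per annotation round
    labeled.extend(queries)                 # "annotators" confirm/reject

accuracy = clf.score(X, y_true)  # final classifier applied to the whole pile
```

Each round, the model’s most ambiguous clips get human labels, so annotation effort goes where it helps most — which is the whole point of the bootstrapping loop.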
Once we had our final labeled video dataset, we were able to train a network efficient enough to run on a phone that could predict all of our concepts at the same time in new footage (see the results at the top of this section).
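As a sketch of what “predict all of our concepts at the same time” means in practice: it’s multi-label classification, with one shared input and an independent yes/no score per concept. The concept names and data below are illustrative stand-ins; the real network is of course a much larger model tuned to run on a phone:

```python
# Multi-label concept prediction sketch: one input, a score per concept.
# Concepts and "clip embeddings" here are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(1)
concepts = ["wide_shot", "close_up", "trunk_shot", "symmetry"]

X = rng.normal(size=(400, 16))            # stand-in clip embeddings
W = rng.normal(size=(16, len(concepts)))  # hidden "true" concept directions
Y = (X @ W > 0.5).astype(int)             # multi-hot labels: a clip can be
                                          # both a 'wide_shot' and 'symmetry'

# One binary classifier per concept, trained jointly over the dataset.
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = model.predict(X)                   # (n_clips, n_concepts) 0/1 matrix
per_label_acc = (pred == Y).mean()
```

The key property is that labels are not mutually exclusive — a sigmoid-per-concept head (here, one logistic model per concept) lets every clip light up several concepts at once.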
Step two: Try to find the story
Like the script doctors of Hollywood, we’re hard at work in your video trying to find a story. Teasing out a story is one of the most difficult things a human can do, so it’s not surprising that it’s even more difficult (and of course way more rudimentary) for a machine.
Understanding why certain moments are funny or have dramatic tension — for example Mia Wallace and Vincent Vega’s expressionless dance number in Pulp Fiction — is still an open challenge.
However, if there is one thing that both humans and machines are great at, it’s patterns. The same way a huge proportion of the world’s hit songs are built upon the same four chords, Vonnegut famously (and hilariously) pointed out how classic stories share the same “shape”.
Our AI looks for story patterns (made up of the cinematic concepts from before) in the short films (or internet videos) we use for training. We attempt to mimic those patterns in the TRASH app, and in the video examples below!
First, our in-app neural network (NN) takes a pass at guessing the regions-of-interest in raw input footage. For this experiment, we took trailers from three other directors (John Woo, Wes Anderson, and Sofia Coppola) and used the cuts in those trailers as our voxels of interest. (We use optical flow to determine cut boundaries. Let us know if you’d like us to release a Jupyter notebook for doing that!)
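We use optical flow for this in production; as a dependency-free sketch of the same idea, here’s the simpler frame-difference variant — flag a cut wherever inter-frame change spikes far above its typical level. The synthetic “video” below is just flat frames with two hard cuts:

```python
# Shot-cut detection sketch via frame differencing (a stand-in for the
# optical-flow approach we actually use).
import numpy as np

def find_cut_boundaries(frames, z_thresh=3.0):
    """Return indices of frames where inter-frame change spikes well
    above its typical level."""
    frames = np.asarray(frames, dtype=float)
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))  # mean pixel change
    mu, sigma = diffs.mean(), diffs.std()
    return [i + 1 for i, d in enumerate(diffs) if d > mu + z_thresh * sigma]

# Synthetic "video": three shots of flat gray frames plus sensor noise,
# with hard cuts at frames 30 and 50.
rng = np.random.default_rng(2)
video = np.concatenate([
    np.full((30, 8, 8), 50.0),   # shot 1
    np.full((20, 8, 8), 200.0),  # shot 2
    np.full((25, 8, 8), 120.0),  # shot 3
]) + rng.normal(0, 1.0, (75, 8, 8))

cuts = find_cut_boundaries(video)  # boundaries of the three shots
```

Real footage needs optical flow (or a learned detector) because camera motion within a shot also produces large frame differences; the spike-over-baseline logic is the shared core.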
For each video-region, we project that voxel into our concept embedding space. In the figure below we show the part of our embedding space that has to do with concepts that describe the characters in our stories — people, animals, food. We project our high-dimensional embedding space into 2D for visibility. Each dot represents a video snippet in our training set.
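That projection step can be sketched with PCA, one common choice for this kind of 2D map (t-SNE and UMAP are popular alternatives — the figure doesn’t depend on which you pick). The 64-dimensional “concept embeddings” and their three clusters below are random stand-ins:

```python
# Projecting a high-dimensional embedding space down to 2D for plotting.
# Embeddings are synthetic: three clusters standing in for concept groups
# like "people", "animals", and "food".
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
centers = rng.normal(0, 5, (3, 64))                # three cluster centers in 64-D
embeddings = np.concatenate(
    [c + rng.normal(0, 1, (100, 64)) for c in centers]
)

xy = PCA(n_components=2).fit_transform(embeddings)  # one (x, y) dot per snippet
```

Each row of `xy` is one dot in the figure; nearby dots in 2D were (approximately) nearby in the full embedding space.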
We identify story patterns by treating the embedding space like a map and learning typical trajectories through the space. For example, film sequences often start out ‘wide shot’, ‘medium shot’, ‘close-up’. Tarantino often uses a ‘close up, angry face’, ‘close up, scared face’, ‘close up, angry face’, ‘dangerous event, wide shot’ sequence.
We train neural networks to learn the most likely sequences of our concepts and order our output videos according to the right pattern given the content of the raw footage. In this experiment, we already had a pattern that we wanted to follow, so we greedily selected the closest matching shots and shot sequences from our input footage to line up with the Pulp Fiction trailer.
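The greedy selection can be sketched as follows: for each cut in the reference (Pulp Fiction) trailer, pick the closest not-yet-used shot from the source footage in concept-embedding space. The embeddings below are random stand-ins for real shot descriptors:

```python
# Greedy shot matching sketch: line up source shots with a reference
# trailer's cut pattern, nearest-embedding-first.
import numpy as np

def greedy_match(reference, source):
    """For each reference cut, return the index of the closest unused
    source shot (Euclidean distance in embedding space)."""
    used, order = set(), []
    for ref in reference:
        dists = np.linalg.norm(source - ref, axis=1)
        for i in used:
            dists[i] = np.inf            # each source shot used at most once
        pick = int(np.argmin(dists))
        used.add(pick)
        order.append(pick)
    return order

rng = np.random.default_rng(4)
reference = rng.normal(size=(6, 8))   # 6 cuts in the reference trailer
source = rng.normal(size=(20, 8))     # 20 candidate shots in raw footage

sequence = greedy_match(reference, source)  # source shot per reference cut
```

Greedy matching is fast and simple but locally optimal: an early pick can “use up” a shot a later cut needed more, which is one reason we also train set-to-sequence networks for the general case.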
Step three: Composing an homage
Of course, it’s not Tarantino! But it can get you quite a long way toward his style with one tap, and without having to know what goes into all those little editing decisions. In fact, all you might need is a trunk shot 😉 This helps you learn, participate in the medium of video, and become a better creator yourself!
In this shot-by-shot comparison, you can see the AI selecting Tarantino-esque elements like the two shot, looking upwards, and black title cards (with red and yellow text) from the footage of the other directors. In the first row, we see an early shot of Jules Winnfield leaning menacingly over someone who may not be long for this world.
Our AI is able to find a shot from a similar angle in the John Woo footage and in our user-generated content. In the Anderson and Coppola footage, it does the best it can because neither of those directors had intimidation scenes in their sampled trailers.
In the last row, something similar happens. The AI finds a wide shot in Woo’s footage and in the UGC but has less luck with the other directors. Interestingly, in the second row the AI matches ‘driving’ from Tarantino and Woo (albeit a boat and not a car), but matches the ‘two shot’ aspect in the other trailers.
Note: these shots weren’t cherry-picked. We tried to space them out evenly throughout the trailer and diversify shot type.
To see how well the AI approximated Tarantino, check out the trailers! Note that our AI tried to match content and sequence, but the original cuts from the three source-material directors didn’t always match the length of the corresponding cuts in the Pulp Fiction trailer. This means our homage trailers vary in length! In our UGC-footage video, we had more control over cut length, but the original footage was often very short (shot on a phone!), so our AI did its best there too.
Want to take it for a spin? Download TRASH to see your own video clips look instantly more pro!
– Gen, Han & Team TRASH 😇