A.I. For Filmmaking

Recognising Cinematic Shot Types with a ResNet-50

Rahul Somani
21 min read · Sep 11, 2019
Screenshot from Kill Bill Vol. 1. The left side of the image shows a heatmap of the model’s activations.

Originally published at https://rsomani95.github.io. Visit the link for a better-formatted, interactive version of the post with many more images.

GitHub: https://github.com/rsomani95/shot-type-classifier

Table of Contents

-What is Visual Language, and Why Does it Matter?

-Neural Networks 101 (Read if you don’t know what neural networks are)

-The Dataset

— — Data Sources

— — Shot Types

-Methodology

-Results

— — Training Performance

— — Confusion Matrix

— — Heatmaps (Highlight of the post)

— — Robustness

-Conclusion

Analysing cinema is a time-consuming process. In the cinematography domain alone, there are many factors to consider, such as shot scale, shot composition, camera movement, color, lighting, etc. Whatever you shoot is in some way influenced by what you've watched. There's only so much one can watch, and even less that one can analyse thoroughly.

This is where neural networks offer ample promise. They can recognise patterns in images in ways that weren't possible until less than a decade ago, offering a tremendous speed-up in analysing cinema.

I've developed a neural network that focuses on one fundamental element of visual grammar: shot types. It's capable of recognising 6 unique shot types, and is ~91% accurate. The pretrained model, the validation dataset (the set of images used to determine its accuracy), the code used to train the network, and some more code to classify your own images are all freely available here.

What is Visual Language, and Why Does it Matter?

When you’re writing something — an email, an essay, a report, a paper, etc, you’re using the rules of grammar to put forth your point. Your choice of words, the way you construct the sentence, correct use of punctuation, and most importantly, what you have to say, all contribute towards the effectiveness of your message.

Cinema is about how ideas and emotions are expressed through a visual form. It's a visual language, and just like in any written language, your choice of words (what you put in the shot/frame), the way you construct the sentence (the sequence of shots), correct use of punctuation (editing & continuity), and what you have to say (the story) are key factors in creating effective cinema. The comparison doesn't apply rigidly, but it is a good starting point for thinking about cinema as a language.

The most basic element of this language is a shot. There are many factors to consider when filming a shot: how big should the subject be? Should the camera be placed above or below the subject? How long should the shot be? Should the camera remain still or move with the subject, and if it moves, should it follow the subject or observe it from a fixed point while turning left/right or up/down? Should the movement be smooth or jerky? There are other major visual factors, such as color and lighting, but we'll restrict our scope to these. A filmmaker chooses how to construct a shot based on what he or she wants to convey, and then juxtaposes the shots effectively to drive home the message.

Let's consider this scene from Interstellar. To give you some context, a crew of two researchers and a pilot land on a mysterious planet to collect crucial data from the debris of a previous mission. This planet is very different from Earth: it is covered in an endless ocean, and its gravity is 130% of Earth's.

This scene consists of 89 shots, and the average length of each shot is 2.66 seconds.

For almost all the shots showing Cooper (Matthew McConaughey) inside the spacecraft, Nolan uses a Medium Close Up, showing Cooper from the chest up. This allows us to see his facial expressions clearly, as well as a bit of the spacecraft he’s in and his upper body movements. Notice how the camera isn’t 100% stable. The camera moves slightly according to Cooper’s movements, making us feel more involved in this scene.

A Long Shot shows the character in their entirety, along with a large portion of the surrounding area. In this scene, Long Shots are used when showing characters moving around the space.

An Extreme Wide Shot puts the location of the scene in perspective. Characters occupy almost no space, and the emphasis is purely on the location. Note that this shot is also much longer than the others, allowing the grandness of the location to soak in.

These are the main kinds of shots used in this scene. There are a few more types of cinematic shots that will be covered later.

Let’s move on to camera movement. Throughout this scene, the camera is almost never stationary. Whether it’s the spaceship getting hit by the wave, a slow walk across the ocean, or desperate running from one point to another, the camera moves in near perfect sync with the characters. This is what really makes you feel the tension like you’re in the scene.

A pan is when the camera stays at the same point and turns from left to right, or right to left. With the camera fixed at one point, you experience the shot as though you were standing there and watching CASE (the robot) move back and forth. It’s difficult to articulate why, but this seems more appropriate than, say, if the camera weren’t watching CASE move but instead moving along with it.

A tilt up is used appropriately to reveal the wave from Dr. Brand's (Anne Hathaway) perspective. As the camera moves up to reveal the height of the wave, the gravity of the situation builds up (no pun intended). This is one of the longest shots in the scene, clocking in at 7.5 seconds. Do you think it would be as impactful if the camera were stationary and placed such that you could see the wave in its entirety?

The decisions behind the different elements of a shot, such as shot scale (Long Shot, Wide Shot, etc.), camera movement, camera angles, length of the shot, shot composition, color, and lighting, are based on the message that the filmmaker wishes to convey. These shots are then juxtaposed meaningfully to convey a coherent visual story.

The analysis above is far from comprehensive, but it (hopefully) sheds some light on how intricate creating effective cinema can be. The curious reader may want to dig deeper and look at how other factors, such as composition, editing, and color, impact visual storytelling.

Breaking down this scene took a few hours of work. At the risk of being repetitive, this is where neural networks offer ample promise. With smart algorithms finding patterns like the ones shown above in a matter of seconds, your frame of reference would no longer be restricted to what you or your colleagues have watched, but could extend to all of cinema itself.

Neural Networks 101

‘AI’ is most often a buzzword for deep learning, the field that uses neural networks to learn from data.

The key idea is that instead of explicitly specifying the patterns to look for, you specify the rules by which the neural network detects patterns from data on its own. The data could be something structured, like a database of customers' purchasing decisions, or something unstructured, like images, audio clips, medical scans, or video. Neural networks are good at tasks like predicting a customer's desired products, telling apart images of dogs and cats, distinguishing the mating calls of dolphins from those of whales, recognising a video of a goal being scored vs. the goalkeeper saving the day, or determining whether a tumor is benign or malignant.

With a large enough labelled dataset (say 1,000 images of dogs and cats stored separately), you could use a neural network to learn patterns from these images. The network puts the image through a pile of computation and spits out two probabilities: P(cat) and P(dog). You calculate how wrong the network was using a loss function, then use calculus (the chain rule) to tweak this pile of computation to produce a lower loss (a more correct output). Neural networks are nothing but a sophisticated mechanism for optimising this function.
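To make that concrete, here is a minimal PyTorch sketch of a single training step for the dogs-vs-cats example. It is an illustration only, not the code used in this project; the tiny linear "model" stands in for the pile of computation.

```python
import torch
import torch.nn as nn

# A stand-in for the "pile of computation": 3x224x224 image in, 2 scores out.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
loss_fn = nn.CrossEntropyLoss()                       # the loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

image = torch.rand(1, 3, 224, 224)                    # a dummy image
label = torch.tensor([0])                             # 0 = cat, 1 = dog

logits = model(image)                                 # forward pass
loss = loss_fn(logits, label)                         # how wrong was it?
loss.backward()                                       # chain rule: compute gradients
optimizer.step()                                      # tweak weights to lower the loss
optimizer.zero_grad()

probs = torch.softmax(logits, dim=1)                  # P(cat), P(dog)
```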

If the network's output is far off from the truth, the loss is larger, and so the tweak made is also larger. Tweaks that are too large are bad, so you multiply the tweaking factor by a tiny number known as the learning rate. One pass through the entire dataset is known as an epoch. You'd probably run through many epochs to reach a good solution; it's a good idea to tweak the images non-invasively (such as flipping them horizontally), so that the network sees different numbers for the same image and can more robustly detect patterns. This is known as data augmentation.
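For illustration, a torchvision transform pipeline applying such non-invasive tweaks might look like the sketch below. This project itself uses fastai's transforms; this is just a generic example of the idea.

```python
from torchvision import transforms

# Non-invasive augmentation: the label stays valid, but the network sees
# slightly different pixel values for the same image every epoch.
train_tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),     # flip half the images left-right
    transforms.ColorJitter(brightness=0.1),     # small lighting variations
    transforms.ToTensor(),
])
```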

Neural networks can transfer knowledge from one project to another. It's very common to take a network that's been trained on ImageNet (over a million images spanning a thousand common object categories) and then tweak it to adapt to your project. This works because the network has already learnt basic visual concepts like curves, edges, textures, eyes, etc., which come in handy for any visual task. This process is known as transfer learning.
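In PyTorch terms, transfer learning boils down to something like the following sketch. The 6-class shot-type head here is illustrative; the actual training used fastai, which handles this setup for you.

```python
import torch.nn as nn
from torchvision import models

# Start from a ResNet-50 that has already learnt ImageNet's visual concepts.
backbone = models.resnet50(pretrained=True)

# Freeze the body so the general-purpose features are kept intact at first.
for param in backbone.parameters():
    param.requires_grad = False

# Swap the 1000-class ImageNet head for a 6-class shot-type head.
backbone.fc = nn.Linear(backbone.fc.in_features, 6)
```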

Rinse and repeat this process carefully, and you have in your hands an ‘AI’ solution to your problem (Of course, this isn’t all that deep learning is about. What I’ve described here is supervised learning. There are several other sub-fields of deep learning that are more nuanced and cutting-edge, such as unsupervised learning, reinforcement learning, and generative modelling, to name a few).

If that piqued your interest, I suggest you watch this (~19 mins) for a fairly detailed explanation of how a neural network works. If you're bursting with excitement, follow through with this course.

The Dataset

There is no public dataset that categorises cinematic shots into shot types. There is one prior research project that classified shot types using neural networks, but the dataset, although massive (~400,000 images), only had 3 output classes (shot types) from 120 films.

Thus, the dataset for this project was constructed from scratch. It is diverse, consisting of samples from over 800 movies, collected from various sources. Each image has been looked over 4–5 times to ensure that it has been categorised correctly. Since shot types have room for subjectivity, it's important to note that there is no Full Shot as described here. Instead, there is a Long Shot, which is essentially a Full Shot with some leeway for wideness, as described here and evident in the samples shown below.

In total, the dataset consists of 6,105 (5,505 training + 600 validation) images, split into 6 shot types:

In a future version, I will be adding two additional shot types: Wide Shot and Medium Long Shot.

Data Sources

  • 71 films (30–60 images per film) from the film-grab database
  • 700+ films (3–6 images per film) from RARBG
  • 298 Extreme Wide Shot images from Pexels
  • Randomly picked images from relevant google image searches, such as this
  • 552 Extreme Close Ups from Jacob T. Swinney’s video essays on Tarantino, Aronofsky, and Fincher.

Shot Types

Extreme Wide Shot

An Extreme Wide Shot (EWS) emphasises the vastness of the location. When there is a subject, it usually occupies a very small part of the frame.

Extreme Wide Shots

Long Shot

A Long Shot (LS) includes characters in their entirety, and a large portion of the surrounding area.

Long Shots

Medium Shot

A Medium Shot (MS) shows the character from the waist up. It allows one to see nuances of the character’s body language, and to some degree the facial expressions.

Medium Shots

Medium Close Up

A Medium Close Up (MCU) shows the character from the chest/shoulders up. It allows one to see nuances of the character’s facial expressions, and some upper-body language.

Medium Close Ups

Close Up

A Close Up (CU) shows the face of the character, sometimes including the neck/shoulders, and emphasises the character’s facial expressions.

Close Ups

Extreme Close Up

An Extreme Close Up (ECU) zooms in tightly on a single feature of the subject to draw attention to that feature specifically.

Extreme Close Ups

Wide Shot

A Wide Shot falls between an Extreme Wide Shot and Long Shot. The emphasis is on the physical space that the characters are in.

Wide Shots

Medium Long Shot

A Medium Long Shot falls between a Long Shot and a Medium Shot. Characters are usually shown from their knees up, or from the waist up with a large portion of the background visible.

Medium Long Shots

Methodology

The training process followed fastai’s methodology (transfer learning, data augmentation, the one-cycle policy, the learning rate finder, etc.). It is well documented in this Jupyter notebook.

While doing this project, I thought up and implemented a new kind of data augmentation called rgb_randomize that is now part of the fastai library.
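Roughly, the idea is to replace one of the image’s colour channels with random noise, so the network can’t over-rely on any single channel. Here is a minimal sketch of the concept (not fastai’s actual implementation; the function name is just illustrative):

```python
import torch

def rgb_randomize_sketch(img: torch.Tensor, thresh: float = 0.3) -> torch.Tensor:
    """Pick one of the R/G/B channels and replace it with random noise
    scaled by `thresh`. A sketch of the idea, not the library code."""
    img = img.clone()
    channel = torch.randint(0, img.shape[0], (1,)).item()   # img is (C, H, W)
    img[channel] = torch.rand_like(img[channel]) * thresh
    return img
```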

The model architecture is a ResNet-50 pretrained on ImageNet. The only part of the training process that I haven't seen used extensively, though it is mentioned in this lecture, is progressive image resizing (perhaps cyclical transfer learning is a better term). This is a process where you take a pretrained model (almost always a ResNet-xx pretrained on ImageNet), fine-tune it on your dataset, and then repeat the process with a larger image size. Here's a visual explanation:

This diagram of a ResNet-50 isn't 100% accurate — it doesn't include skip connections. In a future post, I will explain how a ResNet works with a more complete version of these diagrams.

At each stage, the network is first trained while frozen with a learning rate of ~1e-3, and then unfrozen and fine-tuned with learning rates of slice(1e-6, 1e-3).
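Here is a condensed sketch of one such stage using the fastai v1 API. Paths and hyperparameters are placeholders; the notebook linked above contains the actual training code.

```python
from fastai.vision import *

# One stage of progressive resizing. `path` is a placeholder pointing at an
# ImageNet-style folder with one sub-folder per shot type.
path = Path('data/shot-types')
data = (ImageDataBunch.from_folder(path, valid='valid', size=128,
                                   ds_tfms=get_transforms(), bs=32)
        .normalize(imagenet_stats))

learn = cnn_learner(data, models.resnet50, metrics=accuracy)

learn.fit_one_cycle(5, 1e-3)                # frozen: train only the new head
learn.unfreeze()
learn.fit_one_cycle(5, slice(1e-6, 1e-3))   # fine-tune the whole network

# Next stage: rebuild the DataBunch at a larger size (e.g. 224),
# re-freeze, and repeat the frozen/unfrozen cycle.
```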

Note that you cannot replicate the training process as the training data has not been made public yet (it’s still a work in progress). However, you can use my pretrained model to predict shot types on your own images, or to re-evaluate the validation set.
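Running the pretrained model on your own images looks roughly like this with fastai v1. File and folder names below are placeholders; see the repository’s README for the exact paths and export file.

```python
from fastai.vision import load_learner, open_image

# Placeholder paths -- check the repo's README for the real ones.
learn = load_learner('path/to/model/folder')     # loads the exported .pkl
img = open_image('my_film_still.jpg')

pred_class, pred_idx, probs = learn.predict(img)
print(pred_class, probs)
```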

Results

Training Performance

When training the network, the dataset is split into the training set (5,505 images) and the validation set (600 images). The network learns from the training set, and evaluates its accuracy on the validation set. As explained in Neural Networks 101 earlier, the network learns by optimising a loss function. That loss function is the training loss in this table.

You could theoretically reach a training loss of 0, but this would mean that the network hasn't learnt patterns from your dataset, but has memorised it instead. To ensure optimal performance, the metric you keep an eye on is the validation loss. The network has never seen the images in the validation set, so as long as the validation loss keeps going lower, you're going in the right direction. The accuracy is simply the percentage of correctly predicted images.

This table is extracted from the methodology notebook and shows the last stage of training — Stage 3.2.

The network peaks at an accuracy of 91.0% on the validation set, which is a great result. On a larger validation set, each wrongly classified image would account for a smaller drop in accuracy, so the accuracy figure would be more stable and gains from a lower validation loss would be more palpable; the opposite can be seen in this table due to the small size of the validation set.

Confusion Matrix

It’s clear that the model performs best for Extreme Wide Shots (EWS), Medium Close Ups (MCU), Close Ups (CU), and Extreme Close Ups (ECU). It struggles a bit when detecting Long Shots (LS), often confusing them with Medium Shots (MS).

Heatmaps

Heatmaps represent the activations of the neural network. In layman’s terms, they show the parts of the image that caused the network to detect its predicted shot type. On the actual website, these images are displayed with sliders, not side by side. I highly recommend you view them there for a better experience.
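One way to produce such heatmaps is to capture the activations of the last convolutional block with a forward hook, average them across channels, and upsample the result to the frame size. The sketch below shows that approach in PyTorch (not necessarily the exact method used in the notebook):

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(pretrained=True).eval()
activations = {}

def save_activations(module, inp, out):
    activations['last_conv'] = out.detach()

model.layer4.register_forward_hook(save_activations)    # last conv block

img = torch.rand(1, 3, 224, 224)                        # placeholder image tensor
with torch.no_grad():
    model(img)

fmap = activations['last_conv']                         # shape (1, 2048, 7, 7)
heatmap = fmap.mean(dim=1, keepdim=True)                # average over channels
heatmap = F.interpolate(heatmap, size=(224, 224),       # upsample to overlay on frame
                        mode='bilinear', align_corners=False)
```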

Extreme Wide Shots

The network has learnt to recognise patterns on both the ground and the sky. The foreground is an important factor, as is the human figure when detected. It’s interesting to see the network’s behavior in low lighting conditions.

Long Shot

Presence of the entire leg and/or parts of the human body is a determining factor. The network can detect objects in the background. Predictions remain stable in low-lighting and are resilient to different camera angles and human pose.

Medium Shot

The cleanest predictions are based on the presence of a human face and its spatial relationship with the rest of the frame. In lower lighting conditions, it uses other parts of the body and/or elements in the background. It isn’t restricted to the presence of a face.

Medium Close Up

The face is clearly the defining factor. The network focuses in on a more specific part of the face when compared to a Medium Shot. When the typical human face isn’t in the image, it uses other elements in the frame.

Close Up

Here, the face is even more clearly the defining factor. The network focuses in on an even more specific part of the face when compared to a Medium Close Up. It uses other elements in the frame when the face is obscured, the camera angle is uncommon, or the lighting is low.

Extreme Close Up

This is harder to interpret due to the diversity of these shots. The network does a good job regardless; it scored 95/100 on ECUs.

Robustness

It’s important to take these results with a pinch of salt. The validation set, while diverse, is not comprehensive. For all shots besides an Extreme Wide Shot or an Extreme Close Up, most of the training samples were images that had humans in them. Though this is usually the case in movies, it isn’t always so.

I’ve also refrained from including non-standard compositions in the training and validation sets for two reasons: they represent a minority of shots, and they are difficult to classify. That being said, the model performs well on the most commonly found shot types across most cinema.

Left: Is this a Close Up? The model is ~80% sure it’s an MCU and ~20% sure that it’s a CU. Right: Is this a Long Shot? The model is baffled. It predicts ~18–22% for each shot class.

The model might be biased against “foreign” (non-Hollywood) cinema, as it was trained on mostly Hollywood movie stills. However, the heatmaps above show that it is able to detect fundamental features that are seen across all cinema. The model hasn’t been tested thoroughly enough to make a definitive statement regarding this.

Conclusion

Neural networks burst into popularity in the past decade with the development of large datasets and the ability to leverage GPUs (graphics cards) for the heavy computation demanded by neural nets.

Rapid advances on the technical side open up opportunities to solve novel problems like the one presented in this post. Shot scale recognition is one of many possible applications to film. It’s possible to recognise camera movement with 3D CNNs (convolutional neural networks); the only missing piece is the dataset. Camera angles could be detected with the same methodology as this project, but the dataset for it doesn’t exist either. Cut detection, i.e. detecting the transition from one shot to the next, has been worked on extensively and can be adapted to film.

Frederic Brodbeck’s Cinemetrics and Barry Salt’s Database are two projects worth looking into for quantitative film analysis.

When created on a large scale, datasets like this can give us invaluable insights for creating effective shot sequences. As stated earlier, they’d also immensely broaden our scope of reference. What’s most exciting about this is that it’s all software-driven, making it widely accessible and easy to scale up.

The code and the model are publicly available here. I hope you enjoyed reading this as much as I did writing it. If you’d like to share your thoughts, you can leave a comment below, or reach me via email.

Follow the Conversation on Twitter

References

Visual Language

Previous Work

  • Shot Scale Analysis in Movies by Convolutional Neural Networks. Mattia Savardi, Alberto Signoroni, Pierangelo Migliorati, Sergio Benini. 25th IEEE International Conference on Image Processing (ICIP), 2018. This paper was the first to classify shot types using neural networks. The authors built a massive dataset (~400,000 images) to classify 3 shot types: Long Shots (EWS, WS, LS), Medium Shots (MLS, MS, MCU), and Close Shots (CU, ECU), and achieved an overall accuracy of ~94%.

Technical

Other Related Work

  • Barry Salt’s Database. One of the earliest quantitative databases on film data, built manually. Consists of three databases: Camera movement, shot scale and average shot length.
  • Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh. arXiv 2018. An excellent paper that extended the ResNet architecture to the video domain. They released multiple models, and the best ones come close to/ are better than the state of the art (varying based on the dataset). The code is available here.
  • Cinemetrics. Frederic Brodbeck’s graduation project at the Royal Academy of Arts in The Hague. He created a visual “fingerprint” for a film by analysing color, speech, editing structure and motion.
  • Fast Video Shot Transition Localization with Deep Structured Models. Shitao Tang, Litong Feng, Zhangkui Kuang, Yimin Chen, Wei Zhang. arXiv 2017. A more recent paper on SBD (shot boundary detection) that introduced a complicated yet effective model and a more sophisticated dataset — ClipShots — for cut detection. Detects hard cuts and fades. The code is available here.
  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Joao Carreira, Andrew Zisserman. arXiv 2017. This paper introduced the Kinetics dataset, which is a landmark dataset in the video domain, and introduces a unique model (a combination of networks, to be precise) that achieves state of the art performance in existing video datasets. The code is available here.
  • Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. Michael Gygli. arXiv 2017. One of the first papers that used convolutional neural networks to detect when a cut happened in a video, and also the kind of cut — hard cut or fade.


Rahul Somani

I’m an aspiring data scientist interested in exploring how deep learning can enhance creative expression.