Identifying Shot Scale with Artificial Intelligence

Why Study Shot Scale?

Amos Stailey-Young
11 min read · Sep 15, 2022

This post is the first in a series. The code for this project can be found on my GitHub. The Google Colab notebook used to analyze the data can be found here.

Localized face in the final shot of Memories of Murder (2003).

Shot Scale and Machine Learning

Shot scale is one of the most fundamental ways we classify frames when analyzing cinema. Words like “close-up” or “establishing shot” are so integral to how we talk about movies that it’s hard to imagine life without them. Wouldn’t it be wonderful if we could use artificial intelligence to classify frames by shot scale automatically? But how might we accomplish it?

Machine learning comes in two general forms: supervised and unsupervised. Supervised learning is the most common method in image classification; we know the correct labels beforehand because a person has assigned them by hand. In unsupervised learning, by contrast, we do not know the labels in advance. Unsupervised learning typically uses some form of cluster analysis to exploit the underlying structure in the data.

The trade-off is straightforward: supervised learning has broader applications but requires a human being to label every image manually, while unsupervised learning requires substantially less labor but is more limited in utility. In particular, unsupervised learning generally requires structured data, which videos are not, so we cannot use it to determine shot scale automatically.

Just as there are two types of machine learning methods — supervised vs. unsupervised — there are two types of machine learning prediction tasks — classification and regression. Classifiers predict discrete (countable) data, while regressors predict continuous (uncountable) data. The terms we use for shot scale — close-up, medium shot, establishing shot — are discrete, so we would train a classifier to recognize these categories.

Training an image classifier seems to be the only way to use machine learning to classify shot scale automatically. Doing so would require manually collecting and labeling thousands of images, with no guarantee of success. However, the folks in the Department of Information Engineering at the University of Brescia and the Department of Film Studies at the University of Budapest have done exactly that by creating an extensive dataset of human-labeled shot scales called CineScale. Because this dataset is so exciting, I will explore it in detail in a subsequent post.

Shot scale is often ambiguous, as an example from the canonical introductory film studies textbook, Film Art: An Introduction by David Bordwell, Kristin Thompson, and Jeff Smith, illustrates:

The framing of the image stations us relatively close to the subject or farther away. This aspect of framing is usually called camera distance. The terms for camera distance are approximate, and they’re usually derived from the scale of human bodies in the shot. Our examples are all from The Third Man. In the extreme long shot, the human figure is lost or tiny (5.106). This is the framing for landscapes, bird’s-eye views of cities, and other vistas. In the long shot, figures are more prominent but the background still dominates (5.107). Shots in which the human figure is framed from about the knees up are called medium long shots (5.108). These are common because they permit a nice balance of figure and surroundings. The medium shot frames the human body from the waist up (5.109). Gesture and expression now become more visible. The medium close-up frames the body from the chest up (5.110). The close-up is traditionally the shot showing just the head, hands, feet, or a small object. It emphasizes facial expression, the details of a gesture, or a significant object (5.111). The extreme close-up singles out a portion of the face or isolates and magnifies an object (5.112).

The authors provide a handy figure that illustrates the different shot scales.

Figure from Film Art: An Introduction by David Bordwell, Kristin Thompson, and Jeff Smith.

Even with only seven shot categories, we can see how difficult it can be to distinguish, say, a medium long shot from a medium shot (I don’t know what’s going on, but 5.108 looks closer than 5.109 to me; it seems like we should reverse them). If we asked people to classify shots, I imagine the results would differ widely from person to person. But why?

The terms for camera distance are approximate, and they’re usually derived from the scale of human bodies in the shot.

The human body serves as the primary marker of shot scale, with thresholds at various points on the body (feet, torso, bust, head, etc.). Which body parts are present within the image is highly correlated with the relative size of human beings within the frame. In the “extreme close-up” example above, only the eye is visible, not the other features of the face, which means the relative size of the human figure in the image must be exceedingly large.

However, while shot scales are usually derived from the size of people in the image, they are not always. More precisely, we derive shot scales from the relative size of the primary object of interest, which is usually the human face. This salient object can conceivably be anything, as long as it’s the (metaphorical) center of focus. Shot scales are determined by the size of the most prominent figure in the image, as illustrated below.

Wouldn’t we say this is a close-up? Frame from Detour (1945).

In mainstream commercial cinema, the human face is overwhelmingly the most salient object in the frame. And if it’s the most salient object, its size must overwhelmingly determine shot scales. So if we could estimate the relative size of the most pronounced human face, we might be able to ascertain its correct shot scale label.

How Useful Is Our Shot Scale System?

My goal is to track how shot scale changes over time. One reason for doing so is that we can better answer particular historical questions: did the emergence of widescreen result in fewer close-ups, as has often been suggested? What about the arrival of sound? This project could provide us with much greater insight into how shot scale has developed historically and, just as importantly, identify the features within the data that drive these historical changes.

But there is also a more fundamental reason for undertaking this project. Film scholars, viewers, and critics have found utility in our shot scale system. As I noted at the start of this post, words like “establishing shot” are so ubiquitous that we often take them for granted. But if we want to address the problem of cinematic framing quantitatively, how useful is it to assign a shot scale label to individual frames?

A lot hinges on whether we consider shot scales to be continuous or discrete phenomena. Earlier I mentioned that our classification system is discrete, since we use categorical terms. Shot scales like “close-up,” however, are determined by the relative size of the predominant figure in the image, and relative size is continuous. Qualitative thresholds segment this continuous data into discrete, categorical terms (e.g., “close-up”), as sketched below.
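To make this concrete, here is a minimal sketch of how such thresholds might work. The cut points are invented for illustration; as the next paragraph argues, there is no uniform standard for where they fall.

```python
# Hypothetical mapping from a continuous measurement (relative face area)
# to a discrete shot-scale label. The cut points below are invented for
# illustration; in practice they are subjective and vary between people.
def shot_scale_label(relative_face_area: float) -> str:
    if relative_face_area < 0.005:
        return "extreme long shot"
    elif relative_face_area < 0.02:
        return "long shot"
    elif relative_face_area < 0.05:
        return "medium long shot"
    elif relative_face_area < 0.12:
        return "medium shot"
    elif relative_face_area < 0.25:
        return "medium close-up"
    elif relative_face_area < 0.6:
        return "close-up"
    else:
        return "extreme close-up"
```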

Because there is no uniform classification method, and the threshold points are predominantly subjective, discrete shot scales may be problematic to analyze quantitatively; so much depends on how the categories are defined. Focusing on the relative size of the largest face in the image may therefore provide greater utility than labeling images as a “medium shot” or a “long shot.” We can test this, however: despite its somewhat ontological framing, the distinction is actually straightforward to express statistically.

If our shot scale labels actually have stable identities, then we should expect the resulting data to have a multimodal distribution. In a multimodal distribution, the curve has multiple “peaks,” as seen below.

Schematic Example of a Bimodal Distribution

We should expect the values to “cluster” around these peaks if the shot scales have stable identities. Because the data is one-dimensional, we are not really clustering it; instead, we are locating “natural breaks.” In this case, the local minima form the breakpoints and the local maxima represent the “cluster centers,” which would be our shot categories (e.g., close-up). In the image below, for example, green markers represent the local maxima, or cluster centers, while red ones represent the thresholds between them. If the plot below depicted the distribution of shot scales, we could say that the green points represent long, medium, and close-up shots, respectively.

Artificial example to illustrate segmenting a kernel density estimate.
Natural Example of Multimodal Histogram
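As a rough sketch of how these breaks could be located, we can fit a kernel density estimate to the face-size values and read off its local extrema. The data below is synthetic, standing in for real face-size measurements:

```python
# A minimal sketch of locating "natural breaks" in one dimension: fit a
# kernel density estimate to relative face sizes, then treat local maxima
# as cluster centers and local minima as the thresholds between them.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

# Synthetic face sizes (face area / frame area), one value per shot.
rng = np.random.default_rng(0)
face_sizes = np.concatenate([
    rng.normal(0.02, 0.01, 500),   # "long"-ish shots
    rng.normal(0.10, 0.03, 500),   # "medium"-ish shots
    rng.normal(0.30, 0.05, 500),   # "close-up"-ish shots
]).clip(0, 1)

kde = gaussian_kde(face_sizes)
xs = np.linspace(0, 1, 1000)
density = kde(xs)

maxima = argrelextrema(density, np.greater)[0]  # cluster centers
minima = argrelextrema(density, np.less)[0]     # breakpoints

print("cluster centers:", xs[maxima])
print("thresholds:", xs[minima])
```

If the real distribution were unimodal instead, this procedure would find a single peak and no meaningful thresholds.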

If shot scales like “close-up” don’t have stable identities, we should expect a more normal, unimodal distribution. In that case, we assign labels to shots merely by locating arbitrary thresholds in a continuous variable (relative face size). After all, film producers are not trying to make a “medium shot” per se; the scale is simply an effect of other narrative considerations, like deciding where in the image the spectator is meant to look, which characters should have priority in the frame, and so on. If so, it doesn’t seem necessary to split shots into separate categories; we need only track the single quantity of relative face size.

Examples of different unimodal distributions. Red represents the standard normal distribution.

I think shot scales predominantly represent a way to talk about cinematic images, and that things like “close-ups” don’t exist as such. There is no objective identity to these labels. While producers may use these scale categories as a general guide, they are essentially subjective, and one person’s medium shot is not the same as someone else’s. I therefore posit that our data will resemble a unimodal, roughly normal distribution. If it doesn’t, that supports the view that shot scales do have stable identities.

How the Model Works

We first need a dataset to work with, which I’ve already collected and written about for another project. Next, we need a face detection model to estimate the relative size of faces; for this we can use the Python library Face Recognition, which is built on dlib and supports GPU acceleration for faster processing. We can then extract sample frames from each video and run those frames through the detector.

For each sampled frame, we can locate bounding boxes containing recognized faces and then measure the relative size of those faces.

Frame from Breathless (1959).
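A minimal sketch of that measurement, using the face_recognition API (the frame path is hypothetical):

```python
# Detect faces in a sampled frame with the face_recognition library
# (built on dlib) and measure the relative size of the largest face.
import face_recognition

# Hypothetical path to a frame sampled from a video.
image = face_recognition.load_image_file("frames/breathless_0421.jpg")
frame_height, frame_width = image.shape[:2]

# Each detection is a (top, right, bottom, left) bounding box in pixels.
locations = face_recognition.face_locations(image)

def relative_area(box):
    top, right, bottom, left = box
    return ((bottom - top) * (right - left)) / (frame_height * frame_width)

if locations:
    largest = max(locations, key=relative_area)
    print(f"largest face covers {relative_area(largest):.1%} of the frame")
```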

The model is not without its problems. It struggles on certain kinds of images, generally when much of the face is occluded or at an odd angle. We can see respective examples in the frames below.

The model fails to detect extreme close-ups, like this one from The Good, the Bad, and the Ugly (1966).
The model struggles when faces are shown from odd or unusual angles, as in this frame from The Passion of Joan of Arc (1928).

There is one other factor to consider: a single frame can contain many faces, so we must establish a protocol for frames with multiple detections. I took the biggest face within each image, since size generally equates to significance within the frame (only as a rule of thumb, of course). Another method is to average the face sizes, as sketched below. Later, we’ll have to see whether changing this protocol significantly affects our results.
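The two protocols differ only in the aggregation step (the function names here are my own):

```python
# Two protocols for frames containing multiple faces: keep only the
# largest face, or average across all detected faces. `face_areas` is a
# list of relative face sizes measured in a single frame.
def largest_face(face_areas: list[float]) -> float | None:
    return max(face_areas) if face_areas else None

def mean_face(face_areas: list[float]) -> float | None:
    return sum(face_areas) / len(face_areas) if face_areas else None
```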

Framing

Identifying shot scale is a worthy and fascinating pursuit, but what truly excites me is how face detection can show us how filmmakers frame faces within cinematic images. We calculate the relative size of the face in the image using the dimensions of the bounding box, as shown above. But we can also localize faces within the image, which will help us better understand how factors like genre, director, aspect ratio, and color affect cinematic framing.

Although we can localize a face in an image, I was initially unsure how to use this data. However, I took inspiration from Kevin L. Ferguson's project that “compresses” sampled frames from a film into a single image.

Featured image from Kevin L. Ferguson’s article “What Does a Western Really Look Like?”

Instead of averaging brightness across pixels, we can create a heatmap showing where faces appear most frequently within the image. While the colors shown above are “real,” in that they represent the average color value for each pixel, the colors in the heatmap below represent how frequently faces appear at each pixel (the exact colors, then, are arbitrary).

Facial heatmap for Casablanca (1942).

The image looks really cool, and it does give us some insight into how faces are framed in Michael Curtiz’s classic Casablanca (1942). We can see, for instance, that faces are framed at a similar height across the image. I would imagine this is because much of the narrative unfolds in conversations at seated tables. If we were to look at more of these images, we might be able to draw further insights into particular films. However, we must inspect these images qualitatively, as such heatmaps do not immediately lend themselves to quantitative analysis.
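For the curious, a heatmap along these lines can be accumulated by incrementing every pixel covered by a detected face box across sampled frames. This is a sketch under my own assumptions about resolution and file layout, not the exact procedure behind the image above:

```python
# Accumulate face bounding boxes from sampled frames into a heatmap:
# every pixel covered by a face box is incremented, then the grid is
# normalized so values represent relative frequency.
import numpy as np
import face_recognition
from pathlib import Path

H, W = 360, 640                 # heatmap resolution (assumed)
heatmap = np.zeros((H, W))

for path in sorted(Path("frames/casablanca").glob("*.jpg")):  # hypothetical folder
    image = face_recognition.load_image_file(str(path))
    h, w = image.shape[:2]
    for top, right, bottom, left in face_recognition.face_locations(image):
        # Rescale the box into heatmap coordinates before accumulating.
        t, b = int(top / h * H), int(bottom / h * H)
        l, r = int(left / w * W), int(right / w * W)
        heatmap[t:b, l:r] += 1

if heatmap.max() > 0:
    heatmap /= heatmap.max()    # normalize to [0, 1] for plotting
```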

If we want to quantify this framing data, we can take a different approach: break the frame into equal-sized rectangles and measure the frequency with which faces appear in each region. But how many sections should we divide the frame into? There is a pictorial convention known as the “rule of thirds,” which Annette Kuhn and Guy Westwell define in the Oxford Dictionary of Film Studies as:

A flexible compositional ‘rule’ taught as part of painting and photographic practice and which may be extended to the framing of shots in filmmaking. Its aim is to indicate where significant elements may be placed in the frame in order to attract the viewer’s attention, and also produce a well composed — visually coherent and harmonious — image. This idea of composition, based on geometrical principles, stems from ideas developed from the ancient Greek and Roman periods which still hold sway in Western culture today, the argument being that geometrical ‘rules’ follow the ‘rules’ of nature. The rule of thirds ordains that the frame be divided into thirds both vertically and horizontally: if lines were drawn to mark these thirds they would look like the grid used to play noughts and crosses, but with flatter rectangular spaces. The intersections of the four gridlines represent the approximate points where objects in the frame would be placed. In the case of filmed closeups, for example, the subject’s eyes would be lined up to match the upper horizontal third. However, conventions of composition change and develop, and in filmmaking centred framings are more common than those using rule of thirds. (355)

If we carefully consider the quote above, we see that it makes a prediction we will soon be able to test empirically. With our facial data, we can locate the average height of face locations and see how this has changed over time. We can also test whether “centred framings are more common than those using rule of thirds,” as the entry claims. Further, we can see how technological developments like widescreen and color may have affected conventional framing in classical Hollywood cinematography.
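A minimal sketch of the grid-binning idea, using a 3×3 rule-of-thirds grid and assigning each face by the center of its bounding box (the helper function here is my own):

```python
# Bin face locations into a 3x3 rule-of-thirds grid: each face is
# assigned to a cell according to the center of its bounding box.
import numpy as np

grid = np.zeros((3, 3), dtype=int)

def add_face(box, frame_h, frame_w):
    """Count a (top, right, bottom, left) face box in its grid cell."""
    top, right, bottom, left = box
    cy = (top + bottom) / 2 / frame_h   # vertical center in [0, 1)
    cx = (left + right) / 2 / frame_w   # horizontal center in [0, 1)
    grid[min(int(cy * 3), 2), min(int(cx * 3), 2)] += 1

# For example, a face box on a 720x1280 frame:
add_face((100, 700, 300, 500), 720, 1280)
print(grid)
```

Averaging the vertical centers across a corpus would directly test the claim about eyes lining up with the upper horizontal third.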

The next post will look at various alternative approaches others have advanced for the empirical study of shot scale and framing. Doing so will highlight both the benefits and drawbacks of my machine-learning-based approach. Subsequent posts will dive in-depth to explore the data.


Amos Stailey-Young

I work at the intersection between cultural history and data science, developing new analytical methods and strategies for use in the Digital Humanities.