Identifying Shot Scale with Artificial Intelligence

Alternative Approaches

Amos Stailey-Young
Oct 5, 2022

This post is the second in a series. The code for this project can be found on my GitHub. The Google Colab notebook used to analyze the data can be found here.

In my previous post, I introduced the topic of “shot scale” — dividing sets of images into named categories like “close-up” or “establishing shot” — and discussed my process for estimating it. But I’m not the only one pursuing this goal. Before going further, we should consider two other compelling methods for identifying shot scale in cinema.

One approach is manual, requiring an individual to identify facial positions within the frame, and the other is automatic, relying on machine learning to estimate shot scale categories directly. Each approach has its benefits and its drawbacks.

Manually Collecting the Data

Representing the first approach is “The Framing of Characters in Popular Movies,” an essay by James E. Cutting of Cornell University’s Department of Psychology. Cutting used software to click between the eyes of the most prominent characters in each frame, measuring two variables: the number of characters per frame and their positions within it.

Figure from Cutting, James E. “The Framing of Characters in Popular Movies”, Art & Perception 3, 2 (2015): 191–212, doi: https://doi.org/10.1163/22134913-00002031

Cutting’s analysis should be more accurate than ours, since humans surpass AI at recognizing faces within images (accuracy here is measured against the judgments of actual people). And because Cutting’s data collection process was more rigorous, we can be more confident in his results.

We can clarify the problem by splitting errors into different types. Generally speaking, we should worry about two kinds of error: measurement error and sampling error. Cutting’s method benefits from lower measurement error because a human being, not an algorithm, directly identifies faces within the frame.

The downside to this approach, however, lies in its sampling error. Cutting’s data is more accurate (lower measurement error) because humans collected it. But acquiring all these images takes time — a lot of time. Because time was scarce, Cutting had to use far fewer frames than an automatic method would allow. And this scarcity brings its own problems.

For his sample, Cutting collected between 247 and 508 images per film, 16,063 in total. He does not, however, provide any information about the margin of error, but we can estimate it.

Sampling error measures the difference between the sample statistics and the population parameters. Cutting’s project has sampling error at two different levels. In the first case, a set of frames represents a sample, and all the frames in that film the population. In the second, a set of films is the sample and the set of all films made during the time period under scrutiny is the population. Frame is to film as film is to corpus.

In the first case, we have to assume an infinite population since we do not know the frame count of each film within the dataset (at 24 frames per second, a 90-minute film has roughly 130,000 frames, which means we can safely assume an infinite population for our calculations). If we assume a 95% confidence interval, the sampling error per film is roughly between 4.5% and 6.2%, which is not too bad.
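As a quick sanity check, here is that calculation as a minimal sketch, assuming the standard worst-case formula for a proportion (p = 0.5) at 95% confidence (z = 1.96):

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Worst-case margin of error for a proportion, infinite population."""
    return z * math.sqrt(p * (1 - p) / n)

# Frame-to-film level: Cutting's smallest and largest per-film samples
for n in (247, 508):
    print(f"n={n}: ±{margin_of_error(n):.1%}")
# n=247: ±6.2%
# n=508: ±4.3%
```

The results land close to the figures quoted above.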

While the sampling error between frame and film is relatively low, it is much higher between film and corpus. Cutting took his frames from only 48 films produced between 1935 and 2010 — not even one film per year. Because of this scarcity, the margin of error is relatively high, at around 14%. The true “mean number of characters in a frame” would then fall within a band 28 percentage points wide (±14%) at least 95% of the time. Because of this great uncertainty, we should be careful with our conclusions.
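The same worst-case formula, applied at the film-to-corpus level with n = 48, reproduces that figure:

```python
import math

# Film-to-corpus level: n = 48 films, 95% confidence, worst-case p = 0.5
print(f"±{1.96 * math.sqrt(0.25 / 48):.1%}")  # ±14.1%
```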

The downsides of the high sampling error become quite apparent when Cutting tries to examine cinematic framing historically. Drawing conclusions about the historical relationship between year of production and the mean number of characters in a frame requires subsampling by year. But there is an enormous problem with doing this: he doesn’t have enough data.

As mentioned above, Cutting only has a sample of 48 films, which is not even one film per year. Instead, he took 3 films for each 5-year interval.

Figure from Cutting, James E. “The Framing of Characters in Popular Movies”, Art & Perception 3, 2 (2015): 191–212, doi: https://doi.org/10.1163/22134913-00002031

One doesn’t need a degree in statistics to see the problem, nor do we really need to calculate the margin of error. But as a thought experiment, let’s see if we can estimate the sampling error. To be generous, let’s use a lower confidence level, say 90%. The number of movies the film industry produced each year varied, but it probably averaged around 200 to 250. To be charitable, let’s say that Hollywood made only 100 movies a year on average. That figure is far too low, but let’s assume it for now.

The above parameters are purposefully cautious. But when we calculate the margin of error according to those parameters, we still get a sampling error of ~47%. With such an absurdly high margin of error, we simply cannot conclude anything with a reasonable degree of certainty. That said, Cutting’s plot does look like it’s measuring a real trend; the historical effect may be so overwhelming that it shows through even with such small sample sizes.
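For the record, here is that estimate as a quick calculation under the assumptions above: 3 films sampled per 5-year interval, 100 films per year (so N = 500 per interval), 90% confidence (z = 1.645), worst-case p = 0.5, with a finite population correction:

```python
import math

n, N, z, p = 3, 500, 1.645, 0.5     # sample size, population, z-score, proportion
fpc = math.sqrt((N - n) / (N - 1))  # finite population correction
moe = z * math.sqrt(p * (1 - p) / n) * fpc
print(f"±{moe:.0%}")  # ±47%
```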

Nevertheless, Cutting’s analysis is quite valuable as long as it is not subsampled (such as by year). While he cannot draw conclusions about historical changes with any certainty, Cutting can illustrate the average framing throughout Hollywood cinema — at least from 1935 to 2010.

Figure 6. The distributions of locations for one-, two-, and three-character shots normalized to the image frame of a 1.37 aspect ratio in movies with aspect ratios of 1.37 (12 movies), 1.85 (9 movies), and 2.35 (21 movies). Displays represent smoothed areas of character positions divided by the 80th, 60th, 40th, and 20th percentiles of density. That is, the darkest areas have occurrences of individual characters whose density across all single-character images is greater than the 80th percentile, and the lightest areas have occurrences whose density is less than the 20th percentile. From Cutting, James E. “The Framing of Characters in Popular Movies”, Art & Perception 3, 2 (2015): 191–212, doi: https://doi.org/10.1163/22134913-00002031

Moreover, Cutting explores the relationship between the number of characters, shot scale, and average shot length, which is intriguing. But we probably need further study to understand how these relationships function during specific historical periods.

Figure 4. The mediation of the effect of shot scale on shot duration by the number of characters in the frame. That is, the effect of shot scale on shot duration (closer shots are linked to shorter duration shots; Bordwell, 2006, p. 137) is significantly affected (mediated) by the number of characters in the frame. In other words, the more characters the longer the shot scale, and in turn the longer the shot scale the longer the shot duration. From Cutting, James E. “The Framing of Characters in Popular Movies”, Art & Perception 3, 2 (2015): 191–212, doi: https://doi.org/10.1163/22134913-00002031

Further, the data on framing allows Cutting to evaluate the applicability of the “rule of thirds,” a concept that scholars have applied to the study of cinema. Cutting says that the “data are not strong evidence in favor a [sic] ‘rule of thirds’ as applied to the framing of characters in popular movies.” (207)

While Cutting and I are both after similar goals — to track how shot scale and framing change over time — our methods can complement each other. In contrast to Cutting, my dataset has low sampling error because I have included approximately 3500 films in my sample (rather than 48). My project, however, almost certainly has a higher measurement error. While they may, at first, seem redundant, we can actually use the two approaches in concert to validate each other. Later on, we can see whether my results are in agreement with those of Cutting. But there is another critical way his data could help my project.

Most of the uncertainty in my project arises from not having a way to gauge my measurement error. I calculate the relative face size from the dimensions of the predicted bounding box. But the model makes mistakes. Sometimes it fails to detect a face when it is there (false negative). Other times it finds one that is not there (false positive). Measuring these errors requires comparing the identified faces to some ground truth. So to estimate our measurement error, we would have to collect frames and manually locate the faces within the image. Cutting has done just that. So, if he were to make his data publicly available, we could use it to estimate our measurement error.
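To make that concrete, here is a minimal sketch of how such a comparison might work. The function name and data formats are my own illustrative assumptions (one between-the-eyes point per labeled face), not the structure of Cutting’s actual data:

```python
def detection_errors(pred_boxes, gt_points):
    """Compare predicted face boxes against human-labeled face points.

    pred_boxes: list of (x1, y1, x2, y2) bounding boxes from the detector.
    gt_points:  list of (x, y) ground-truth points, e.g. between-the-eyes
                clicks (hypothetical format).
    A box counts as a true positive if it contains an unmatched point.
    """
    matched = set()
    true_pos = 0
    for x1, y1, x2, y2 in pred_boxes:
        for i, (px, py) in enumerate(gt_points):
            if i not in matched and x1 <= px <= x2 and y1 <= py <= y2:
                matched.add(i)
                true_pos += 1
                break
    false_pos = len(pred_boxes) - true_pos  # detector found a face that isn't there
    false_neg = len(gt_points) - true_pos   # detector missed a labeled face
    return true_pos, false_pos, false_neg
```

From these counts, precision and recall follow directly, giving a concrete estimate of the detector’s measurement error.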

Estimating Shot Scale Automatically

The folks at CineScale represent the automatic, algorithmic approach. In the previous post, I discussed two ways of automatically labeling images according to their scale categories: supervised and unsupervised learning. CineScale takes the first approach, supervised learning: they labeled thousands of frames as long shots, medium shots, or close shots, then trained a Convolutional Neural Network on this dataset, producing a model that anyone can use to categorize images by shot scale.

Example of Scale Classifications from CineScale.
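For illustration, here is a rough sketch of what a classifier in this style looks like. The backbone, preprocessing, and `predict_scale` helper are my own assumptions for the sketch, not CineScale’s actual architecture or published weights:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Hypothetical stand-in for a CineScale-style classifier: a standard CNN
# backbone with a three-way close/medium/long head.
LABELS = ["close", "medium", "long"]

model = models.resnet18(weights=None)  # untrained backbone for the sketch
model.fc = torch.nn.Linear(model.fc.in_features, len(LABELS))
model.eval()  # in practice, trained weights would be loaded first

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def predict_scale(frame_path: str) -> str:
    """Classify a single frame into one of the three scale categories."""
    image = Image.open(frame_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # add batch dimension
    with torch.no_grad():
        logits = model(batch)
    return LABELS[int(logits.argmax(dim=1))]
```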

CineScale claims 94% accuracy on their test set, which is quite high given the inherent ambiguity in labeling shot scale (there is a subjective component; different people disagree about the same image). But we can assume the true accuracy is at least somewhat lower than 94%: images within the dataset (even those in the test set) are likely more similar to the training images than frames from outside the dataset would be, so test accuracy tends to overestimate real-world performance. We can also probably conclude that the measurement error for this method will be greater than that of Cutting’s approach, since AI is less reliable at this task than human beings (the labels of actual people serve as the ground truth for these images).

CineScale is the first project I’ve encountered that applies machine learning to the study of cinema. Their work is significant and original enough that I’ll explore how it performs on my dataset in its own dedicated post.

As I discussed in my last post, I want to explore some unsupervised learning methods for classifying shot scales, and I think that CineScale may be able to help with that. But this is an enormously complicated approach that a later post will explore in depth. Here, I only want to note how CineScale could help with this task.

Since relative face size is a continuous phenomenon, we can use algorithms to set thresholds that split images into scale categories of our choosing (theoretically, as many categories as we desire). CineScale chose three scale categories (close, medium, and long), but seven seems to be the most common, as explained in the canonical introductory film studies textbook, Film Art: An Introduction by David Bordwell, Kristin Thompson, and Jeff Smith:

The framing of the image stations us relatively close to the subject or farther away. This aspect of framing is usually called camera distance. The terms for camera distance are approximate, and they’re usually derived from the scale of human bodies in the shot. Our examples are all from The Third Man. In the extreme long shot, the human figure is lost or tiny (5.106). This is the framing for landscapes, bird’s-eye views of cities, and other vistas. In the long shot, figures are more prominent but the background still dominates (5.107). Shots in which the human figure is framed from about the knees up are called medium long shots (5.108). These are common because they permit a nice balance of figure and surroundings. The medium shot frames the human body from the waist up (5.109). Gesture and expression now become more visible. The medium close-up frames the body from the chest up (5.110). The close-up is traditionally the shot showing just the head, hands, feet, or a small object. It emphasizes facial expression, the details of a gesture, or a significant object (5.111). The extreme close-up singles out a portion of the face or isolates and magnifies an object (5.112).

The authors provide a handy figure that illustrates the different shot scales.

Figure from Film Art: An Introduction by David Bordwell, Kristin Thompson, and Jeff Smith.

The benefit of an unsupervised method is that we can always rederive the categories for any number of shot scales, whereas the supervised method CineScale uses cannot; to add a class to CineScale, we would essentially have to start over from scratch. There are numerous ways to classify shot scales, and while the system Bordwell employs is probably the most common, the exact divisions are pretty arbitrary. In certain cases, it may be best to employ 3, 5, 7, or (possibly) even 15 different categories.
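As a rough sketch of that flexibility, the following derives cut points from the continuous face-size measurements described earlier. The quantile-based choice and the helper names are my own illustrative assumptions:

```python
import numpy as np

def scale_thresholds(face_sizes, k=7):
    """Split continuous relative face sizes into k scale categories by
    cutting at evenly spaced quantiles. Quantiles are one simple,
    unsupervised choice; k-means or Jenks natural breaks would also work."""
    quantiles = np.linspace(0, 1, k + 1)[1:-1]  # k - 1 interior cut points
    return np.quantile(face_sizes, quantiles)

def assign_scale(face_size, thresholds):
    """Map a face size to a category index: 0 = widest, k - 1 = closest."""
    return int(np.searchsorted(thresholds, face_size))

# Example: re-deriving categories needs no labeled data, just a new k
sizes = np.random.default_rng(0).beta(2, 5, size=10_000)  # stand-in data
for k in (3, 7):
    print(k, scale_thresholds(sizes, k).round(3))
```

Changing the number of categories is just a matter of passing a different k; no relabeling or retraining is required.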

My next post will apply the CineScale model to my film sample and explore the results.
