Identifying Shot Scale with Artificial Intelligence

CineScale

Amos Stailey-Young
4 min read · Oct 13, 2022

This post is the third in a series. The code for this project can be found on my GitHub. The Google Colab notebook used to analyze the data can be found here.

As I discussed in my previous post, the researchers behind CineScale have released a new and unique dataset that attempts to catalog shot scale. They trained a Convolutional Neural Network to classify images as close, medium, or long shots, and because they've released the model to the public, I can run it on my own sample of Hollywood films. This post examines the results of applying the CineScale model to a representative sample of approximately 3,500 films produced in the US between 1915 and 1970.

CineScale uses machine learning (i.e., artificial intelligence) to label images as a close, medium, or long shot.

Example of the data collection process.

CineScale used a “supervised learning” approach in which two human “annotators” manually labeled approximately 792,000 frames according to the schema above. CineScale then trained a Convolutional Neural Network on this dataset so that it can automatically classify any image.
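I can't reproduce CineScale's actual architecture here, but to make the setup concrete, a minimal three-class frame classifier might look something like the sketch below. The backbone, layer sizes, and preprocessing are my own illustrative assumptions, not CineScale's.

```python
# Illustrative only: a minimal three-class shot-scale classifier in Keras.
# This is NOT CineScale's actual architecture, just a sketch of the kind of
# supervised setup described above.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 3  # close, medium, long

def build_shot_scale_classifier(input_shape=(224, 224, 3)):
    # A pretrained backbone with a small classification head is a common
    # choice for image-labeling tasks like this one.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=input_shape, pooling="avg")
    backbone.trainable = False  # could be fine-tuned later
    model = models.Sequential([
        backbone,
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training is then a standard supervised fit on the annotated frames, e.g.:
# model.fit(train_frames, train_labels, validation_data=(val_frames, val_labels))
```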

The model is available for download on the CineScale website. For the curious, I've written about my film sample in another context.
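To give a sense of how a downloaded model gets applied to a film, here is a rough sketch of the per-film inference step. The file names, class ordering, frame-sampling rate, and preprocessing are assumptions for illustration, not the exact CineScale interface.

```python
# Sketch of running a pretrained frame classifier over a film and tallying
# the proportion of close, medium, and long shots. The model path and the
# label order here are hypothetical.
import cv2
import numpy as np
import tensorflow as tf

CLASSES = ["close", "medium", "long"]  # assumed label order
model = tf.keras.models.load_model("shot_scale_model.h5")  # hypothetical path

def shot_scale_proportions(video_path, every_n_frames=24):
    counts = {c: 0 for c in CLASSES}
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:  # sample roughly one frame per second
            frame = cv2.resize(frame, (224, 224))
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) / 255.0
            probs = model.predict(frame[np.newaxis, ...], verbose=0)[0]
            counts[CLASSES[int(np.argmax(probs))]] += 1
        idx += 1
    cap.release()
    total = max(sum(counts.values()), 1)
    return {c: n / total for c, n in counts.items()}
```

Repeating that for every film in the sample yields one row of shot-scale proportions per title, which is what the analysis below is built on.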

Unfortunately, there isn't much to discuss here. CineScale is a novel and innovative application of artificial intelligence and machine learning to the study of cinema, and film studies, along with the digital humanities more broadly, can benefit significantly from data-analysis projects like it. However, the “resolution” of the model is too low to capture any information about how shot scale has changed historically. Further, the results of my analysis cast some grave doubts on the validity of the CineScale model itself.

Whether we examine the sample by year of production, genre, or studio, we get essentially the same proportions of close, medium, and long shots. The variation is negligible and can be explained by chance alone. Because the plots (shown below) are so similar, I'll discuss them as a group.

It only looks like “Long-Shots” are excluded because the values are so small relative to the other shot categories.
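For reference, the breakdowns behind those plots are simple aggregations of the per-film proportions, along the lines of this sketch. The file and column names are hypothetical stand-ins for my actual results table.

```python
# Group per-film shot-scale proportions by features of the sample.
# "shot_scale_by_film.csv" and its column names are hypothetical.
import pandas as pd

films = pd.read_csv("shot_scale_by_film.csv")

for feature in ["year", "genre", "studio"]:
    grouped = films.groupby(feature)[["close", "medium", "long"]].mean()
    print(grouped.describe())  # the spread is tiny whichever feature we pick
```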

Because the data does not seem to vary according to any feature of the dataset, the only conclusion we can draw is that the ratio of shot scales in American cinema remained essentially unchanged between 1915 and 1970. Such a conclusion is actually surprising because it seemingly contradicts the subjective experience of many researchers. Somewhat speculatively, I would say that most scholars would expect the close-up rate to differ by historical period, whether due to the introduction of sound or the emergence of widescreen. The model does not show this, however. So we have a discrepancy between what scholars anticipate and what the data says. How do we account for it?

There are two possible explanations for why the data does not vary by feature. Either the model is accurate and shot proportions really did change very little, or the model is inaccurate and misses changes that actually happened. Let's be charitable and assume for the moment that the model is correct. What would explain the lack of change in the ratio of shot scales?

If the model truly captures some aspect of reality, then we may have something akin to a “golden ratio” of shot proportions. It’s somewhat unclear why such a ratio might exist, however. Is it some consequence of human perception? If that’s the case, maybe we’ve stumbled upon an artistic or cultural “law.”

The other way to explain the low variance is that the model does not adequately capture the phenomenon it purports to represent, in this case shot scale. For adherents of Occam's razor, that explanation is much more likely than the first one. There is also some compelling evidence to support it.

Machine learning models, and Artificial Neural Networks like CineScale especially, are often accused of operating like “black boxes” since even their creators have difficulty explaining how they work. A lot can go wrong with these models in ways that are hard to predict.

The most concerning result from the CineScale model is the exceptional rarity of “long shots” in the sample. The highest percentage of long shots in a film is around 0.4%, which means they essentially don’t exist as a shot class in American cinema, according to the CineScale data. By not regularly predicting long shots, this model has almost turned a multi-class problem into a binary one.

Further, the average values hover around 2/3 for medium shots and 1/3 for close shots. With three classes in total, those fractions are suspiciously neat, and it is hard to believe the averages just happen to land so close to them by coincidence.
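Both red flags are easy to check directly from the per-film proportions. A minimal sketch, again assuming the hypothetical results table from above:

```python
import pandas as pd

films = pd.read_csv("shot_scale_by_film.csv")  # hypothetical per-film results

print(films["long"].max())                        # tops out around 0.004 (0.4%)
print(films[["close", "medium", "long"]].mean())  # means sit near 1/3, 2/3, and ~0
```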

Hopefully, I'm wrong about CineScale, but at the moment I have some serious doubts regarding the validity of their model. It's disappointing, especially since I had such high hopes initially. The lack of variance in the data makes it impossible to conclude much of anything, and when combined with the model effectively dropping the “long shot” class and the proportions hovering around those neat fractions, it suggests the model has somehow “overfitted” the data. Determining whether a model is overfit is not easy, but we can explore the issue in greater depth when examining the results of applying my own method to my sample. That is coming up in the next post.

Previous Posts

Next Posts

