Moviegoer: Four Categories of Comprehension

Tim Lee
Published in The Startup
4 min read · Aug 10, 2020

This is part of a series describing the development of Moviegoer, a multi-disciplinary data science project with the lofty goal of teaching machines how to “watch” movies and interpret emotion and antecedents (behavioral cause/effect).

Continued progress has helped clarify the overall goals of the project — we’ve identified four broad categories of knowledge that Moviegoer must identify and recognize. These categories aren’t tied to any specific aspect of the tech stack, and advances in one category may support another. Much like a human viewer, a machine must be able to parse four categories of comprehension to “watch a movie”: film structure; characters; plot and events; and emotional and style features.


Film Structure

An individual scene is a granular, self-contained component of every film. It has a fixed location, a set number of characters, and conveys one or more story beats. A scene can be analyzed individually, or compared against other scenes. Earlier in the project, we created an algorithm to identify a specific type of scene: the two-character dialogue scene. To go further, we’ll need to be able to divide the entire film into its individual scenes.

We’ll also want to divide the film into its eight sequences. Many films follow the eight-sequence approach, which can be thought of as a more detailed breakdown of the three-act structure. These eight sequences, each lasting roughly 15 minutes in a two-hour film, denote (broadly) when major plot points are supposed to unfold and when new characters might be introduced. Each of the eight sequences ends in a climax, which could be an important clue when identifying major plot points.
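As a rough illustration of the arithmetic, here is a minimal sketch that splits a runtime into eight equal sequences and maps a timestamp to its sequence. The equal-length split is an assumption made for illustration; real films won't divide this evenly, and the function names are hypothetical rather than part of Moviegoer.

```python
def sequence_boundaries(runtime_minutes, num_sequences=8):
    """Return (start, end) minute pairs, assuming equal-length sequences."""
    length = runtime_minutes / num_sequences
    return [(i * length, (i + 1) * length) for i in range(num_sequences)]


def sequence_index(timestamp_minutes, runtime_minutes, num_sequences=8):
    """Return the 1-based sequence number that contains a timestamp."""
    length = runtime_minutes / num_sequences
    # Clamp so the final minute of the film falls in the last sequence.
    return min(int(timestamp_minutes // length), num_sequences - 1) + 1


boundaries = sequence_boundaries(120)   # a two-hour film
print(boundaries[0])                    # first sequence: (0.0, 15.0)
print(sequence_index(100, 120))         # minute 100 falls in sequence 7
```

The estimated boundaries could serve as a starting guess for where to look for each sequence's climax, to be refined by whatever scene-level evidence is available.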


Characters

We’ll need to persistently track characters throughout the entire film, following their events and emotional changes. We can search for the vectorizations of each character’s face and voice to locate the scenes in which they appear. We’ll also need to attribute dialogue to each character, using NLP on the subtitles to understand what they’re saying.
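One common way to compare vectorizations is cosine similarity. The sketch below assumes each known character has a single reference face vector and that a fixed similarity threshold separates a match from a stranger. The toy three-dimensional vectors, the threshold value, and the function names are all illustrative assumptions, not Moviegoer's actual method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def match_character(face_vector, known_characters, threshold=0.9):
    """Return the best-matching character name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, reference in known_characters.items():
        score = cosine_similarity(face_vector, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy 3-dimensional vectors; real face encodings have many more dimensions.
known = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(match_character([0.9, 0.1, 0.0], known))  # close to alice's reference
print(match_character([0.5, 0.5, 0.7], known))  # no one is close enough
```

Running this match on every detected face in every scene would yield, per character, the list of scenes in which they appear.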

Films elicit responses through their characters’ emotions, so we’ll also need to monitor those emotions throughout the film. We can track each character’s ups and downs through analysis of facial expression, voice tone, and word choice, and see what antecedents triggered those emotional changes.
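To make "ups and downs" concrete, here is a minimal sketch that flags large emotional swings between consecutive scenes. It assumes each scene has already been reduced to a single score per character between -1 (very negative) and 1 (very positive). Both that scoring scheme and the `min_change` threshold are hypothetical.

```python
def emotional_shifts(scene_scores, min_change=0.4):
    """Return (scene_index, change) for each large swing between consecutive scenes."""
    shifts = []
    for i in range(1, len(scene_scores)):
        change = scene_scores[i] - scene_scores[i - 1]
        if abs(change) >= min_change:
            shifts.append((i, round(change, 2)))
    return shifts


# Hypothetical per-scene scores for one character.
scores = [0.2, 0.3, -0.5, -0.4, 0.6]
print(emotional_shifts(scores))  # [(2, -0.8), (4, 1.0)]
```

Each flagged scene is a natural place to look for an antecedent: whatever happened in or just before it is a candidate cause of the change.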

Plot and Events

A plot consists of many different events and happenings. We’ll need to use context to understand where a scene is taking place and what’s happening in it. Maybe an outdoor scene on a boat can be identified by the sound of waves crashing. Dialogue with a previously unknown character about appetizers or entrees may hint that a character is ordering from a waiter at a restaurant.

This particular category might be the most difficult to populate, and its conclusions might be filled with qualifiers and “best guesses”.
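A very simple version of this kind of "best guess" can be sketched as keyword matching on subtitle lines. The cue lists below are invented for illustration, and a real system would need far richer context. Still, the sketch shows how a guess can carry its own evidence count as a qualifier.

```python
import re

# Hypothetical cue words per location; invented for this example.
LOCATION_CUES = {
    "restaurant": {"appetizer", "appetizers", "entree", "entrees", "menu", "waiter"},
    "boat": {"deck", "anchor", "starboard", "sail"},
}


def guess_location(dialogue_lines):
    """Return (location, number_of_matching_cues), or (None, 0) if nothing matches."""
    words = set()
    for line in dialogue_lines:
        words.update(re.findall(r"[a-z']+", line.lower()))
    scores = {loc: len(cues & words) for loc, cues in LOCATION_CUES.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0)


lines = ["Could we start with appetizers?", "And for your entree?"]
print(guess_location(lines))  # ('restaurant', 2)
```

Returning the match count alongside the label keeps the uncertainty visible, so downstream steps can treat a one-word match differently from a strong one.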

Emotional and Style Features

Emotional and style features are somewhat “intangible”, and subject to interpretation. These are directorial choices used to elicit specific emotions in the audience. The musical score is the most prominent example: although we understand this music doesn’t actually exist within the scene, it’s been layered on top to make the audience feel sad, excited, tense, or a multitude of other emotions.

Color and brightness can easily be quantified with computer vision. Dark scenes might be moody or uneasy. A scene tinted blue is “cool”, and the location or situation could be inhospitable or foreign.
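As a sketch of what "quantified" could mean here, the function below averages a frame's RGB values, converts them to a single brightness figure using the standard Rec. 601 luma weights, and measures how far blue exceeds the other channels. The frame is assumed to be a plain list of (r, g, b) tuples. The two thresholds are arbitrary placeholders, not tuned values.

```python
def frame_color_stats(pixels, dark_below=60.0, blue_bias_above=20.0):
    """Summarize brightness and blue tint for one frame.

    pixels: non-empty list of (r, g, b) tuples with values in 0-255.
    The two thresholds are illustrative placeholders.
    """
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_g = sum(p[1] for p in pixels) / n
    mean_b = sum(p[2] for p in pixels) / n
    # Rec. 601 luma weights approximate perceived brightness.
    brightness = 0.299 * mean_r + 0.587 * mean_g + 0.114 * mean_b
    # Positive when blue dominates the average of red and green.
    blue_bias = mean_b - (mean_r + mean_g) / 2
    return {
        "brightness": brightness,
        "blue_bias": blue_bias,
        "is_dark": brightness < dark_below,
        "is_blue_tinted": blue_bias > blue_bias_above,
    }


# A tiny two-pixel "frame" that is both dark and blue-leaning.
stats = frame_color_stats([(10, 20, 60), (30, 40, 100)])
print(stats["is_dark"], stats["is_blue_tinted"])  # True True
```

Averaging these statistics over all frames in a scene would give a per-scene color profile that can be compared across the film.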

The cinematography, or shot choice, can also be scrutinized. A character’s face may fill the frame to emphasize a facial reaction, or we may see his entire body from afar to emphasize loneliness or emotional distance. A shot might be looking down at a character to make her seem powerless, or looking up at her for the opposite effect.

In “Rides Start at 10:00”, distance is used to emphasize loneliness
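One hedged way to approximate shot scale is to compare the area of a detected face with the area of the whole frame. The sketch below assumes a face detector has already produced a bounding box as (x, y, width, height). The cutoff ratios are illustrative guesses rather than established filmmaking thresholds.

```python
def shot_scale(face_box, frame_size, close_up=0.15, long_shot=0.01):
    """Classify a shot by the fraction of the frame that a face occupies.

    face_box: (x, y, width, height) in pixels, from any face detector.
    frame_size: (frame_width, frame_height) in pixels.
    The cutoff ratios are illustrative assumptions.
    """
    _, _, w, h = face_box
    frame_w, frame_h = frame_size
    ratio = (w * h) / (frame_w * frame_h)
    if ratio >= close_up:
        return "close-up"
    if ratio <= long_shot:
        return "long shot"
    return "medium shot"


print(shot_scale((500, 200, 800, 600), (1920, 1080)))  # close-up
print(shot_scale((900, 500, 40, 40), (1920, 1080)))    # long shot
```

A long shot on a lone character, as in the example above, could then be flagged as a possible cue for loneliness or emotional distance.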

This category of comprehension is simultaneously the most powerful and the most debatable. Recognizing these clues (and coding them into the project) relies heavily on domain knowledge in filmmaking. These empirical rules have evolved over more than a century of advances in filmmaking, and applying them requires a strong understanding of the craft. At the same time, some directors will consciously flout these rules as an artistic choice, and Moviegoer must be ready to accept these scenarios. But if a style rule holds in 99% of films, across all genres, it’ll still greatly help in interpreting emotion.
