Using AI to understand the way a movie “looks” and “sounds”

Theodore Giannakopoulos
Behavioral Signals - Emotion AI
9 min read · Jan 15, 2018

What movie should I watch tonight? A question we ask all the time, either explicitly, to a friend whose movie taste we trust, or implicitly, by using a movie-related website (IMDb) or a content platform (Netflix). Fortunately, the recommender systems (RSs) used in all modern streaming platforms manage to suggest movies we will probably like. But how do RSs work?

RSs are information-processing methodologies that focus on recommending the most relevant items to users. For example, items can be movies and users can be customers of a content platform. One way to achieve recommendation is by using collaborative knowledge: a user is matched with other users based on her/his past behavior, and this matching is then used to predict future preferences. Collaborative filtering is based on the assumption that users who had similar preferences in the past will also share similar preferences in the future. On the other hand, content-based RSs utilise information that stems from discrete characteristics of the movies, such as directors, genres, locations and actors. Finally, hybrid RSs combine collaborative and content-based methodologies.
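To make the collaborative idea more concrete, here is a minimal toy sketch of user-based collaborative filtering in Python; the ratings matrix, function names and weighting scheme are purely illustrative and not the approach of any particular platform.

```python
import numpy as np

# Toy user-item ratings matrix (rows: users, columns: movies, 0 = unrated).
# All values are hypothetical, just to illustrate the idea.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors (unrated items count as 0, for simplicity)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def predict(user_idx, item_idx):
    """Predict a rating as a similarity-weighted average of the other users' ratings."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user_idx or ratings[other, item_idx] == 0:
            continue
        sim = cosine_sim(ratings[user_idx], ratings[other])
        num += sim * ratings[other, item_idx]
        den += abs(sim)
    return num / den if den else 0.0

print(predict(0, 2))  # predicted rating of the first user for the third movie
```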

However, most of these methods rely on human-generated content (either past user preferences or manually labelled content tags) and do not take into account the raw content of the movie itself. Can AI be used to analyse a movie’s raw content (i.e. its subtitles, sound and video), in order to discover knowledge about the way a movie “sounds” and “looks”? Such knowledge could obviously boost the performance of movie recommendation systems and also provide explanatory results about user preferences. I will present here some of the findings of this paper, where the application of common computer vision, audio analysis and text mining methodologies on raw visual, audio and textual information from movies has been shown to lead to better estimations of movie similarities.

Text can be directly extracted from the subtitles and, compared to the audio and visual domains, needs no sophisticated analysis or preprocessing: after basic parsing to remove unwanted information such as markup and timestamps, non-informative words that do not add to the distinctiveness of the films are excluded (stop-word removal), and lemmatisation is usually performed in order to reduce inflectional forms.
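As a rough illustration (not the exact pipeline of the paper), this kind of subtitle preprocessing can be sketched in a few lines of Python with NLTK; the regular expressions and the minimum word length below are my own illustrative choices.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def subtitle_to_words(srt_text):
    """Turn raw .srt subtitle text into a clean list of lemmatised words."""
    # Drop timestamps (e.g. 00:01:02,500 --> 00:01:04,000) and markup such as <i> tags.
    text = re.sub(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}", " ", srt_text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z']+", " ", text)   # keep only letters and apostrophes
    words = text.lower().split()

    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Stop-word removal and lemmatisation, as described above.
    return [lemmatizer.lemmatize(w) for w in words if w not in stop and len(w) > 2]
```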

After this preprocessing step, each movie can be considered as a bag of words. Based on that, each movie can be further represented either as a vector of term frequency-inverse document frequency (tf-idf) weights, or through more sophisticated methods that reduce the representation dimensions. Tf-idf is a simple weighting scheme, according to which the words of a document are allocated a weight denoting their importance for that specific document. Instead of this simplistic approach, more complex methodologies can be adopted, such as Latent Dirichlet Allocation (LDA), a probabilistic generative model built upon the idea that all documents (movies) can be thought of as mixtures of specific topics. Each topic is a distribution over the words in the global vocabulary of the collection of documents (movie subtitles). The following figure illustrates one of the topics extracted from around 150 movies, as a word cloud. The size of each word is proportional to the importance of the word for this topic. This topic is obviously associated with war movies:

This topic modelling approach can be used to represent each movie as a combination of various topics, i.e. groups of words with different weights, combined into a common semantic unit. The usefulness of the learned topic model in grouping similar movies together, based on their relevance to specific topics, is demonstrated in the following figure, where movies are grouped together as co-thematic, according to the respective topic extracted from the subtitle text:

This topic modelling approach offers a very strong dimension of similarity between movies, based on the topics that exist in the subtitles, and therefore in the plots of the movies themselves. Next, we are going to see lower-level features that stem from the sounds and the visual information of the movies, and how these features are associated with a movie’s content.
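Before doing that, and for readers who want to experiment, here is a heavily simplified sketch of the subtitle-based tf-idf and LDA representations described above, using scikit-learn; the number of topics, the vocabulary size and the toy documents are illustrative assumptions rather than the settings used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# `docs` is assumed to hold one preprocessed subtitle string per movie.
docs = ["marine sergeant fight war soldier", "love wedding family dinner", "ship ocean fish reef"]

# (a) tf-idf: one weighted word vector per movie.
tfidf_matrix = TfidfVectorizer(max_features=5000).fit_transform(docs)

# (b) LDA: represent each movie as a mixture over latent topics.
counts = CountVectorizer(max_features=5000).fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
movie_topics = lda.fit_transform(counts)   # shape: (n_movies, n_topics)

# Movies with similar topic mixtures end up close to each other.
print(cosine_similarity(movie_topics))
```

Similarities computed on such topic vectors are the basis for the kind of co-thematic groupings shown in the figure above.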

We have talked about content characterization of movies based on their subtitles: it is obvious that if the words “fight”, “marine” and “sergeant” appear in two movies, then these movies share a similar topic related to war. But what about lower-level cues that describe not only a movie’s content but also its style? There are certain attributes of a film that make us (dis)like it and which are not necessarily related to topics or metadata (actors, genre, director, etc.). In other words, two films can be similar in terms of cues such as music soundtracks, audio effects and camera movements.

Let’s start with the audio information of a film: I am sure you will have noticed that when another member of your family is watching a movie and you are in another room, you can tell the movie’s genre or even its sentiment, despite the fact that you are not watching it but only listening to it. Music background themes, music tracks, sound effects, dialogues and background sounds all play a vital role in a movie’s style. The graph below shows how some famous movies are distributed with regard to their music genres: three musical genres have been used to illustrate this distribution, namely rock, electronic and classical. You can see how Pi (1998) has a soundtrack of almost 100% electronic music, while the soundtrack of the Matrix sequel is equally distributed between rock and electronic, etc.

Such information can be directly extracted from the movie’s audio signal using temporal and spectral feature representations and supervised machine learning algorithms. Using similar audio analysis approaches, a film can be characterized with regard to its speakers, audio events (machinery, gunshots, crowds, screams) and even emotional states (based on speech).
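The paper uses its own audio analysis toolchain, but to give a flavour of what “temporal and spectral feature representations” means in practice, here is a minimal sketch using librosa; the specific features and statistics chosen below are illustrative assumptions.

```python
import numpy as np
import librosa

def audio_feature_vector(wav_path):
    """Summarise a movie's audio track with simple temporal/spectral statistics."""
    y, sr = librosa.load(wav_path, sr=22050, mono=True)

    # Short-term spectral and temporal descriptors.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)

    # Mid-term statistics (means/stds) form one fixed-size vector per movie.
    feats = np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [centroid.mean(), centroid.std(), zcr.mean(), zcr.std()],
    ])
    return feats
```

A fixed-size vector like this can then be fed to a supervised classifier (an SVM, for instance) trained on labelled examples of music genres, audio events or speech emotion.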

Beyond doubt, visual information can be considered the richest domain of a movie. There are particular low-level visual features that express latent semantic attributes and discriminate between different cinematic techniques and film contents. First, the colors adopted in a film play a vital role in the director’s effort to enhance the mood or to underline a dramatic tone in the movie. Differences in the color or the illumination of a film can be due either to the depicted subjects or to an artistic process, since digital color correction is deliberately applied to convey an artistic perspective. In other words, in movies colors do not just reflect what is being depicted but also how it is depicted. The following figure shows screenshots from famous movies where one of the RGB (Red, Green, Blue) color channels is highly dominant.

Rows 1–4 are movies with red as a dominant color (In the Mood for Love, Lock, Stock and Two Smoking Barrels, The Godfather Part II and Django Unchained). Rows 5–8 correspond to movies with green as a dominant color (The Matrix, The Matrix Reloaded, Pirates of the Caribbean: At World’s End and Fight Club) and rows 9–12 to blue movies (Finding Nemo, Star Wars Episode V: The Empire Strikes Back, Aliens and Blade Runner).

In the above examples, only in the case of “Finding Nemo” is there an obvious reason for the dominant color (blue in this case), directly associated with the movie’s content (it is an animation whose story takes place in the ocean). In all other cases, the selection of the dominant color corresponds to an intentional choice made by the producers to express meaning (e.g. red is usually selected to express violence and sin), mood or even a particular era (warm colors are usually adopted in movies set in the 60s or 70s). Finally, in some cases the adoption of a dominant color expresses a very particular plot concept: in The Matrix, the green color choice refers to the monochrome monitors used in early computing and is used to discriminate the simulation from the “real” world. Such discriminations can be easily modelled using simple statistics and histogram calculations on the raw color values of each frame. Similarly, illumination and saturation statistics (either as simple averages or as centroids of clusters extracted through unsupervised learning) can discriminate between “dark” and “light” movies, or between “saturated” and “unsaturated” movies, as shown in the figures below:

Two very dark movies (Sin City and Dr. Strangelove) and one very light movie (Pi)
The Machinist is an example of a very unsaturated movie. The Secret in Their Eyes, on the other hand, is a very saturated movie
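A rough sketch of how such colour, illumination and saturation statistics can be computed with OpenCV is shown below; the frame sampling step and the returned statistics are my own illustrative choices, not the exact features used in the paper.

```python
import cv2
import numpy as np

def color_statistics(video_path, step=25):
    """Per-movie colour statistics: mean RGB dominance and HSV saturation/brightness."""
    cap = cv2.VideoCapture(video_path)
    rgb_means, sat_means, val_means = [], [], []
    idx = 0
    while True:
        ok, frame = cap.read()              # frame is BGR
        if not ok:
            break
        if idx % step == 0:                 # sample one frame every `step` frames
            rgb_means.append(frame.reshape(-1, 3).mean(axis=0)[::-1])  # BGR -> RGB
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            sat_means.append(hsv[..., 1].mean())
            val_means.append(hsv[..., 2].mean())
        idx += 1
    cap.release()
    return {
        "mean_rgb": np.mean(rgb_means, axis=0),        # dominant colour tendency
        "mean_saturation": float(np.mean(sat_means)),  # "saturated" vs "unsaturated"
        "mean_brightness": float(np.mean(val_means)),  # "light" vs "dark"
        "saturation_std": float(np.std(sat_means)),    # stable vs abrupt colour changes
    }
```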

Unsupervised learning and temporal analysis can also be used to extract changes in the illumination or the saturation within a movie itself. This can be used to discriminate movies with more “stable” color characteristics from movies with abrupt changes in their color characteristics. Consider Kill Bill, for example, where switching between monochrome (unsaturated) and extremely saturated frames is common:

Along with color, motion is the most important visual characteristic of a film and strongly differentiates between genres and filming techniques. Motion patterns can change either due to the subjects’ movements (and therefore depend on the type of action) or due to the way the camera moves. Motion patterns can be modelled through the extraction of optical flow, which has been widely studied in computer vision and video coding. Through the estimation of the flow vectors, one can use supervised machine learning to classify the movement of the camera into particular classes of cinematographic techniques such as pan (camera rotates horizontally from a fixed position), tilt (camera rotates vertically from a fixed position), pedestal (camera moves on the vertical axis, without movement on the horizontal axis) and truck (camera moves left or right without change in its perpendicular location). The following figure shows a typical example of a horizontal camera movement and how it affects the extracted flow vectors.

Panning scene example and respective flow vectors (green vectors) extracted using an optical flow algorithm implemented in Python. From the intro of the movie Cowboys and Aliens
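A minimal version of this idea, using OpenCV’s dense (Farneback) optical flow and a naive rule to separate horizontal from vertical camera motion, might look like this; the thresholds and class labels are illustrative assumptions, not the classifier used in the paper.

```python
import cv2
import numpy as np

def dominant_camera_motion(video_path):
    """Classify the dominant camera motion of a clip from dense optical flow vectors."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError("Could not read video: " + video_path)
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    dx_list, dy_list = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        dx_list.append(flow[..., 0].mean())   # mean horizontal displacement
        dy_list.append(flow[..., 1].mean())   # mean vertical displacement
        prev_gray = gray
    cap.release()

    dx, dy = np.mean(dx_list), np.mean(dy_list)
    if abs(dx) > 2 * abs(dy) and abs(dx) > 0.5:
        return "pan / truck (dominant horizontal motion)"
    if abs(dy) > 2 * abs(dx) and abs(dy) > 0.5:
        return "tilt / pedestal (dominant vertical motion)"
    return "static or mixed motion"
```

In the paper’s setting, of course, the flow statistics feed a trained classifier rather than a hand-tuned rule like the one above.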

Another important visual characteristic in cinematography is the presence of faces and the way they are portrayed in films. Close-ups are often given to leading characters in order to indicate their importance. Faces can be detected using computer vision methodologies. The relative size of a face close-up, the face’s orientation, as well as the number of faces that appear in a movie’s frame, can discriminate between different cinematographic styles. For example, long close-ups of particular characters usually aim to make the audience engage with the character. Once faces have been detected in the movie’s video using computer vision methods, such attributes can be directly extracted using simple geometrical statistics.

As soon as faces are detected for each frame of a film, their relative sizes and orientation as well as the number of faces appearing in each frame can be used to characterise the director’s view
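As a sketch of how such face statistics can be computed, the snippet below uses OpenCV’s bundled Haar cascade as a stand-in face detector; the sampling step and the returned statistics are illustrative choices rather than the paper’s exact setup.

```python
import cv2
import numpy as np

# OpenCV ships a pre-trained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_statistics(video_path, step=25):
    """Per-movie face statistics: average face count and relative face size per frame."""
    cap = cv2.VideoCapture(video_path)
    counts, rel_sizes = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            counts.append(len(faces))
            frame_area = frame.shape[0] * frame.shape[1]
            for (x, y, w, h) in faces:
                rel_sizes.append((w * h) / frame_area)   # close-ups give large ratios
        idx += 1
    cap.release()
    return {
        "avg_faces_per_frame": float(np.mean(counts)) if counts else 0.0,
        "avg_relative_face_size": float(np.mean(rel_sizes)) if rel_sizes else 0.0,
    }
```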

Apart from colors, motion and faces, shot transition is an important procedure in cinematography. It is usually applied in the post-production stages and combines shots (sequences of successive video frames captured without interruption by a single camera) and scenes. Shot transitions are usually done by simple cuts, while visual transition effects are also sometimes adopted. Some directors use long takes, i.e. shots that are longer than usual. The film Rope by Alfred Hitchcock is the first widely known film with many long takes. Automatic detection of shots in videos is a widely studied task in computer vision, usually taking into account abrupt changes in color and motion between successive frames. The following figure shows the distribution of 11 movies with regard to their average shot length (x-axis) and their top-10 average shot length (y-axis).

Average shot length (x-axis) and average top-10 shot length (y-axis) for 11 movies. Movies with abrupt cuts and fast camera movements, such as Run Lola Run and Trainspotting, have very low values for both features. On the other hand, Angelopoulos’ The Suspended Step of the Stork is known for its slight movements and gradual changes, as well as long takes (very long shots)
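Shot boundary detection itself can be approximated quite crudely: the sketch below flags a cut whenever the colour histogram changes abruptly between successive frames and then derives the average shot length. The histogram settings and threshold are illustrative assumptions, not the detector used in the paper.

```python
import cv2
import numpy as np

def average_shot_length(video_path, threshold=0.5):
    """Detect hard cuts from colour-histogram changes and return the mean shot length (seconds)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    prev_hist, cut_frames, idx = None, [0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        hist = cv2.normalize(hist, hist)
        if prev_hist is not None:
            # A large histogram distance between successive frames suggests a cut.
            diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if diff > threshold:
                cut_frames.append(idx)
        prev_hist = hist
        idx += 1
    cap.release()
    shot_lengths = np.diff(cut_frames + [idx]) / fps   # frames between cuts, in seconds
    return float(np.mean(shot_lengths))
```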

This article provides a short list of examples of how multimodal signal analytics can be used to extract knowledge from low-level textual, visual and audio information in movies. This extracted knowledge can be used to estimate semantic similarities between movies that utilise latent cinematographic and style characteristics. In the publication, a small dataset of around 150 movies was used to show that features such as the ones described in this article can boost the performance of a film similarity system by up to 50%, compared to simple metadata alone. Such content representation approaches can open the way for holistic methods in film recommendation, taking into consideration information that is “richer” than the simple metadata manually provided by users or annotators, since such metadata only describe abstract film characteristics related to genres and topics, or static attributes such as directors and actors. Additionally, newly available deep learning solutions can offer more accurate and scalable methods for recognising multimodal content: deep neural nets can be used to extract more complex semantic characteristics of a film, and this can now be achieved even on real-world datasets of thousands of movies.


Theodore Giannakopoulos
Behavioral Signals - Emotion AI

PhD in audio signal analysis and machine learning. Over 15 years in academia and startups. Currently Director of Machine Learning at Behavioral Signals.