Emotion-aware movie characterization with Oliver API

Content-based Movie Recommendation Systems

Content-based indexing methods have been helping us manage the huge amounts of multimedia data available online through intelligent content search and recommendation. A popular application in that context is movie recommendation systems, which are based on

  • user preferences (collaborative systems) or
  • movie attributes that are statistically mapped to the user preferences (content-based systems) or
  • both (hybrid systems)

Today, state-of-the-art movie recommendation systems are either collaborative (such as MovieLens), content-based (like Jinni), or hybrid (as is IMDb). Content-based systems typically rely on human-generated metadata without taking into account the raw multimodal content of the movie itself, i.e. its audio, visual and textual channels. To utilize that content, automatically extracting high-level content descriptors from the movie's multimodal signal is necessary for enhancing the content-based indexing process. Towards this end, a few research efforts have focused on extracting information from the audio, text and visual domains to model the dialogs, colors, movements and objects that appear in movies.


What about emotions in movies?

So multimodal information can be used to extract cues from movies that make the recommendation process more "insightful", since it then also takes into account what we see and hear when watching a movie. But what about the underlying emotions of movies? Clearly, the emotions expressed by the actors can also play an important role in building a high-level content representation that makes recommendation and search richer. In other words, the negative, positive, strong and weak emotions that appear in a movie can influence our movie preferences.


Extract emotions from speech using Oliver API

Oliver is Behavioral Signals’ Emotion Artificial Intelligence (#EmotionAI) API. Developers can directly benefit from Oliver’s growing emotional intelligence, measure emotions and behaviors in conversations, and utilize our continuously evolving robust analytics in their own applications. Whether that involves developing a virtual assistant (VA) for a business, an interactive game for children, a voice-controlled speaker for the home, or a social robot designed to assist the elderly, incorporating emotion-aware spoken language understanding will supercharge your users’ experience. One can use Oliver to send audio or video data and retrieve automatically generated behavioral annotations in JSON format.

The AI behind Oliver recognizes emotional states and behaviors based on the way people talk, not what they say. This way of expressing emotional status can differentiate between movie types, directors and actors. A demo paper presented by Behavioral Signals at the 2019 Content-Based Multimedia Indexing (CBMI) conference demonstrates how to utilize frame-level speech emotion recognition results, produced by the Oliver API, in the context of a movie content characterization pipeline [1].


Emotion representation

In particular, the core idea behind the paper, titled “Using Oliver API for emotion-aware movie content characterization”, is to utilize the emotional predictions provided by Oliver for a set of recordings from famous movies, in order to demonstrate how emotional information can be used to differentiate movie content. Towards this end, the ML team at Behavioral Signals selected 60 movies from 8 famous movie directors. For each movie, the audio recording is sent to the Oliver API in order to get frame-level emotional estimates. The two types of API responses used in this demo (ASR and emotions) are shared as JSON files here. The format of these files can be found on the Oliver Doc page. Each audio frame of each movie has been automatically labelled by the Oliver API with one of the following 6 audio classes: non-speech, neutral, angry, sad, happy and ambiguous.
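As an illustration, the frame-level labels can be pulled out of such a response with a few lines of Python. Note that the JSON layout below is a hypothetical stand-in chosen for the sketch; the actual schema is documented on the Oliver Doc page:

```python
import json

# Hypothetical frame-level emotion response (NOT the actual Oliver schema;
# see the Oliver Doc page for the real format).
sample_response = json.loads("""
{
  "frames": [
    {"start": 0.0, "end": 0.1, "label": "non-speech"},
    {"start": 0.1, "end": 0.2, "label": "neutral"},
    {"start": 0.2, "end": 0.3, "label": "angry"}
  ]
}
""")

# Extract the per-frame label sequence for the whole recording
labels = [frame["label"] for frame in sample_response["frames"]]
print(labels)  # ['non-speech', 'neutral', 'angry']
```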

Once the sequences of emotion estimates are retrieved for each movie's audio recording, we proceed with a simple aggregation of these emotions. Specifically, the percentage of each emotional class (including non-speech) is calculated over the whole movie. This leads to a 6-dimensional feature vector, one percentage per audio class. This simple representation aims to demonstrate the ability of the API to produce "emotional signatures" for audio streams.
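The aggregation step can be sketched as follows; the six class names come from the article, while the function name is our own:

```python
from collections import Counter

# The 6 audio classes produced by the API, per the article
CLASSES = ["non-speech", "neutral", "angry", "sad", "happy", "ambiguous"]

def emotion_signature(frame_labels):
    """Aggregate a movie's frame-level labels into a 6-D vector of
    class percentages (the movie's "emotional signature")."""
    counts = Counter(frame_labels)
    total = len(frame_labels)
    return [counts[c] / total for c in CLASSES]

sig = emotion_signature(["neutral", "neutral", "angry", "non-speech"])
print(sig)  # [0.25, 0.5, 0.25, 0.0, 0.0, 0.0]
```

The resulting vector always sums to 1, since every frame falls into exactly one of the six classes.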

Text representation

In addition to the speech emotions estimated by the API for each input audio stream, the ASR output has also been used, in order to achieve a text-based content representation. Towards this end, the text returned in the ASR JSON is parsed and GloVe embeddings are applied; GloVe is a widely adopted unsupervised learning method for extracting vector representations at the word level. In particular, for demo purposes the Behavioral Signals ML team chose the 50-dimensional GloVe representation trained on the Wikipedia dataset (6B). These embeddings yield one vector per word. Finally, the word vectors are averaged to form a fixed-size text representation for the file as a whole.
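A minimal sketch of this averaging step is shown below. The loader assumes the standard GloVe text format (one word followed by its float components per line); the toy two-dimensional dictionary at the bottom is just a stand-in for the real 50-D `glove.6B.50d.txt` file, to keep the example self-contained:

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a text file: each line is a word
    followed by its space-separated float components."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def text_embedding(words, vecs, dim=50):
    """Average the word vectors of a transcript into one fixed-size
    vector; out-of-vocabulary words are skipped."""
    hits = [vecs[w] for w in words if w in vecs]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)

# Toy 2-D stand-in for glove.6B.50d.txt, just to show the computation
toy = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
emb = text_embedding(["good", "movie", "oov"], toy, dim=2)
print(emb)  # [0.5 0.5]
```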

Visualization

To visualize the movie content in 2-D space, we chose to apply Principal Component Analysis (PCA) as a dimensionality reduction method, mapping each of the two aforementioned feature representations (the 6-D speech emotion aggregates and the 50-D text-based representations) to a single dimension. In addition, we trained a basic Support Vector Machine classifier on the final 2-D representation space, with director names used as ground truth labels, to illustrate the "decision surfaces" between the individual movie directors.
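The pipeline can be sketched with scikit-learn as below. The random arrays are placeholders for the real 60-movie feature matrices, which are not reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in data: 60 movies, 6-D emotion signatures, 50-D text embeddings
emotion_feats = rng.random((60, 6))
text_feats = rng.random((60, 50))
directors = rng.integers(0, 8, size=60)  # 8 director labels as ground truth

# Reduce each representation to its first principal component...
x = PCA(n_components=1).fit_transform(emotion_feats)
y = PCA(n_components=1).fit_transform(text_feats)
# ...then stack them into the 2-D "emotion x text" space
points = np.hstack([x, y])

# Fit an SVM on the 2-D points to draw decision surfaces between directors
clf = SVC(kernel="rbf").fit(points, directors)
print(points.shape)  # (60, 2)
```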


The following figures illustrate the 2-D emotion-textual content for the movies of the 8 directors in our dataset. For visualization purposes, we decided to divide the directors into two groups, each represented in a separate figure. As explained above, the x-axis corresponds to the "emotional dimension" (how something is said), extracted by applying PCA on the emotional aggregations, while the y-axis illustrates the textual representation (what is said).

Note that the Coen brothers and Roman Polanski appear as the most "compact" directors in terms of both dimensions. Also, movies by Aronofsky and Scorsese are most often "outliers" with respect to both text and emotion. Finally, Woody Allen can be distinguished from Tarantino, the Coen brothers and Aronofsky using only the emotional representation, i.e. the horizontal PCA axis, with accuracy that reaches almost 100%.


In the future we will demonstrate how this emotional characterization of movies can be used in the context of a real-world movie recommendation system that takes into account several sources of low-level content descriptors.


[1] T. Giannakopoulos, S. Dimopoulos, G. Pantazopoulos, A. Chatziagapi, D. Sgouropoulos, A. Katsamanis, A. Potamianos, S. Narayanan, "Using Oliver API for emotion-aware movie content characterization", Content-Based Multimedia Indexing (CBMI) 2019, Dublin, Ireland.

Behavioral Signals - Emotion AI

Building the fastest evolving robust emotionAI engine

Written by

Theodore Giannakopoulos

PhD in audio signal analysis and machine learning. Over 10 years in academia. Currently Director of Machine Learning at Behavioral Signals.
