Can TFIDF be applied to Scene Interpretation?

Shamoon Siddiqui
Shallow Thoughts about Deep Learning
4 min read · Dec 28, 2018

Fair warning: I’ve done absolutely no research to see if this is currently being done or possible. So a lot of what I’m proposing may be simply… wrong. But here we go…

Overview

What I’m proposing is a mechanism for a computer system to “understand” what is happening in a video by utilizing a concept well known to NLP practitioners: focusing only on the parts of a particular scene that are… unexpected. Consider Claude Shannon’s information theory, in which information is more valuable if it’s unexpected (disclaimer: this is a GROSS generalization and oversimplification). If a scene in a movie has people, dialogue, a setting, etc. that are abnormal for movies in general, it can be said to be “important.”

What is TFIDF?

In the domain of Natural Language Processing (NLP), there exists a technique called “TFIDF”, which stands for Term Frequency-Inverse Document Frequency. The purpose is to determine how important a specific word (or phrase) is to a document relative to a corpus (set of documents).

Figure 1: Equation for TFIDF
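Since the figure itself isn’t reproduced here, a common formulation (one of several variants) looks like this:

tfidf(t, d, D) = tf(t, d) × idf(t, D), where idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

Here t is the term, d is the document, D is the corpus of N documents, and the denominator counts how many documents contain t.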

The equation above gives a score to each term for a given corpus. Intuitively, if you have a corpus of speeches and you want to see how important the term “dream” is for a particular speech, you can get the tfidf of that term by counting how often it appears in that speech (the TF part) and then seeing how INFREQUENTLY it appears in the rest of the speeches (the IDF part). Those values together give you some idea of how important a particular word is in a specific document.
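That intuition is short enough to sketch in code. This is a minimal, hand-rolled version (the speeches are made-up stand-ins, not real data), not a production implementation:

```python
import math

# Toy corpus: each "document" is one speech, tokenized into lowercase words.
# The speeches here are hypothetical stand-ins.
speeches = [
    "i have a dream that one day this dream will be real".split(),
    "we choose to go to the moon in this decade".split(),
    "ask not what your country can do for you".split(),
]

def tfidf(term, doc, corpus):
    # TF: how often the term appears in this document
    tf = doc.count(term) / len(doc)
    # IDF: log of (number of documents / documents containing the term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

score = tfidf("dream", speeches[0], speeches)
print(round(score, 4))
```

“dream” appears twice in the first speech and nowhere else, so it scores much higher there than a word like “this” that shows up across speeches.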

Object Recognition and Segmentation

There are currently many ways for a neural network to detect various objects in an image. I won’t get into all of them, but as an example, we can use something called an R-CNN (Region-based Convolutional Neural Network) to segment an image and identify various objects. This means that given a particular image, we can break it apart into regions and classify what items are in each region. This in itself is pretty cool, but I kept thinking about what it would mean to extend it a bit.

The Corpus

Imagine taking hundreds of movies across various genres and indexing each frame to find all objects in all frames. The level of description used for an item is arbitrary, but can be important.

Figure 2: Objects in a scene from The Matrix

In this picture, we have many, many, many objects. We have: a phone outlined in blue, a birdcage outlined in red, sunglasses outlined in green and of course a ton more that hasn’t been outlined. To calculate the “TF” for the phone, for example, we would count how many scenes have a phone in them. That would give us the “term frequency” for the phone. To calculate the “IDF” for the phone, we would count how many scenes have a phone across EVERY movie in our corpus.
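To make that concrete, here’s a sketch with made-up detection output (the movie names, scenes, and object labels are all hypothetical) where each scene is just the set of objects an object detector found in it:

```python
import math

# Hypothetical object-detection output: each movie is a list of scenes,
# and each scene is the set of object labels detected in it.
movies = {
    "matrix":  [{"phone", "sunglasses"}, {"phone", "birdcage"}, {"sunglasses"}],
    "western": [{"horse", "hat"}, {"horse", "revolver"}],
    "romcom":  [{"phone", "flowers"}, {"flowers", "ring"}],
}

def object_tfidf(obj, movie, corpus):
    scenes = corpus[movie]
    # TF: fraction of this movie's scenes that contain the object
    tf = sum(1 for s in scenes if obj in s) / len(scenes)
    # IDF: rarity of the object across every scene in the corpus
    all_scenes = [s for m in corpus.values() for s in m]
    df = sum(1 for s in all_scenes if obj in s)
    idf = math.log(len(all_scenes) / df) if df else 0.0
    return tf * idf

print(object_tfidf("sunglasses", "matrix", movies))
print(object_tfidf("phone", "matrix", movies))
```

In this toy corpus, phones also appear in the romcom, so sunglasses end up scoring higher for The Matrix, which matches the intuition that the rarer object is the more “telling” one.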

Styles and Specifications

Figure 3: Cell phones and Landline in The Matrix

Of course, fans of The Matrix will recognize that there are MANY kinds of phones used in that movie, each with a different significance. There are cell phones, landline phones and even phones in booths. So the database may have to record the specific kind of phone that is in the scene. A taxonomical definition might be helpful in this case to specify counts for “phones,” “cell phones,” “phonebooths,” and so on separately.
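One way such a taxonomy could work is to roll specific labels up into their parent categories when counting, so “cell phone” and “phonebooth” each get their own count and also credit the generic “phone” total. A minimal sketch (the labels and hierarchy are illustrative assumptions):

```python
# Hypothetical taxonomy: each specific label points to its parent category.
taxonomy = {
    "cell phone": "phone",
    "landline": "phone",
    "phonebooth": "phone",
    "phone": None,  # top of this branch
}

def count_with_ancestors(detections):
    # Counting a detection also increments every ancestor category,
    # so specific and generic counts stay available side by side.
    counts = {}
    for label in detections:
        while label is not None:
            counts[label] = counts.get(label, 0) + 1
            label = taxonomy.get(label)
    return counts

counts = count_with_ancestors(["cell phone", "landline", "cell phone"])
print(counts)
```

With counts kept at every level, the TFIDF scoring can then be run per specific label, per category, or both.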

However, in the case of the sunglasses, they certainly are cool, but the level of detail needed to capture their importance to the plot is minimal.

Relative Importance

For a machine to understand a scene, it has to figure out what that scene is “about.” I think a TFIDF analog might be a good way to start with that. It can also work for speech, but I assume a lot of that work is already done. In fact, you can use a script and just TFIDF it like crazy — but that’s not as interesting 😀.

Fractal Effects

This can be further extended to get term frequencies and inverse document frequencies not just across an entire movie, but by treating a particular scene or setting in a movie as a “document.” The frames can be documents and the corpus can be scenes, or any combination that makes sense (locations, actors in a scene, etc.).
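The nice part is that nothing in the scoring cares what a “document” actually is, so the exact same function can be reused at any granularity. A sketch with made-up labels at two levels:

```python
import math

def tfidf(term, doc, corpus):
    # Generic scorer: "doc" and the members of "corpus" are just bags of terms.
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * (math.log(len(corpus) / df) if df else 0.0)

# Granularity 1: frames as documents, one scene as the corpus (hypothetical labels).
scene = [["phone", "desk"], ["phone", "agent"], ["desk", "chair"]]
frame_score = tfidf("agent", scene[1], scene)

# Granularity 2: scenes as documents, the whole movie as the corpus.
movie = [["phone", "desk", "agent"], ["helicopter", "rooftop"], ["phone", "subway"]]
scene_score = tfidf("helicopter", movie[1], movie)
```

The “fractal” part is just swapping what goes into `doc` and `corpus`: frames within a scene, scenes within a movie, movies within a genre.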

Thoughts

This is purely hypothetical and I’ve done no work to see if this is feasible / useful. It may already be done by smarter minds than my own and I welcome comments pointing me to the relevant research.

To read more about deep learning, please visit my publication, Shamoon Siddiqui’s Shallow Thoughts About Deep Learning.
