Representing Visual Data in a Linguistically-Coherent Way
By Audrey Beard
What’s this all about? 🤷‍♀️
Let’s talk about actions. They’re not entities in the same way objects are; they’re transformations. Think about a slice of pizza — you’ve probably got a pretty solid mental image, or prototype, of it. Now think about the action “running” — maybe you’re thinking of a jogging person, a galloping horse, an open water faucet, or an idling car. These examples demonstrate the difference between entities and actions: entities can be generalized into some prototype* that’s mostly consistent, whereas actions transform entities and can look totally different depending on what is performing them.
So far, actions seem to be pretty similar to attributes — a topic that’s enjoyed a lot of attention from the computer vision community (S/O to Tushar Nagarajan and Kristen Grauman). However, actions are not attributes. Actions may be transitive or intransitive — think of the difference between “I sleep” and “I eat pizza”. “I sleep” is an intransitive action, which means it is performed by me (the subject) and doesn’t have an explicit object. On the other hand, “I eat pizza” is a transitive action, meaning it’s performed by me (the subject) and performed on pizza (the object). In English, transitive actions are typically arranged in subject-verb-object order, so we’ll just call them SVO actions. We’ll talk about intransitive (SV) actions later, so just bear with me.
Gimme the deets! 👇
Ok, so we’ve got this framework for thinking about SVO actions — now let’s talk about how we can represent them. If subjects and objects are entities and can be “prototyped,” then we can imagine them as points in a coordinate system. Verbs, on the other hand, aren’t represented as points, but rather as transformations on these points, since verbs are things nouns do. Verbs affect subjects and objects differently, so the distinction between them is important — “person grooming dog” isn’t the same as “dog grooming person.” This is true linguistically, visually, and semantically, so our approach emulates that. Below you can see an example of how the base verb “groom” affects the subject and object differently:
In the above example, we can see that “person” and “dog” both sit somewhere in this 2D space. By transforming them by the subject-version of “groom” (in grey) or the object-version (black), we move them to a different point in space (blue, red, and green). You could think of this as the difference between “grooming” and “being groomed,” since the subject does the grooming and the object is what is being groomed.
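As a minimal sketch of this idea: nouns are points, and each verb comes with a subject-role transformation and an object-role transformation. The 2D embeddings and verb matrices below are toy values I made up for illustration, not the learned ones from the actual model.

```python
import numpy as np

# Hypothetical 2D noun embeddings (toy values, not learned ones)
nouns = {
    "person": np.array([1.0, 2.0]),
    "dog": np.array([3.0, 1.0]),
}

# A verb acts as two different linear maps: one for the subject role
# ("grooming") and one for the object role ("being groomed").
# These matrices are invented for illustration.
groom_subject = np.array([[0.9, 0.2],
                          [-0.1, 1.1]])
groom_object = np.array([[1.2, -0.3],
                         [0.4, 0.8]])

# The same verb moves "person" and "dog" to different points in space
# depending on which role each noun plays.
person_grooming = groom_subject @ nouns["person"]    # subject-version of "groom"
dog_being_groomed = groom_object @ nouns["dog"]      # object-version of "groom"
```

Because the subject and object maps differ, “person grooming dog” and “dog grooming person” land at different points, matching the linguistic distinction.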
SVO actions are the combination of subject-verb and object-verb pairs — “person grooming dog” is simply the combination of “person grooming” with “dog being groomed.” Similarly, “person grooming person” is the combination of “person grooming” with “person being groomed.” Check out this example below:
In this example, we combine the previously found SV and OV points to create two SVO actions. You may be thinking, “What about intransitive actions?” I assert that they really aren’t that different from transitive actions; they’re just missing an object. With that in mind, we can treat them as special cases of SVO actions, with O = 0.
This is the core of what I’ve been working on at TRASH this summer — representing actions as combinations of subjects, verbs, and objects. My work combines linguistics, computer vision, machine learning, and natural language processing to allow us to model actions more richly than “running” or “grooming” — in a way consistent with our linguistic grammar of action description.
Ok, but why?
This gives us a really neat way of finding related videos and exploring the space of possible actions! By finding the SVO identity of an input video and tweaking it along one of those dimensions (subject, verb, or object), we can explore the space of similar actions in video much faster than we could manually. Whether it’s a scene of two people talking in a coffee shop, a football game highlight reel, or a music video, editors often create transitions between two related shots. Since editing takes so long and much of that time is spent finding relevant shots, this can give editors the freedom to spend less time digging through hours of footage and more time composing that footage in creative ways!
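One way that exploration could look in code: embed the query in SVO space, then rank an index of video embeddings by similarity. Everything here is a hedged sketch with a made-up index; `nearest_videos`, cosine similarity, and the numbers are my illustrative assumptions, not TRASH’s actual retrieval system.

```python
import numpy as np

def nearest_videos(query, video_embeddings, k=3):
    """Rank indexed video embeddings by cosine similarity to an SVO query."""
    q = query / np.linalg.norm(query)
    V = video_embeddings / np.linalg.norm(video_embeddings, axis=1, keepdims=True)
    sims = V @ q
    return np.argsort(-sims)[:k]

# Tiny made-up index: three "videos" embedded in a 4D SVO space.
index = np.array([
    [1.3, 2.1, 3.3, 2.0],    # e.g. "person grooming dog"
    [1.3, 2.1, 0.0, 0.0],    # e.g. "person grooming" (intransitive, O = 0)
    [-2.0, 0.5, 1.0, -1.0],  # something unrelated
])

query = np.array([1.3, 2.1, 3.3, 2.0])
ranked = nearest_videos(query, index, k=2)
```

Tweaking just the object half of the query (while holding the subject and verb halves fixed) would pull up clips of the same subject performing the same action on different objects — the kind of “related shot” an editor might want next.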
My name’s Audrey, and I’m a Ph.D. Research Intern at TRASH. I’ve spent this summer exploring methods of representing visual data in a linguistically-coherent way. To do this, I’ve focused on video action recognition, learned metric embeddings, and complex loss functions for zero-shot learning and re-identification. I’ve been really excited about the creative potential of my work, and producing video supercuts is a fun way of exploring the space. I’ve worked with metric embedding learning before, so this project was a great combination of familiar tactics and novel ideas. Helping me out with this research is Dr. Geneviève Patterson. I’m studying computer science at Rensselaer Polytechnic Institute with Dr. Charles Stewart, and you can find me on GitHub, Twitter, LinkedIn, or email. Thank you to TRASH for making this work possible!
*Many science and technology studies researchers have discussed the problems caused by reducing all instances of a class to one prototype. While it is very often problematic, it can be useful for challenges like this.
– Audrey, Gen (& Team TRASH)