Character-Centered Video Story Understanding with Hierarchical QA
DramaQA, a dataset proposed for the development of video story understanding AI, provides QA pairs organized by cognition-based difficulty levels as a hierarchical evaluation metric. It also provides coreference-resolved scripts and rich visual metadata centered on the characters in the video.
By Seyeon An
“Stories are the communal currency of humanity,” according to Tahir Shah. Stories have existed as long as humans have, and they will not cease to exist until humans do. They are indispensable tools for conveying what we see, hear, feel, and know. Stories can be transmitted by word of mouth, but also in many other forms: novels, cartoons, plays, and films. Humans not only listen to stories, they create them using these media. The ability to understand stories is thus a crucial part of human intelligence, one that sets humans apart from other species, and it suggests that the capacity to understand stories as humans do is a suitable target when developing human-level AI. Drama in particular, typically in the form of video, is a fitting medium, since it conveys a story through human senses such as sight and hearing and through actions.
In this post, we introduce DramaQA, a dataset that enables significant progress on computational understanding of the kind of complicated story commonly found in a drama. It addresses a problem that computer vision and natural language processing have not been able to handle until now. Story understanding, originally conceived as the communal currency of humanity, can become a currency of computers as well if DramaQA is put to use in further research. We have open-sourced the full dataset, and we have held challenges that encourage further AI development on top of it.
A Quick Overview of the DramaQA Dataset
For this line of video story understanding research, the DramaQA dataset was collected from the popular Korean drama Another Miss Oh, which has 18 episodes totaling 20.5 hours.
- This dataset contains 23,928 video clips of various lengths, each a sequence of video frames (3 frames per second), and 17,983 multiple-choice QA pairs with hierarchical difficulty levels (a concrete example entry is sketched below).
- It also includes rich character-centered annotations, such as visual bounding boxes, behaviors and emotions of main characters, and coreference-resolved scripts.
The figure below shows an overview of the DramaQA dataset.
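To make this concrete, below is a minimal sketch of what a single QA entry might look like when loaded in Python. The field names follow the conventions of the released files but are shown here for illustration, and the values are made up; the official starter code documents the exact schema.

import json

# Illustrative DramaQA QA entry. Field names follow the released files but
# should be verified against the official starter code; values are made up.
qa_entry = {
    "qid": 3896,
    "vid": "AnotherMissOh17_013_0261",   # identifier of the video clip
    "que": "Why did Haeyoung1 raise her voice?",
    "answers": [                         # five multiple-choice candidates
        "Because she was happy.",
        "Because she did not want to get married.",
        "Because she lost her phone.",
        "Because she was late for work.",
        "Because she met an old friend.",
    ],
    "correct_idx": 1,                    # index of the correct answer
    "q_level_mem": 2,                    # Memory Capacity level (1: shot, 2: scene)
    "q_level_logic": 4,                  # Logical Complexity level (1-4)
}

print(json.dumps(qa_entry, indent=2))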
QA Sets Classified by Levels of Difficulty
The ability to understand stories and answer questions about them differs according to the stage of cognitive development. To collect question-answer pairs with levels of difficulty, we propose two criteria: Memory Capacity and Logical Complexity.
- Memory Capacity (MC) is defined as the length of the video clip required to answer the question, and corresponds to working memory in the human cognitive process.
- Logical Complexity (LC) is defined as the number of logical reasoning steps required to answer the question, in line with the hierarchical stages of human development.
Memory Capacity: From a machine learning perspective, the longer the video and, consequently, the more data it contains, the harder it is to infer the answer from it. The VideoQA problem considers two levels of memory capacity: shot and scene.
- Level 1 (Shot): The questions at this level are based on video clips that are mostly less than about 10 seconds long and shot from a single camera angle.
- Level 2 (Scene): The questions at this level are based on clips that are about 1–10 minutes long and take place at a single location.
Logical Complexity: Complicated questions require more logical reasoning steps than simple ones; thus, the VideoQA set considers the logical complexity of a question as a second measure of difficulty. The DramaQA set defines four levels of logical complexity, from simple recall to high-level reasoning, mirroring the hierarchical stages of human development.
- Level 1 (Simple recall on one cue): The questions at this level can be answered by simple recall; they require only one supporting fact, represented as a triplet of the form {subject-relationship-object}.
- Level 2 (Simple analysis on multiple cues): These questions require recall of multiple supporting facts and simple inference over them.
- Level 3 (Intermediate cognition on dependent multiple cues): The questions at this level require multiple supporting facts, linked by a time factor, to answer.
- Level 4 (High-level reasoning for causality): The questions at this level involve reasoning about causality, the relationship between cause and effect in actions or situations, and typically begin with “Why”.
From these two criteria, four hierarchical difficulties are defined for the DramaQA dataset, as illustrated in the figure below:
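As a rough sketch of how the two criteria combine, the four tiers can be written down in code. The authoritative pairing of Memory Capacity and Logical Complexity levels is the one shown in the figure; the mapping below is our illustrative reading of it.

# A minimal sketch of the four hierarchical difficulties, assuming each tier
# pairs a Logical Complexity level with a Memory Capacity level as in the
# figure above; the authoritative pairing is the one defined by the dataset.
DIFFICULTY_TIERS = {
    1: ("shot",  "simple recall on one cue"),
    2: ("shot",  "simple analysis on multiple cues"),
    3: ("scene", "intermediate cognition on dependent multiple cues"),
    4: ("scene", "high-level reasoning for causality"),
}

def describe_difficulty(level: int) -> str:
    # Human-readable description of one difficulty tier.
    memory, logic = DIFFICULTY_TIERS[level]
    return f"Difficulty {level}: {logic} over a {memory}-level clip"

for lvl in sorted(DIFFICULTY_TIERS):
    print(describe_difficulty(lvl))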
Character-Centered Video Annotations
Another indispensable aspect of understanding stories is the characters. As characters are one of the three primary components of a story, along with settings and events, it is essential to focus on them to understand and convey stories accurately. Thus, DramaQA provides rich annotations for the main characters in the video contents. As visual metadata, main characters are localized in image frames sampled from the video clips and annotated not only with their names but also with behavior and emotion states.
Visual Metadata
- Bounding Box: In each image frame, bounding boxes of both a face rectangle and a full-body rectangle are annotated for the main characters, together with their names. In total, 20 main characters are annotated with their unique names.
- Behavior & Emotion: Along with bounding boxes, the behaviors and emotions of the characters shown in the image frames are annotated. Including a none behavior, a total of 28 behavioral verbs, such as drink, hold, and cook, are used to express behavior. Characters' emotions are labeled with 7 emotional adjectives: anger, disgust, fear, happiness, sadness, surprise, and neutral.
- You can check the lists of person_id, behavior, and emotion values here.
Here is an example of the visual metadata in JSON:
{
  "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
  "persons": [
    {
      "person_info": {
        "behavior": "stand up",
        "face_rect": {
          "min_x": 427,
          "min_y": 124,
          "max_x": 498,
          "max_y": 234
        },
        "full_rect": {
          "min_x": 330,
          "min_y": 74,
          "max_x": 569,
          "max_y": 617
        },
        "emotion": "Sadness"
      },
      "person_id": "Jiya"
    }
  ]
}
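As a quick illustration of consuming this metadata, the sketch below crops each annotated full-body box out of the frame image with Pillow. The file paths are hypothetical; substitute the actual locations of the released frames and annotation files.

import json
from PIL import Image  # pip install Pillow

# Hypothetical paths; point these at the released annotation and frame files.
with open("visual_metadata.json") as f:
    frame = json.load(f)  # an object like the example above

image = Image.open(frame["frame_id"] + ".jpg")

for person in frame["persons"]:
    info = person["person_info"]
    rect = info["full_rect"]
    # Crop the annotated full-body bounding box for this character.
    crop = image.crop((rect["min_x"], rect["min_y"], rect["max_x"], rect["max_y"]))
    crop.save(person["person_id"] + "_full_body.jpg")
    print(person["person_id"], info["behavior"], info["emotion"])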
Coreference Resolved Scripts
When reading a novel, we encounter numerous coreferences to the main characters, such as pronouns that refer back to previously mentioned characters (e.g., he/she/they). The same holds for dramas. It is essential to grasp exactly who these coreferences refer to, as in understanding “Who is talking to whom about who did what?” In DramaQA, these coreferences are resolved to enable the computer's full understanding of the story.
Here is an example of a coreference-resolved script in JSON:
"AnotherMissOh01_001_0109": {
"contained_subs": [
{
"et": "295.595",
"speaker": "Haeyoung1",
"st": "293.685",
"utter": "I(Heayoung1) said I(Heayoung1)'m not going to get married."
},
{
"et": "292.426",
"speaker": "Deogi",
"st": "290.376",
"utter": "Just what in the world are you(Heayoung1) trying to say now?"
}],
"et": "294.6",
"st": "291.56"
}
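Since resolved mentions are inlined in parentheses right after each referring word, they can be extracted with a simple regular expression. A minimal sketch, assuming the word(CharacterName) convention shown above:

import re

utterance = "I(Haeyoung1) said I(Haeyoung1)'m not going to get married."

# Each resolved mention has the form word(CharacterName); capture both parts.
MENTION = re.compile(r"(\w+)\((\w+)\)")

print(MENTION.findall(utterance))    # [('I', 'Haeyoung1'), ('I', 'Haeyoung1')]

# Strip the annotations to recover the surface form of the utterance.
print(MENTION.sub(r"\1", utterance))  # I said I'm not going to get married.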
Multi-level Context Matching Model
To utilize DramaQA to its fullest, we propose the Multi-level Context Matching model. An ideal QA model should understand the multi-modal story hierarchically, using the character-centered annotations as cues.
As illustrated in the figure below, our model consists of two streams and two levels. The two streams process both the visual input and the script. The two levels produce scores from both low-level and high-level representations. The low-level representations capture the context of the input stream together with the annotations of the main characters. From them, we obtain high-level representations using the character query that appears in the QA. A Context Matching module then produces a QA-aware sequence for each level, and the outputs of these sequences are converted into a score for each answer candidate so that the most appropriate answer can be selected.
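As a strongly simplified, hypothetical sketch of the Context Matching idea (not the authors' implementation), the snippet below attends over one stream's feature sequence with a pooled QA representation and turns the QA-aware summary into a score for one answer candidate, using PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextMatcher(nn.Module):
    # Toy context matching: attend over a stream with a QA query, then score
    # the QA-aware summary. A simplified sketch, not the authors' model.
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, stream: torch.Tensor, qa: torch.Tensor) -> torch.Tensor:
        # stream: (T, dim) sequence of visual or script features
        # qa:     (dim,)   pooled question + answer-candidate representation
        attn = F.softmax(stream @ qa, dim=0)          # (T,) attention over time
        context = attn @ stream                       # (dim,) QA-aware summary
        return self.scorer(torch.cat([context, qa]))  # scalar answer score

matcher = ContextMatcher(dim=128)
stream = torch.randn(30, 128)  # e.g. 30 time steps of low- or high-level features
qa = torch.randn(128)
print(matcher(stream, qa))     # one score; repeat per candidate, pick the argmax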
Comparison with Other Video QA Datasets
The DramaQA dataset:
- provides difficulty levels of the questions,
- provides rich information about characters, including visual metadata and coreference-resolved scripts, which aids story understanding research,
- aims for hierarchical understanding of the story by tackling both shot-level and scene-level video clips.
We also present a comparison of our dataset to some recently proposed video QA datasets:
As the table demonstrates, only the DramaQA dataset provides hierarchical QAs over both shot-level and scene-level videos together with character-centered visual metadata (bounding box, name, behavior, and emotion). It also provides the largest number of annotated images.
ECCV2020 VTT Workshop & DramaQA Challenge
To validate the DramaQA dataset and the evaluation scheme it provides, the DramaQA Challenge was held at the European Conference on Computer Vision (ECCV) as part of the VTT Workshop. ECCV is considered a top-tier computer vision conference, along with the International Conference on Computer Vision (ICCV) and the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
The VTT Workshop aims at human-level AI development by measuring levels of machine intelligence for video story understanding; it also featured a discussion on data-driven video understanding with six renowned presenters. The DramaQA Challenge, held as part of the workshop, put the proposed dataset and its evaluation scheme to an international test. Due to COVID-19, the conference was held virtually in 2020, with the following timeline:
- Mar 2020: DramaQA dataset paper registration
- Apr 2020: Release of starter code and dataset presentation (https://dramaqa.snu.ac.kr/Dataset)
- Jul 2020: Participant submission
- Aug 2020: Announcement of winning teams
The DramaQA Challenge uses, as its evaluation standard, the average response accuracy over the questions at the four levels of difficulty, so that a model's story understanding is evaluated at each hierarchical level. Nine teams from four countries (Republic of Korea, China, India, and Germany) participated, and a $1,200 reward was given to each of the three winning teams (GGANG from Seoul National University, SUDOKU from Xidian University, and HARD KAERI from the Korea Atomic Energy Research Institute).
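In other words, the challenge metric is simply the mean of the per-level accuracies, so strong performance on easy questions cannot mask failure on hard ones. A minimal sketch, with made-up numbers:

# The challenge score is the average response accuracy over the four
# difficulty levels; the accuracies below are made up for illustration.
def challenge_score(per_level_accuracy):
    assert set(per_level_accuracy) == {1, 2, 3, 4}
    return sum(per_level_accuracy.values()) / len(per_level_accuracy)

print(challenge_score({1: 0.82, 2: 0.75, 3: 0.64, 4: 0.58}))  # 0.6975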
More can be found on the DramaQA Challenge website.
Future Directions
The application area of the proposed DramaQA dataset is not limited to QA-based video story understanding. It could also be applied to tasks such as, but not limited to:
- emotion or behavior analysis of characters
- automatic coreference identification from scripts
- coreference resolution for visual-linguistic domain
Story understanding has never been easy for artificial intelligence, since stories have so far been regarded as a language exclusive to humans. Despite the challenges of developing intelligence that fully understands stories, we plan to extend the two criteria of hierarchical QA so that the dataset can handle longer and more complex video stories, along with expanding the coverage of the evaluation metric. We also plan to provide hierarchical character-centered story descriptions, objects, and places. Our ultimate goal is for DramaQA to inspire further work on video story understanding in artificial intelligence.
Acknowledgements
We thank Seongho Choi and the co-authors of the paper “DramaQA: Character-Centered Video Story Understanding with Hierarchical QA” for their contributions and discussions in preparing this blog post. The views and opinions expressed here are solely those of the authors.
This post is based on the following paper:
- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA, Seongho Choi, Kyoung-Woon On, Yu-Jung Heo, Ahjeong Seo, Youwon Jang, Minsu Lee, Byoung-Tak Zhang, Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021. arXiv, Project Website
Originally posted on our Notion blog on Mar 19, 2021.