Quantifying Parent-Child Interactions: Advancing Video Understanding with Multi-Modal LLMs

Yu-Cheng Tsai
Sage Ai
Sep 26, 2023 · 7 min read

Background

Sage has joined forces with Innovations in Development, Education, and the Mathematical Sciences (IDEMS) in an Ethical AI Hackathon to tackle some of the most pressing issues facing West Africa. One of the event's activities was to enhance the analysis of Parent-Child Interaction (PCI) videos.

PCI videos serve as a premier research tool for studying how parents interact with children below the age of three. Used in thousands of studies, these videos offer an insightful glimpse into parents' responses to their child's behavior. During a typical session, parents engage in everyday activities with their child, such as playing, feeding, or reading, for around five minutes at home. This interaction is recorded, after which trained professionals meticulously assess the footage. Evaluations generally focus on the child's emotional state (e.g., happy, sad, angry), the parent's attentiveness to the child's actions, and the harmony between parent and child. Each aspect usually receives a single score summarizing the entire session.

Existing Solution

The widely recognized open-source tool OpenPose transforms video content into time series of body-movement data. A Gated Recurrent Unit (GRU) model is then trained on these time series to derive parental responsiveness scores.
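To make this pipeline concrete, here is a minimal sketch (not the researchers' actual model) of how a GRU could map OpenPose keypoint sequences to a session-level score. The feature layout, two-person assumption, and dimensions are illustrative.

```python
# A minimal sketch (not the researchers' actual model) of scoring parental
# responsiveness from OpenPose keypoints with a GRU. Assumes each frame is
# flattened to 25 keypoints x (x, y, confidence) for two people = 150 features.
import torch
import torch.nn as nn

class ResponsivenessGRU(nn.Module):
    def __init__(self, n_features=150, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # single session-level score

    def forward(self, x):                  # x: (batch, frames, n_features)
        _, h = self.gru(x)                 # h: (1, batch, hidden)
        return self.head(h[-1]).squeeze(-1)

# Example: a 5-minute session sampled at 2 fps -> 600 frames of pose features.
model = ResponsivenessGRU()
score = model(torch.randn(1, 600, 150))    # one responsiveness score per clip
```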

Challenges

OpenPose, a pose-estimation model, processes video frame by frame and annotates each detected individual with a 25-point skeletal structure. However, it is not without flaws. It sometimes detects non-existent individuals and may swap labels between frames (for example, labeling person 1 as person 2, and vice versa, in subsequent frames). To address this, researchers designed a data-processing workflow in a Jupyter notebook that lets them scrutinize OpenPose's results and rectify its errors both algorithmically and manually. This approach is labor-intensive, however, and hinders the scalability of the workflow.
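One algorithmic fix for label swapping, sketched here under assumptions about the keypoint array layout, is to relabel the people detected in each frame by matching them to the nearest keypoint centroid in the previous frame:

```python
# A hedged sketch of one algorithmic fix for label swaps: relabel each person
# in frame t to the closest person (by keypoint centroid) in frame t-1.
# The (n_people, 25, 3) array layout with (x, y, confidence) is an assumption.
import numpy as np

def reorder_people(prev_frame, curr_frame):
    prev_centroids = prev_frame[:, :, :2].mean(axis=1)   # (n_people, 2)
    curr_centroids = curr_frame[:, :, :2].mean(axis=1)
    order, used = [], set()
    for centroid in prev_centroids:
        dists = np.linalg.norm(curr_centroids - centroid, axis=1)
        for idx in np.argsort(dists):                     # greedy nearest match
            if idx not in used:
                used.add(idx)
                order.append(int(idx))
                break
    return curr_frame[order]      # same people, labels consistent with frame t-1
```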

Dataset

Given data privacy considerations, the hackathon focused on the infant laughter dataset, which is approved for research. This comprehensive online experiment captured video footage from numerous families in their home environments. Using their laptops, parents performed five straightforward jokes for their infants, such as peekaboo, tearing paper, or placing a cup on their own head. Each joke was performed three times, and parents recorded whether their infant laughed as well as their own amusement level. The accessible dataset comprises roughly 1,500 brief clips (spanning 10–20 seconds), each capturing a single joke performance and the baby's response.

Requested Features

During the hackathon, the teams were asked to address these questions for our partner researchers:

  • Can participants develop a straightforward model to discern the type of joke and/or gauge the infant’s response?
  • Is it possible for them to tag occurrences of parental speech or detect infant laughter, aiding in model explainability?
  • Can teams utilize the dataset to craft a prototype workflow or user interface tailored for non-technical individuals?

Vision

We were interested in recent developments in multi-modal LLMs and were inspired to offer our research partners a long-term solution for video understanding.

Specifically, we set out to answer the question: is it possible to harness diverse information modalities to gain a deeper understanding of the world around us?

In the case of video understanding, the data can be decomposed into audio, image, and text. Can we leverage multi-modal LLMs to understand PCI videos?

Transformer NNs can encode and/or decode data of different modalities
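As a concrete starting point, here is a hedged sketch of decomposing a PCI clip into those three modalities, assuming the moviepy and openai-whisper packages are installed; the file paths and Whisper model size are placeholders.

```python
# A hedged sketch of splitting a clip into the three modalities: image frames,
# audio, and text. Assumes the moviepy and openai-whisper packages; "clip.mp4"
# is a placeholder path and "base" is just one Whisper model size.
from moviepy.editor import VideoFileClip
import whisper

video = VideoFileClip("clip.mp4")
frames = list(video.iter_frames(fps=1))           # one image (numpy array) per second
video.audio.write_audiofile("clip.wav")           # audio track

asr = whisper.load_model("base")
transcript = asr.transcribe("clip.wav")["text"]   # parental speech, if any
print(len(frames), "frames;", transcript[:80])
```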

Hackathon Solutions

In a two-day sprint, we explored two innovative solutions for enhanced video understanding, both harnessing the capabilities of multi-modal LLMs.

  1. Video-LLaMA:
Video-LLaMA leverages both a Video Q-Former and an Audio Q-Former

Video-LLaMA is a neural network architecture designed to understand videos in two main ways. First, it tracks how the visual scene changes frame by frame over time. Second, it pays attention to images and sound together. For the first, it uses a lightweight transformer called the Video Q-Former to aggregate frame-level embeddings from a frozen, pre-trained image encoder into video-level representations for the LLM. For the second, it uses ImageBind to encode the audio, and another transformer, the Audio Q-Former, then produces auditory embeddings for the LLM. A conceptual sketch of the Q-Former idea follows.
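This is a conceptual sketch, not Video-LLaMA's actual code: a small set of learnable query vectors cross-attends to per-frame embeddings, compressing a variable-length video into a fixed number of tokens for the LLM. All dimensions are illustrative.

```python
# A conceptual sketch (not Video-LLaMA's actual code) of the Q-Former idea:
# a small set of learnable queries cross-attends to per-frame embeddings and
# compresses a variable-length video into a fixed number of tokens for the LLM.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, dim=768, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frame_embeddings):           # (batch, n_frames, dim)
        q = self.queries.expand(frame_embeddings.size(0), -1, -1)
        tokens, _ = self.attn(q, frame_embeddings, frame_embeddings)
        return tokens                               # (batch, n_queries, dim)

video_tokens = TinyQFormer()(torch.randn(2, 60, 768))   # 60 frames -> 32 tokens each
```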

  2. VideoChat with GPT-4:

Perception Tools extract various sensory data and pass them to LLMs for video understanding

Key Highlights:

  • Superior spatiotemporal reasoning abilities.
  • Precise event localization.
  • Expertise in deducing causal relationships.
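To make the perception-tool pattern concrete, the sketch below turns sampled frames into captions and combines them with the audio transcript as textual evidence for GPT-4. It reuses the `frames` and `transcript` variables from the decomposition sketch above; the BLIP checkpoint is an illustrative choice, not necessarily what VideoChat itself uses.

```python
# A hedged sketch of the perception-tool pattern: turn sampled frames into
# captions and combine them with the audio transcript as textual "evidence"
# for GPT-4. Reuses `frames` and `transcript` from the decomposition sketch
# above; the BLIP checkpoint is an illustrative choice.
from transformers import pipeline
from PIL import Image

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captions = [captioner(Image.fromarray(f))[0]["generated_text"] for f in frames[::5]]

evidence = (
    "Per-frame captions:\n- " + "\n- ".join(captions)
    + f"\n\nAudio transcript:\n{transcript}"
)
# `evidence` is then embedded in the prompt sent to GPT-4 (see the chat-bot sketch below).
```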

Video Chat-Bot Architecture

We launched a Gradio chat-bot application using multi-modal LLMs, hosted on Sage AI's containerized Jupyter notebook server. The server also interfaces with Microsoft's Azure OpenAI GPT-4 endpoint through LangChain orchestration. This lets non-technical team members experiment with prompts and observe the results easily.

A simple chat interface, allowing non-technical users to upload or update a video and start talking about it
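Below is a minimal sketch of that wiring, assuming the gradio and langchain-openai packages; the deployment name and API version are placeholders, and credentials are read from the usual AZURE_OPENAI_* environment variables.

```python
# A minimal sketch of the chat-bot wiring, assuming the gradio and
# langchain-openai packages; the deployment name and API version are
# placeholders, and credentials come from the AZURE_OPENAI_* environment variables.
import gradio as gr
from langchain_openai import AzureChatOpenAI
from langchain_core.messages import HumanMessage

llm = AzureChatOpenAI(azure_deployment="gpt-4", openai_api_version="2023-07-01-preview")

def answer(message, history):
    # In the real app the prompt also carries the video-derived evidence
    # (captions, transcript) produced by the perception tools.
    return llm.invoke([HumanMessage(content=message)]).content

gr.ChatInterface(answer, title="PCI Video Chat").launch()
```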

Prompt Engineering and Evaluation

To refine and evaluate our video-understanding approach, we paid particular attention to three areas:

  1. Parent-Child Interaction (PCI) Indicators: Our focus was directed towards capturing the subtle yet significant indicators of PCI. Three major parameters were set as benchmarks: Warmth, Eye Gaze, and Laughter.
  2. Joke Recognition: Given the dataset’s emphasis on parents performing jokes, it was vital for our models to accurately identify the type of joke being showcased. Recognizing the type of joke gives an added layer of context, allowing for a more comprehensive interpretation of the parent-child dynamics during that moment.
  3. Prompt Strategies Assessment: Recognizing that the way we query or prompt our models can significantly affect the outcomes, we experimented with various prompt methodologies. By iterating and testing diverse approaches, we aimed to identify which strategies yielded the most accurate and consistent results.
An example of our prompt
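The screenshot of our actual prompt is not reproduced here; the prompt below is an illustrative reconstruction in the same spirit, structured around the three PCI indicators and the joke type (the wording is ours, not the original).

```
You are analyzing a short home video of a parent performing a joke for their infant.
Answer strictly from what is visible and audible in the clip.

1. Warmth: Does the parent show warmth toward the infant (tone of voice, smiling, touch)? Rate 1-5.
2. Eye gaze: Do the parent and infant make eye contact? Describe when.
3. Laughter: Does the infant laugh or smile? If so, immediately after which action?
4. Joke type: Which joke is performed (peekaboo, tearing paper, cup on head, other)?

If something cannot be determined from the clip, answer "not observed".
```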

By delving deep into these areas, our goal was to ensure that our models not only recognize actions but truly understand the essence and emotions behind each interaction.

Challenges, Pitfalls, and Next Steps

Our journey in refining video understanding using multi-modal LLMs was not without its challenges and learnings. A closer look at our experiences reveals the following key takeaways:

  1. Content Authenticity Concerns: Our initial evaluations pinpointed a notable issue: the model could generate content that was not actually present in the videos. This raised questions about the model's grounding in reality. To counteract this, one potential approach is to incorporate a Retrieval-Augmented Generation (RAG) mechanism that grounds the model's responses in video transcripts or related metadata, so that the generated content remains closely aligned with the source material (a minimal sketch of this idea follows this list).
  2. Limitations in Joke Recognition: As we delved deeper into the model’s capabilities, we realized that its proficiency in recognizing different types of jokes was somewhat limited. While it could easily identify straightforward jokes like peekaboo, its performance degraded with more complex jokes. This highlighted an area for improvement, especially given the dataset’s emphasis on parent-child joking interactions.
  3. Tailoring the Model for PCIs: Our belief is that generic pre-trained models may not be optimally suited for specialized tasks like understanding Parent-Child Interactions (PCIs) in videos. These videos predominantly feature children’s playful babbling and laughter, elements we believe aren’t frequently encountered by many existing pre-trained multi-modal LLMs. As a result, our next steps are clear: fine-tuning these models specifically with PCI-focused data will be essential. By doing this, we hope to improve the model’s sensitivity to the unique intricacies of parent-child dynamics, ensuring more accurate and nuanced interpretations of such interactions.
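Here is the hedged RAG sketch referenced in point 1: retrieve the transcript chunks most relevant to a question and instruct the model to answer only from them. It assumes the faiss, langchain-community, and langchain-openai packages, reuses the `transcript` variable from the earlier decomposition sketch, and all deployment names are placeholders.

```python
# A hedged sketch of RAG-style grounding: retrieve the transcript chunks most
# relevant to the question and instruct the model to answer only from them.
# Assumes faiss, langchain-community, and langchain-openai; reuses `transcript`
# from the decomposition sketch; deployment names are placeholders.
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureOpenAIEmbeddings

chunks = [transcript[i:i + 500] for i in range(0, len(transcript), 500)]
index = FAISS.from_texts(chunks, AzureOpenAIEmbeddings(azure_deployment="text-embedding"))

question = "Did the infant laugh after the peekaboo joke?"
context = "\n".join(d.page_content for d in index.similarity_search(question, k=2))
grounded_prompt = (
    "Answer using ONLY the evidence below; otherwise say 'not observed'.\n"
    f"Evidence:\n{context}\n\nQuestion: {question}"
)
```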

With these insights in hand, our partner researchers are equipped to navigate the complexities of this project, continuously refining our approach to achieve the desired results.

In conclusion, the team not only built a prototype chatbot tool that watches videos of parents and children interacting, but also provided researchers with guidance on further use and refinement of the approach. The tool indicates how engaged children and parents are with each other, which can help experts spot early on whether children are growing and developing as expected. We are delighted to contribute to the pursuit of a better world! Cheers. :)


Yu-Cheng Tsai
Sage Ai

A data scientist passionate about innovative AI products. Ph.D. in MAE from Princeton University; works at Sage AI. https://www.linkedin.com/in/yu-cheng-tsai/