Unravelling the mystery of parent-child interaction with Artificial Intelligence

Published in Sage Ai · Jan 31, 2024

By Ariane Goudie, Bona Chow, Ernest Chow and Saed Hussain

Am I a good parent?

Now that I have your attention, a quick disclaimer: we won’t be answering that question. However, Sage does give us the flexibility to commit 5 days each year to a charitable cause, and so my team and I chose to spend ours helping researchers address the following:

  • What does good parenting look like?
  • How does society measure it?

Such questions have plagued parents across time and cultures, and in corporate speak, we needed a solution yesterday. So you’ll be pleased to know that we had parenting completely figured out by the end of a three-day AI hackathon…

Just kidding. What could we possibly do in an AI hackathon that would answer these questions?

There to guide us was Dr Caspar Addyman, a developmental psychologist specialising in learning in infancy. He has long studied the interactions between parent and child, and his research goal is to develop explicit measurements of parent-infant rapport, and ultimately to automate these measurements using AI.

This is part of a broader Global Parenting Initiative (GPI), the goal of which is to make evidence-based parenting guidance accessible to millions of parents in the under-resourced countries of the global south.

As nebulous as the concept of ‘good parenting’ is, can the qualities of a healthy rapport in an infant’s early years, so vital for their long-term development, be identified?

Quantifying the unquantifiable

To answer this question, Dr Addyman has turned to parent-child interaction (PCI) videos:

‘PCI videos are the gold standard measurement tool in research on parenting with children under 3 years old. The videos are scored for markers such as the child’s emotional state, how responsive the parent is to behaviours of the child, and the synchrony between the two.’

Typically these videos involve parents interacting with their child in some everyday activity (e.g. free play, feeding, reading a book) for about 5 minutes.

The assessment of a given parent-child interaction video is highly consistent across experts, making this a reliable and accurate method of assessing rapport. However, the process is deeply intuitive and qualitative, which makes explicit indicators of rapport difficult to pin down.

Infant laughter datasets: what’s in a laugh?

One thing is clear: for an evaluation of parent-infant rapport to be automated using AI, clear metrics are essential for transparency. Why was a given interaction classified as high or low responsiveness?

When analysing parent-child interactions, experts generally look for responses to certain stimuli. Examples of such responses include eye-gaze, head direction, facial expression, and infant laughter.

Curious as to whether these responses could be automatically detected, Dr Addyman conducted a large-scale online experiment, in which parents were tasked with performing a series of jokes for their children in their homes (e.g. peekaboo, tearing a piece of paper, putting a cup on their head) and recording whether their infant laughed or not.

The participants in this earlier study gave permission for their video recordings to be used in secondary analyses, so we were provided with approximately 1,500 short videos (10–20 seconds long), each containing a demonstration of a joke and the infant’s reaction.

Transparency in Simplicity

Challenged to automate the detection of parent-child synchrony in these videos (in a way that could rival manual expert assessment), we set ourselves the goal of:

  • Detecting instances of parental speech or infant laughter
  • Determining head direction, facial expression and eye-gaze from movement data
  • Prototyping an interface for non-technical users

All whilst maintaining a transparent methodology and preserving anonymity.

Ease of implementation and explainability were our top priorities for a hackathon solution, as we wanted to avoid the black-box scenario that can emerge from overly complex AI models.

Introducing: the Laughter-Segmenter

During this exploratory phase, we encountered recent research that presented a deep-learning-based model for laughter detection.

Given the time constraints of the hackathon, we opted not to retrain the AI model with our own dataset. Instead, we applied the existing model directly to build our Proof of Concept (PoC) solution.

When we ran the infant laughter audio files through it, the model proved highly effective at segmenting and time-stamping detected laughter, achieving a 70% accuracy rate. One limitation of this deep-learning model, however, is its reliance on a graphics processing unit (GPU): in CPU-only inference mode, it takes about 1.5 minutes on average to evaluate a four-minute audio file. The model also generated a significant number of false positives, though this could be mitigated by fine-tuning it on the infant laughter dataset.
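To make the pipeline concrete, here is a minimal sketch of the segmentation step. The published deep-learning detector is swapped for a crude energy-threshold stand-in so that the snippet runs on its own; only the output format, a list of time-stamped laughter segments, reflects what we actually worked with.

```python
# Minimal sketch of the laughter-segmentation pipeline.
# The pre-trained deep-learning detector is replaced by a trivial
# energy-based stand-in; only the output format (time-stamped
# segments written to CSV) mirrors the real workflow.
import csv
import librosa
import numpy as np


def placeholder_laughter_detector(audio, sr, frame_length=2048,
                                  hop_length=512, threshold=0.05):
    """Stand-in detector: flags high-energy frames as 'laughter'.

    The real model is a neural network; this only mimics its output:
    a list of (start_s, end_s) segments.
    """
    rms = librosa.feature.rms(y=audio, frame_length=frame_length,
                              hop_length=hop_length)[0]
    active = rms > threshold
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr,
                                   hop_length=hop_length)

    segments, start = [], None
    for t, flag in zip(times, active):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, times[-1]))
    return segments


def segment_file(path, out_csv="laughter_segments.csv"):
    # Load the clip's audio track (mono, resampled to 16 kHz).
    audio, sr = librosa.load(path, sr=16000, mono=True)
    segments = placeholder_laughter_detector(audio, sr)

    # Persist time-stamped segments so researchers can review them.
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start_s", "end_s"])
        writer.writerows(segments)
    return segments
```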

A prototype of our user-friendly interface. Researchers simply drag and drop audio files to receive a time-stamped breakdown of infant laughter.
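The prototype itself is not shown here, but a drag-and-drop front end of this kind can be stood up in a few lines. The sketch below is illustrative only: it assumes Gradio and reuses the segment_file helper from the previous snippet (the module name is hypothetical).

```python
# Illustrative drag-and-drop interface, assuming Gradio.
import gradio as gr

from laughter_segmenter import segment_file  # hypothetical module holding the sketch above


def analyse(audio_path):
    # One table row per detected laughter segment.
    return [[round(start, 1), round(end, 1)]
            for start, end in segment_file(audio_path)]


demo = gr.Interface(
    fn=analyse,
    inputs=gr.Audio(type="filepath", label="Drop an audio file here"),
    outputs=gr.Dataframe(headers=["laughter start (s)", "laughter end (s)"]),
    title="Infant laughter segmenter",
)

if __name__ == "__main__":
    demo.launch()
```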

What’s so funny?

While the identification of laughter is a step in the right direction, it is of limited use without context. What if we could capture the words spoken by a parent and use them to identify the type of joke? This could help establish causality between the joke and the laughter, getting us closer to a measurement of parent-child synchrony.

Using Google Cloud Speech-to-Text, Google’s cloud-based transcription service, we extracted sounds and words spoken by parents. This was done with varying degrees of success; ‘peekaboo’ jokes presented the model with difficulty, whereas ‘do you like my hat’ was detected with a high degree of accuracy.
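For anyone wanting to reproduce this step, a call to the Google Cloud Speech-to-Text Python client looks roughly like the sketch below (the filename and language code are placeholders, and credentials are assumed to be configured in the environment).

```python
# Rough sketch of transcribing a parent's speech with Google Cloud
# Speech-to-Text. Credentials are assumed to be set via
# GOOGLE_APPLICATION_CREDENTIALS; the file path is a placeholder.
from google.cloud import speech


def transcribe(path):
    client = speech.SpeechClient()

    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-GB",
    )

    response = client.recognize(config=config, audio=audio)
    # Each result holds one or more alternatives; take the top one.
    return " ".join(r.alternatives[0].transcript for r in response.results)


print(transcribe("joke_clip.wav"))  # e.g. "do you like my hat"
```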

Pose Estimation

Having succeeded in our detection of laughter from the audio files, we attempted to detect head direction, facial expression and eye-gaze from video data. Using a pre-trained pose-estimation model, YOLOv8, we detected facial points (nose/ear/eyes) on the infant and parent across video frames.
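For reference, extracting those facial keypoints with the pre-trained Ultralytics YOLOv8 pose model looks roughly like this (the video filename is a placeholder).

```python
# Sketch of facial keypoint extraction with the pre-trained YOLOv8
# pose model. The video path is a placeholder.
from ultralytics import YOLO

# In the COCO keypoint layout used by YOLOv8-pose, the first five
# points are: nose, left eye, right eye, left ear, right ear.
FACE_KEYPOINTS = ["nose", "left_eye", "right_eye", "left_ear", "right_ear"]

model = YOLO("yolov8n-pose.pt")  # downloads the pre-trained weights

for frame_idx, result in enumerate(model("pci_clip.mp4", stream=True)):
    # result.keypoints.xy has shape (num_people, 17, 2): one row of
    # (x, y) keypoints per person detected in the frame.
    for person_idx, kpts in enumerate(result.keypoints.xy):
        face = {name: kpts[i].tolist() for i, name in enumerate(FACE_KEYPOINTS)}
        print(frame_idx, person_idx, face)
```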

Due to time constraints, we focused most of our efforts on detecting laughter from the audio, as it is easily interpretable and preserves anonymity. With more time, we would use the facial points detected by YOLOv8 to determine gaze and head direction. Analysing the motion of both subjects could indicate whether they are moving towards or away from each other, further substantiating a measurement of parent-child synchrony; a hypothetical version of that analysis is sketched below.
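As a purely hypothetical illustration of that idea, the per-frame nose positions of the two detected people could be reduced to a proximity signal, with the sign of its trend suggesting whether the pair are approaching or moving apart.

```python
# Hypothetical follow-on analysis: turn per-frame nose positions of
# two detected people into a proximity signal whose trend suggests
# whether they are moving towards or away from each other.
import numpy as np


def proximity_trend(nose_xy_per_frame):
    """nose_xy_per_frame: list of ((x1, y1), (x2, y2)) pairs, one
    pair of nose coordinates (parent, infant) per video frame."""
    a = np.array(nose_xy_per_frame, dtype=float)       # (frames, 2, 2)
    distances = np.linalg.norm(a[:, 0] - a[:, 1], axis=1)

    # Fit a line to distance over time; a negative slope means the
    # two heads are, on average, getting closer.
    slope = np.polyfit(np.arange(len(distances)), distances, 1)[0]
    return distances, slope


# Toy example: two points drifting towards each other over 5 frames.
frames = [((0, 0), (100 - 10 * t, 0)) for t in range(5)]
dists, slope = proximity_trend(frames)
print(dists, "approaching" if slope < 0 else "receding")
```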

Examples of the pose-estimation model identifying objects and capturing movement from video data have been omitted to protect the privacy of the participants. However, one of our team members kindly volunteered a picture of himself, so for demonstrative purposes, here’s the result of the pose-estimation model applied to a picture of one of our machine learning engineers back in the day.

An example of YOLOv8 pose-estimation.

Results

So, where did three days of hard work get us?

Using the results of our Laughter-Segmenter, we were able to automate the detection of infant laughter in a way that is cheap, user-friendly and scalable.

Although our model doesn’t output a score for a parent-child interaction, identification of the infant laughter response to the joke stimulus is an important step towards an automated measurement of rapport.

Examples of laughter and speech detection for two separate videos capturing a joke and the infant’s reaction.

The advantage of this simple approach over more sophisticated AI models is transparency. A neural-network-based model can exploit many more signals from the data, but the decision process leading to its final output (the assessment of rapport) is non-linear and obscured by layers of complexity.

Ethics and Next Steps

If we are to correctly infer rapport from various signals in the video data, there are several things we must acknowledge.

Firstly, the effect of cultural differences, for example in the degree to which expressiveness in children is encouraged, is an important consideration in any assessment. So is the variation in cognitive, verbal and social abilities. For this reason, the inference of parent-child rapport from this data cannot hinge solely on the detection of laughter at the appropriate points. To avoid sacrificing accuracy for transparency, the detection of other signals such as head direction, facial expression and eye gaze is critical.

In rendering explicit the deeply intuitive process that psychologists undertake when assessing human interactions, another salient question arises: how important is the human observer? Is there information in the interaction that cannot be captured by video or audio, and that can only be detected by a human present in the room?

As with all good research, we began with questions and ended up asking more. By coupling further enhancements of this model with continuous assessment of its risks and benefits, we may well get some answers.
