Detecting Unconstrained Inter-Person
Conversations With A Smartwatch

Dawei Liang
ACM UbiComp/ISWC 2023
Oct 9, 2023 · 7 min read

Associated IMWUT publication: Automated Face-To-Face Conversation Detection on a Commodity Smartwatch with Acoustic Sensing

Paper co-authors: Alice Zhang, Edison Thomaz

Social interactions are an integral part of our daily lives. Through social interactions, we exchange information, thoughts, and feelings with other individuals. Hence, tracking and quantifying everyday social interactions provides insights for several domains, including computational behavior analysis, mental health monitoring, and social network studies.

While social interactions can take several forms, both verbal and non-verbal, one of the most common components of social interactions in our daily life is inter-personal spoken communication, i.e., conversations. Hence, as a first step to passively and objectively capture one’s moments of social interactions in unconstrained daily living, in this research, we investigate the automated inference of face-to-face conversations. We specifically explore the usage of a single commodity device that many people wear every day, i.e., the smartwatch.

Unobtrusive Sensing of Conversations

In the past, documenting one’s conversational events in daily life typically relied on the user’s self-reports. However, studies have shown that data collected by self-reporting can be biased. Furthermore, self-reporting is generally not scalable and is hard to deploy in longitudinal and unconstrained daily living scenarios. Beyond self-reports, researchers have also investigated methods to infer face-to-face conversations by leveraging physiological signals such as respiratory signals from the chest.

Among the various input modalities that can be captured by a commodity sensing device, human voice activities are directly related to the conversational process. Hence, detecting turn-taking in human voice can be a straightforward way to discover human conversational events. Another advantage of detecting conversations through voice is that sound is transmitted through air, so the sensing process can be implemented flexibly and with little obtrusiveness. We therefore unleash the audio listening capabilities of commodity smartwatches for this purpose.

Modeling Conversations

The detection of conversations is considered as a classification problem in our work. Specifically, we categorized human conversations based on turn-taking. A segment of speech is considered a conversation only when valid turn-taking is contained in the segment and shared between the user and at least one other social participant. We categorized other types of human speech, such as monologues, TV sound, or background human voice in public spaces, as other speech. Non-vocal noise and silence are categorized as ambient sounds. This way, we are able to build and deploy classification models for the problem.
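
To make this formulation concrete, here is a minimal sketch of the three-class label scheme. The helper function and its boolean inputs are hypothetical and only illustrate the turn-taking rule; they are not the annotation tooling used in the study.

```python
from enum import Enum

class SoundClass(Enum):
    CONVERSATION = 0   # valid turn-taking between the user and at least one other person
    OTHER_SPEECH = 1   # monologues, TV sound, background voices in public spaces
    AMBIENT = 2        # non-vocal noise and silence

def label_segment(has_user_speech: bool, has_other_speech: bool,
                  has_turn_taking: bool) -> SoundClass:
    """Illustrative labeling rule for one audio segment (hypothetical helper)."""
    if has_user_speech and has_other_speech and has_turn_taking:
        return SoundClass.CONVERSATION
    if has_user_speech or has_other_speech:
        return SoundClass.OTHER_SPEECH
    return SoundClass.AMBIENT
```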

Deep learning has been known to be powerful for speech-related tasks, so we leveraged deep learning models in our research as well. Below is a customized neural network architecture that we developed to detect conversations from audio captured on a watch.

The design principle of the model is based on feature fusion. It consists of two separate input branches, where the first branch is used to generate foreground speech representations from a pre-trained feature extractor, and the second branch is used to transform general acoustic spectrogram features. Foreground speech refers to voices captured in close proximity to the watch’s microphone, as we assume that such voices could be mostly generated by the smartwatch user. The outputs of the branches are then concatenated so that knowledge of the two types of input representations can be jointly learnt by the model. The final output of the model is the probability of each target sound class based on the input audio.
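
For illustration, here is a minimal two-branch feature-fusion sketch in Keras. The layer types, sizes, and input shapes are assumptions chosen for readability; they do not reproduce the exact architecture or hyperparameters from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

EMBED_DIM = 256               # assumed size of the pre-trained foreground-speech embedding
SPEC_FRAMES, SPEC_BINS = 96, 64  # assumed spectrogram shape (frames x frequency bins)
NUM_CLASSES = 3               # conversation, other speech, ambient sounds

def build_fusion_model() -> Model:
    # Branch 1: foreground-speech representation from a (frozen) pre-trained extractor.
    emb_in = layers.Input(shape=(EMBED_DIM,), name="foreground_embedding")
    x1 = layers.Dense(128, activation="relu")(emb_in)

    # Branch 2: general acoustic spectrogram features.
    spec_in = layers.Input(shape=(SPEC_FRAMES, SPEC_BINS, 1), name="spectrogram")
    x2 = layers.Conv2D(16, 3, activation="relu")(spec_in)
    x2 = layers.MaxPool2D()(x2)
    x2 = layers.Conv2D(32, 3, activation="relu")(x2)
    x2 = layers.GlobalAveragePooling2D()(x2)

    # Fusion: concatenate the two representations and classify jointly.
    fused = layers.Concatenate()([x1, x2])
    fused = layers.Dense(64, activation="relu")(fused)
    out = layers.Dense(NUM_CLASSES, activation="softmax", name="class_probabilities")(fused)
    return Model(inputs=[emb_in, spec_in], outputs=out)

model = build_fusion_model()
```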

Study Data Collection

To build and evaluate our conversation models, we conducted two sets of experiments. One was based on a semi-naturalistic scripted study with 39 participants, and the other was based on an unconstrained free-living study with four participants. Specifically, the audio of social interactions was recorded using a Google Fossil smartwatch worn by the participants.

In the semi-naturalistic study, the 39 participants were recruited from 18 different homes in the local Austin, TX area, forming 18 distinct social interaction groups, each based on a family. Ahead of the study, each participant group was provided with a study script documenting the types of social events that they should perform. The full list of social events is given in our paper.

To ensure that the study was naturalistic, the entire study was run remotely without the researchers being present. During the data collection, the acoustic environment of the homes was left as usual.

In the unconstrained study, one participant wore a watch for a whole week for data collection. The other three participants recorded for around three hours each. This study was unconstrained because it was unscripted, and a wide variety of daily activities and contexts were covered during the data collection process.

All the audio data was annotated by human researchers after the studies were completed. For more details of the annotation process, quality measures, and sample data, please refer to our full paper.

Detection Performance

In the semi-naturalistic study, we followed a leave-one-group-out (LOGO) evaluation scheme. In the unconstrained study, we selected the best checkpoint model from the semi-naturalistic study and performed only inference.
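
For readers unfamiliar with the scheme, a minimal LOGO sketch with scikit-learn is shown below. The arrays are placeholders standing in for the study’s segment features, labels, and group IDs (18 family-based groups).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: per-segment features X, class labels y, and a group ID per segment.
X = np.random.rand(100, 32)
y = np.random.randint(0, 3, size=100)
groups = np.random.randint(0, 18, size=100)   # 18 social interaction groups

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups)):
    # Train on all groups except one; evaluate on the held-out group.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # model.fit(X_train, y_train); model.evaluate(X_test, y_test)
```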

In general, our proposed model achieved compelling inference performance, especially when compared to other tested models, such as pre-trained CNN14, YAMNet, and a customized MobileNetV1. The proposed feature fusion method also substantially reduces the model size while maintaining performance, which is helpful for real-time deployment on a commercial smartwatch.

Below is the model’s inference performance (detection of conversations vs. non-conversations) for the unconstrained study, and we can see that the model’s performance is promising for most encountered scenarios. The model performed worst in a bar setting, which is expected given the mixture of the user’s foreground speech and background human voices.

System Prototyping

Our system was prototyped and tested on a Google Fossil Gen 5 smartwatch. In this version of the prototype, our application executes individual inference cycles independently when requested via the user interface. Once running, the application activates the device’s microphone for continuous audio recording at 16 kHz, computes FFT features, and calls the installed neural network model to run inference.
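
A rough desktop stand-in for one inference cycle might look like the following. The model file name, feature shapes, and framing parameters are assumptions, and the single-input feature tensor is a simplification of the two-branch fusion model above; the actual prototype implements this pipeline as a Wear OS application.

```python
import numpy as np
import tensorflow as tf

SAMPLE_RATE = 16000  # audio is captured at 16 kHz on the watch

# Hypothetical deployed checkpoint in TFLite format.
interpreter = tf.lite.Interpreter(model_path="conversation_model.tflite")
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

def run_inference_cycle(audio_window: np.ndarray) -> np.ndarray:
    """One inference cycle: FFT-based features from a raw audio window,
    then one TFLite model call returning per-class probabilities."""
    stft = tf.signal.stft(tf.convert_to_tensor(audio_window, dtype=tf.float32),
                          frame_length=400, frame_step=160)  # 25 ms / 10 ms frames at 16 kHz
    features = tf.abs(stft)[tf.newaxis, ..., tf.newaxis]      # add batch and channel dims
    interpreter.set_tensor(input_detail["index"], features.numpy())
    interpreter.invoke()
    return interpreter.get_tensor(output_detail["index"])     # class probabilities
```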

In this version, we further optimized our fusion model by pruning the model weights and adding post-training optimization. The table below shows the comparison of the deployed checkpoint sizes in TFLite format and the inference latency on the Fossil watch, with and without optimization.
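
As a hedged sketch of this optimization step, the snippet below uses the TensorFlow Model Optimization Toolkit for magnitude-based weight pruning followed by TFLite post-training optimization. The sparsity level and pruning schedule are assumptions for illustration, not the exact settings used in our deployment.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the trained Keras fusion model with magnitude pruning (50% sparsity assumed).
prune = tfmot.sparsity.keras.prune_low_magnitude
pruned_model = prune(
    model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0),
)
# ... fine-tune pruned_model with the UpdatePruningStep callback, then strip the wrappers.
export_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# Convert to TFLite with post-training optimization enabled.
converter = tf.lite.TFLiteConverter.from_keras_model(export_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()
with open("conversation_model_optimized.tflite", "wb") as f:
    f.write(tflite_bytes)
```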

Since energy consumption is another constraining factor for applications running on a watch, we also profiled the battery usage of the application. Specifically, the battery of a Fossil watch has a capacity of 310 mAh. In our experiment, one user wore a smartwatch with the application running continuously until the battery died.

As the baseline, the application performed continuous loops of inference cycles without any system- or power-level optimizations. We then implemented an energy-based voice activity detection (VAD) method to gate feature extraction and model inference. This reduces the application’s power consumption by identifying periods of silence, in which the audio energy falls below a pre-defined threshold and feature extraction and inference need not run. The figure below shows that gating the application based on thresholds of the 10th and 30th percentile audio energy levels helps to improve the battery life of running the application.
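
A minimal sketch of such energy-based gating is shown below. The frame sizes and calibration recording are assumptions; only the idea of thresholding at the 10th/30th percentile of audio energy is taken from the experiment.

```python
import numpy as np

def frame_energy(audio: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time RMS energy per frame, used as a simple energy-based VAD feature."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, frame_len)[::hop]
    return np.sqrt((frames ** 2).mean(axis=1))

# Threshold calibrated offline, e.g. the 30th percentile of energy levels observed
# in a reference recording (placeholder noise here stands in for real calibration audio).
reference_audio = np.random.randn(SAMPLE_RATE * 60).astype(np.float32) if "SAMPLE_RATE" in dir() \
    else np.random.randn(16000 * 60).astype(np.float32)
threshold = np.percentile(frame_energy(reference_audio), 30)

def should_run_inference(audio_window: np.ndarray) -> bool:
    # Gate: skip feature extraction and model inference when the window is quieter
    # than the threshold (likely silence).
    return frame_energy(audio_window).mean() > threshold
```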

In our paper, we discussed a few other possibilities to further optimize the battery consumption of the application, including using the location context and the user’s activities.

Conclusion

Automated detection of physical social interactions in daily living is helpful for quantifying user behaviors during interactions and potentially revealing the user’s mental health state. As a first step toward social interaction detection, we present a practical system for user conversation detection with a single smartwatch by leveraging its acoustic sensing capability. We designed a customized neural network for feature fusion, conducted real-world user studies for model development and evaluation, and prototyped the system on a commercial smartwatch. In the future, we hope to further improve the real-time performance and power consumption of the system to enable longitudinal daily usage for real-world users.
