VAX: Using Existing Video and Audio-based Activity Recognition
Models to Bootstrap Privacy-Sensitive Sensors

Prasoon Patidar
ACM UbiComp/ISWC 2023
5 min read · Sep 3, 2023

Co-authors/Advisors: Yuvraj Agarwal, Mayank Goel
Affiliation: School of Computer Science, Carnegie Mellon University

Hello, UbiComp/ISWC community! 🤩

We are excited to share the story behind our new research paper, VAX (Video Audio to ‘X’), which you might find interesting if you work on privacy-aware sensing technologies or human activity recognition, or are excited about ubiquitous sensing research in general. Let us start with the motivating question behind our research.

Motivation

Human activity recognition (HAR) has long been of interest to the machine learning and sensing research communities. So why do we not see widespread consumer adoption of HAR in smart homes?

Two major factors inhibit this transition of HAR from research to the consumer space.

A. User privacy concerns

A majority of HAR work focuses on activity recognition using audio or video sensing, as these modalities provide rich (and dense) signals for recognizing activities. However, they also introduce a wide variety of privacy concerns for users. Despite privacy assurances from the large companies that promote HAR using video/audio (e.g., Ring or Amazon Alexa), there have been many real-world cases of privacy leaks, whether accidental or malicious, which exacerbate the problem further.

Fig 1: Privacy concerns and mistrust towards video surveillance and virtual audio assistants.

To address these privacy challenges, researchers have explored sensing modalities that raise significantly fewer privacy concerns while still distinguishing between different types of activities. These modalities include:

  1. Movement sensing (using Doppler radars).
  2. Position sensing (using Lidars).
  3. Thermal sensing (using low-res infrared cameras).
  4. Sound localization (using Microphone arrays).

Fig 2 and Fig 3 show sensor signals across these sensing modalities for different types of in-home activities.

Fig 2: Movement and position sensing using Doppler and Lidar sensors over time for in-home activities.
Fig 3: Thermal data and Sound Localization information at a single timestamp (snapshot) for kitchen activities.

We can observe that, unlike audio and video, the signals from these privacy-sensitive sensors reveal little privacy-invasive information about users.

B. Reliability of Activity Recognition Methods

Despite raising far fewer privacy concerns, HAR using privacy-sensitive sensors faces a major barrier to widespread adoption: the lack of models that generalize across users and homes, for two reasons:

  1. Signals are highly correlated with the surrounding environment. Unlike audio, the signals most of these sensors produce for a given activity depend heavily on the type of environment and on sensor placement within it. Thus, building a model that works across multiple surroundings is difficult unless training data is collected at a very large scale.
  2. Post-facto labeling is not possible. Unlike audio and video, data collected using privacy-sensitive sensors cannot be labeled after the fact, which limits data collection to a small scale and imposes a significant labeling burden on users.

Our motivation behind VAX was to overcome these two challenges and take another step toward reliable in-home activity recognition models that do not impose long-term privacy risks or significant training effort on users.

System design for VAX

VAX is designed to combine the strengths of audio/video sensors (generalizable models) and privacy-sensitive sensors (low privacy burden) to build per-home activity recognition models with little (or no) user labeling burden.

The system design for VAX is based on two key insights.

  • Current-generation pre-built A/V models are good at identifying a wide variety of activity patterns and usually generalize across environments (a minimal sketch of this idea follows this list).
  • Combining multiple privacy-sensitive sensors lets us leverage the strengths of each modality to build accurate per-home models for a diverse set of activities.

Fig. 4 [left] shows the hardware design for VAX, which incorporates A/V as well as privacy-sensitive sensors along with a compute and storage unit. Fig. 4 [right] shows the end-to-end architecture for VAX. In summary:

  1. VAX proposes a novel method to bootstrap a set of off-the-shelf A/V-based ML models with some labeled data from a set of starter homes and then uses this ensemble of models to predict activities in new homes. We call this part of VAX “the A/V pipeline”.
  2. We then propose an unsupervised learning approach that uses unlabeled data from the ‘X’ (privacy-sensitive) sensors to increase the activity detection rate and reduce the impact of erroneous labels from the A/V pipeline.
  3. Finally, we propose an approach to train activity recognition models for the privacy-sensitive modalities using the labels provided by the A/V pipeline (a simplified sketch of stages 2 and 3 follows Fig 4).
Fig 4: [Left]-Hardware Rig for VAX, and [Right]-End-to-end architecture for VAX
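
As a rough illustration of stages 2 and 3, here is one simple way to realize them in Python with scikit-learn: cluster the unlabeled ‘X’-sensor features, propagate each cluster’s majority A/V pseudo-label to all of its members, and train a per-home classifier on the result. The choice of k-means, `n_clusters=20`, the random forest, and the function names are all assumptions for illustration; the paper’s actual algorithms differ in their details:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def denoise_pseudo_labels(x_features, pseudo_labels, n_clusters=20):
    """Stage 2 (illustrative): cluster windows of 'X'-sensor features and
    replace each window's noisy A/V pseudo-label with its cluster's
    majority label. -1 marks windows the A/V pipeline left unlabeled."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(x_features)
    denoised = np.full(pseudo_labels.shape, -1)
    for c in range(n_clusters):
        labeled = pseudo_labels[(clusters == c) & (pseudo_labels != -1)]
        if labeled.size:  # propagate the majority label to the whole cluster
            denoised[clusters == c] = np.bincount(labeled).argmax()
    return denoised

def train_x_model(x_features, pseudo_labels):
    """Stage 3 (illustrative): train a per-home activity classifier on
    privacy-sensitive sensor features using the cleaned-up labels."""
    labels = denoise_pseudo_labels(x_features, pseudo_labels)
    keep = labels != -1  # drop windows that never received a label
    clf = RandomForestClassifier(n_estimators=200)
    clf.fit(x_features[keep], labels[keep])
    return clf
```

The intuition behind the clustering step is that majority voting within a cluster lets unlabeled windows inherit labels (raising the detection rate) while outvoting occasional A/V mistakes (reducing label noise).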

Evaluation for VAX with baseline approaches

To evaluate VAX, we collected data across ten homes and 17 in-home activities in three different locations: the kitchen, living room, and bathroom (more details in the paper). We compare VAX with baseline approaches in two scenarios: (a) no user input in a new home (Fig 6 [left]) and (b) very little user input in a new home (Fig 6 [right]). We show that VAX performs significantly better in terms of both accuracy and the user burden induced while training in-home activity recognition models.

Fig 6: Comparing the performance of VAX with the baseline approach when no user input is present (left) and when some user input is present (right).

Without any user input, VAX achieves (a) 74% accuracy on all activities and (b) 84% accuracy on activities detected by the A/V pipeline. When user input is present, VAX performs on par with the baseline approaches while requiring significantly less input (2 labels/home vs. 17 labels/home for the baselines).

For more details on the implementation and evaluation, please read our paper (soon to be published in the ACM Digital Library). We would love for you to come meet us and see our talk at UbiComp/ISWC 2023 in Cancun.

You can find our source code at www.github.com/synergylabs/vax. We also plan to release tutorial videos and blog posts to help interested users replicate our work and build on top of our system.
