Marrying Video and Network Traffic for Activity Recognition (and Beyond)

Shinan Liu
ACM UbiComp/ISWC 2023
6 min read · Jul 11, 2023

Coauthors: Shinan Liu, Tarun Mangla, Ted Shaowang, Jinjin Zhao, John Paparrizos, Sanjay Krishnan, Nick Feamster

This blog post is based on AMIR: Active Multimodal Interaction Recognition from Video and Network Traffic in Connected Environments (accepted at UbiComp/IMWUT '23). We have open-sourced our datasets, processed features, models, and analysis pipeline at https://amir-vidnet.github.io.

Activity recognition in our connected world has traditionally leaned heavily on video data. The technique is widely used in applications such as elder care, home automation, and safety and security monitoring. Despite its widespread adoption, however, relying on video data as the cornerstone of activity recognition can be problematic. Models trained on video data often lack robustness to environmental changes such as shifts in lighting or camera angle: if an interaction is not captured on camera, it cannot be analyzed.

In the digital era, we have witnessed an explosion of network-connected devices in our homes. From smart speakers to connected appliances, our interactions with these smart devices generate a significant amount of network activity. This observation raises an intriguing question: could we use this network data to complement camera-based interaction recognition?

Why a Perfect Pairing?

The answer lies in the general-purpose hardware many of us already have in our homes, devices that offer a wealth of data for analysis. Video data is typically captured by in-home cameras, such as pet cams or surveillance cameras, providing a physical perspective of the environment. Network traffic data, collected at the router, offers a cyber perspective that gives us insight into the devices used in our homes.

Activity Recognition Source Comparisons

Video data, though susceptible to environmental factors such as lighting and home layout, remains largely indifferent to the specific make, firmware, or type of networked devices used.

Conversely, network traffic data provides consistent readings across households, unaffected by environment but influenced by the particular devices in use.

Together, video and network traffic data provide a comprehensive view of both the physical and cyber aspects of a connected environment (smart home, building, factory, city, etc.).

A Comprehensive Dataset for Research

For the benefit of ongoing and future research, we’ve collected and released several versions of our dataset (including processed features), each designed to cater to different research needs.

This is the first dataset of its kind to investigate paired observations of video and network traffic. We conducted the data collection in the IoT Lab, which is outfitted with 32 IoT devices selected for their commercial market share, with one device representing each category.

We selected the five most representative devices to gather data on both physical and proximate interactions.

Devices in the IoT Lab

Between May 2020 and March 2022, we collected 1,960 demonstrations in three rounds. The data, equivalent to 30 hours of human labor, covers interactions with a fridge, a washer, an Alexa, and a Nestcam. Each demonstration averaged about 40.67 seconds of paired observations. In October 2022, we gathered an additional round of data to assess our approach's transferability across physical environments.

Dataset overview

Exploring a New Approach

In a recent study, we proposed the use of both video and network data for reliable interaction recognition in connected environments. We took a meta-learning-based approach to activity recognition, associating each labeled activity with both a video capture and a corresponding network traffic trace.

The implementation of this multimodal synthesis, however, isn’t trivial, particularly when it comes to real-world deployment. This challenge led to the development of AMIR (Active Multimodal Interaction Recognition), a robust and effective framework. AMIR operates by independently training models for video and network activity recognition and subsequently combining these models’ predictions using a meta-learning framework.
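To make this structure concrete, here is a minimal sketch of such a late-fusion design. This is not the AMIR implementation: the class name, the scikit-learn logistic-regression meta-learner, and the assumption that each modality model exposes a predict_proba method are illustrative choices only.

# Sketch of late fusion with a meta-learner (assumed design, not the AMIR code).
# Two pre-trained models -- one per modality -- each output class probabilities;
# a lightweight meta-learner is then trained on their concatenated predictions
# using a small set of paired (video, traffic) examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

class MetaFusion:
    def __init__(self, video_model, traffic_model):
        self.video_model = video_model      # f: video clip -> class probabilities
        self.traffic_model = traffic_model  # g: traffic trace -> class probabilities
        self.meta = LogisticRegression(max_iter=1000)

    def _features(self, video_batch, traffic_batch):
        # Concatenate the two probability vectors into one feature vector per example.
        p_video = self.video_model.predict_proba(video_batch)
        p_traffic = self.traffic_model.predict_proba(traffic_batch)
        return np.concatenate([p_video, p_traffic], axis=1)

    def fit(self, video_batch, traffic_batch, labels):
        # Only this step needs observations paired across both modalities.
        self.meta.fit(self._features(video_batch, traffic_batch), labels)
        return self

    def predict(self, video_batch, traffic_batch):
        return self.meta.predict(self._features(video_batch, traffic_batch))

The appeal of this design is that the modality-specific models can be trained independently on single-modality data; only the small meta-learner needs paired examples, which is exactly the data AMIR tries to collect sparingly.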

Suppose we have two pre-trained models from different modalities. One of AMIR's primary objectives is to minimize the number of "paired" examples needed to fuse them. To that end, we focus on deriving a weight distribution over classes that guides the active collection of paired examples.

Given the trained individual models f and g, we aim to create a distribution of weights W(f, g) over classes to guide the collection of paired examples. Our approach hinges on the value of paired examples in classes where neither model f nor g is very confident, i.e., where the predicted probabilities have the highest entropy (uncertainty).
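The exact scoring function is defined in the paper; as a rough sketch, assuming the uncertainty of an unlabeled example x is measured by the Shannon entropy of each model's predicted class distribution, a score of this flavor would be

H\bigl(f(x)\bigr) = -\sum_{c=1}^{C} f_c(x)\,\log f_c(x), \qquad s(x) = H\bigl(f(x)\bigr) + H\bigl(g(x)\bigr),

where f_c(x) and g_c(x) are the probabilities that models f and g assign to class c, and a larger s(x) indicates that neither model is confident about x.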

These scores are then grouped by class, averaged per class, and used to compute the per-class weights W(f) and W(g). The final decision is based on the average of W(f) and W(g). More detail on this process is available in Section 3 of the paper.
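As an illustration of this weighting step, the sketch below (assumed details, not the code released with the paper; in particular, grouping by each model's predicted class is an assumption here) turns per-example entropy scores into normalized per-class weights for each model and averages them into a single collection distribution.

# Sketch of deriving per-class collection weights from uncertainty scores
# (assumed details, not the AMIR implementation).
import numpy as np

def entropy(probs, eps=1e-12):
    # Shannon entropy of each row of an (n_examples, n_classes) probability matrix.
    return -np.sum(probs * np.log(probs + eps), axis=1)

def class_weights(probs):
    # Average the entropy scores per predicted class, then normalize into a distribution.
    scores = entropy(probs)
    preds = probs.argmax(axis=1)
    n_classes = probs.shape[1]
    per_class = np.array([
        scores[preds == c].mean() if np.any(preds == c) else 0.0
        for c in range(n_classes)
    ])
    return per_class / per_class.sum()

def collection_weights(video_probs, traffic_probs):
    # Final per-class weights: the average of W(f) and W(g). Classes with
    # higher weight are prioritized when collecting new paired examples.
    w_f = class_weights(video_probs)    # W(f)
    w_g = class_weights(traffic_probs)  # W(g)
    return (w_f + w_g) / 2.0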

Promising Results and Model Availability

We’re thrilled to share that our models are now publicly available, complete with their corresponding confusion matrices.

Our approach, whether applied in the lab or at home, has demonstrated considerable potential. It significantly reduces the number of "paired" demonstrations required for accurate activity recognition. In particular, our method needs up to 70.83% fewer samples than random data collection to reach an 85% F1 score, and it improves accuracy by 17.76% when given the same number of samples.

As we continue to explore and refine this method, we are excited about its potential in improving activity recognition in our increasingly connected environments.

Behind the Scenes of AMIR

The development of AMIR wasn’t an overnight venture — rather, it represents the culmination of a two-year-long intensive effort. During this time, we devoted hundreds of hours to creating a robust data collection pipeline and gathering an extensive dataset.

But the vision of AMIR extends beyond this. We believe that the potential applications of the AMIR framework are not confined to the combination of video and network data. Its versatility and robustness make it a viable tool for synthesizing a range of other modalities as well, providing a fertile ground for future exploration and innovation in multimodal data fusion.

With a strong commitment to the ethos of open science and collaborative research, we are proud to share our work with the wider scientific community. We have open-sourced our datasets, processed features, models, and the entire analysis pipeline, which are accessible at https://amir-vidnet.github.io.

Our hope is that these resources will not only advance the study of multimodal interaction recognition but also inspire novel applications and breakthroughs in the broader field of machine learning and data analysis.

Please cite our paper if you find this blog/dataset/framework helpful!

@article{liu2023amir,
  title = {AMIR: Active Multimodal Interaction Recognition from Video and Network Traffic in Connected Environments},
  author = {Liu, Shinan and Mangla, Tarun and Shaowang, Ted and Zhao, Jinjin and Paparrizos, John and Krishnan, Sanjay and Feamster, Nick},
  journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies},
  year = {2023},
  volume = {7},
  number = {1},
  issue_date = {March 2023},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3580818},
}

Shinan Liu is a Ph.D. student in Computer Science at the University of Chicago. Website: https://www.shinan.info