TRI at ICCV 2021

Toyota Research Institute
Oct 6, 2021


Toyota Research Institute is sponsoring and participating in the International Conference on Computer Vision (ICCV).

The International Conference on Computer Vision (ICCV), held October 11–17, is a top international venue in computer science, with a focus on computer vision and machine learning. Toyota Research Institute (TRI) is once again a major sponsor and will be presenting six papers at the main conference. We hope you’ll join us to learn more about these scientific advancements and attend events where TRI researchers (bolded below) will be present. Be sure to check the latest presentation schedule to make sure you don’t miss us. We look forward to seeing you at ICCV!

Is Pseudo-Lidar needed for Monocular 3D Object detection?

Dennis Park*, Rares Ambrus*, Vitor Guizilini, Jie Li, Adrien Gaidon

Recent progress in 3D object detection from single images leverages monocular depth estimation to produce 3D point clouds, turning cameras into pseudo-lidar sensors. These two-stage detectors improve with the accuracy of the intermediate depth estimation network, which can itself be improved without manual labels via large-scale self-supervised learning. However, they tend to suffer from overfitting more than end-to-end methods, are more complex, and the gap with similar lidar-based detectors remains significant. In this work, we propose an end-to-end, single-stage, monocular 3D object detector, DD3D, that can benefit from depth pre-training like pseudo-lidar methods, but without their limitations. Our architecture is designed for effective information transfer between depth estimation and 3D detection, allowing us to scale with the amount of unlabeled pre-training data. Our method achieves state-of-the-art results on two challenging benchmarks, with 16.34% and 9.28% AP for Cars and Pedestrians (respectively) on the KITTI-3D benchmark, and 41.5% mAP on nuScenes.

[paper] [code]
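
To make the pseudo-lidar idea in the abstract above concrete, here is a minimal sketch of how a predicted depth map can be back-projected into a 3D point cloud under a pinhole camera model. It is not taken from the DD3D code: the function name `unproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all illustrative assumptions.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into camera-frame 3D points.

    depth: list of rows, depth[v][u] is the depth (in meters) at pixel (u, v).
    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point, in pixels).
    Returns a list of (X, Y, Z) tuples, skipping pixels with non-positive depth.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # treat non-positive depth as invalid
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map and intrinsics, chosen only for illustration.
depth_map = [[2.0, 4.0],
             [0.0, 2.0]]
cloud = unproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

A two-stage pseudo-lidar detector would pass a point cloud like `cloud` to a lidar-style 3D detector, whereas the abstract describes DD3D as a single-stage detector that avoids this intermediate step.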

Geometric Unsupervised Domain Adaptation for Semantic Segmentation

Vitor Guizilini, Jie Li, Rares Ambrus, Adrien Gaidon

Simulators can efficiently generate large amounts of labeled synthetic data with perfect supervision for hard-to-label tasks like semantic segmentation. However, they introduce a domain gap that severely hurts real-world performance. We propose to use self-supervised monocular depth estimation as a proxy task to bridge this gap and improve sim-to-real unsupervised domain adaptation (UDA). Our Geometric Unsupervised Domain Adaptation method (GUDA) learns a domain-invariant representation via a multi-task objective combining synthetic semantic supervision with real-world geometric constraints on videos. GUDA establishes a new state of the art in UDA for semantic segmentation on three benchmarks, outperforming methods that use domain adversarial learning, self-training, or other self-supervised proxy tasks. Furthermore, we show that our method scales well with the quality and quantity of synthetic data while also improving depth prediction.

[paper] [code]
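
The multi-task objective described above can be pictured as a weighted sum of a supervised semantic loss on synthetic data and a self-supervised geometric loss on real video. The sketch below is a simplified illustration under stated assumptions, not the GUDA implementation: the function names, the plain L1 photometric term, and the `geo_weight` parameter are all hypothetical, and the view synthesis that produces the reconstructed frame is not shown.

```python
import math


def semantic_loss(probs, labels):
    """Mean cross-entropy; probs[i] is a class-probability list for pixel i."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def photometric_loss(target, synthesized):
    """Mean absolute intensity difference between a real frame and its
    reconstruction from a neighboring frame (view synthesis is not shown)."""
    return sum(abs(a - b) for a, b in zip(target, synthesized)) / len(target)


def multitask_loss(syn_probs, syn_labels, real_target, real_synth, geo_weight=1.0):
    """Supervised semantics on synthetic data plus self-supervised geometry on real video."""
    sem = semantic_loss(syn_probs, syn_labels)
    geo = photometric_loss(real_target, real_synth)
    return sem + geo_weight * geo


# Toy inputs: two synthetic pixels with labels, two real pixel intensities.
loss = multitask_loss(
    syn_probs=[[0.5, 0.5], [0.25, 0.75]],
    syn_labels=[0, 1],
    real_target=[0.2, 0.4],
    real_synth=[0.1, 0.6],
    geo_weight=2.0,
)
```

Only the synthetic branch needs labels here; the real-video branch is supervised by reconstruction error alone, which is what lets depth act as a proxy task across the domain gap.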

Learning to Track with Object Permanence

Pavel Tokmakov, Jie Li, Wolfram Burgard, Adrien Gaidon

Tracking by detection, the dominant approach for online multi-object tracking, alternates between localization and re-identification steps. As a result, it strongly depends on the quality of instantaneous observations, often failing when objects are not fully visible. In contrast, tracking in humans is underpinned by the notion of object permanence: once an object is recognized, we are aware of its physical existence and can approximately localize it even under full occlusions. In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning. We build on top of the recent CenterTrack architecture, which takes pairs of frames as input, and extend it to videos of arbitrary length. To this end, we augment the model with a spatio-temporal, recurrent memory module, allowing it to reason about object locations and identities in the current frame using all the previous history. It is, however, not obvious how to train such an approach. We study this question on a new, large-scale, synthetic dataset for multi-object tracking, which provides ground truth annotations for invisible objects, and propose several approaches for supervising tracking behind occlusions. Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets thanks to its robustness to occlusions.

[paper] [code]

Warp-Refine Propagation: Semi-Supervised Auto-labeling via Cycle-consistency

Aditya Ganeshan, Alexis Vallet, Yasunori Kudo, Shin-ichi Maeda, Tommi Kerola, Rares Ambrus, Dennis Park, Adrien Gaidon

Deep learning models for semantic segmentation rely on expensive, large-scale, manually annotated datasets. Labelling is a tedious process that can take hours per image. Automatically annotating video sequences by propagating sparsely labeled frames through time is a more scalable alternative. In this work, we propose a novel label propagation method, termed Warp-Refine Propagation, that combines semantic cues with geometric cues to efficiently auto-label videos. Our method learns to refine geometrically-warped labels and infuse them with learned semantic priors in a semi-supervised setting by leveraging cycle-consistency across time. We quantitatively show that our method improves label-propagation by a noteworthy margin of 13.1 mIoU on the ApolloScape dataset. Furthermore, by training with the auto-labelled frames, we achieve competitive results on three semantic-segmentation benchmarks.
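
As a rough illustration of the geometric side of label propagation described above, the sketch below warps a label map along a flow field and measures how many labels survive a forward-then-backward cycle. It is a toy under stated assumptions, not the Warp-Refine Propagation method: the learned refinement network is omitted, and the function names, the backward-flow convention, and the `ignore` label are hypothetical.

```python
def warp_labels(labels, flow, ignore=-1):
    """Warp a label map with a backward flow field using nearest-neighbor lookup.

    labels: list of rows of integer class ids for the source frame.
    flow: flow[v][u] = (du, dv), offset from target pixel (u, v) to its source pixel.
    Pixels whose source falls outside the image get the `ignore` label.
    """
    h, w = len(labels), len(labels[0])
    out = [[ignore] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            du, dv = flow[v][u]
            su, sv = round(u + du), round(v + dv)
            if 0 <= su < w and 0 <= sv < h:
                out[v][u] = labels[sv][su]
    return out


def cycle_agreement(labels, flow_fwd, flow_bwd, ignore=-1):
    """Fraction of pixels whose label survives a forward-then-backward warp."""
    there = warp_labels(labels, flow_fwd, ignore)
    back = warp_labels(there, flow_bwd, ignore)
    total = sum(len(row) for row in labels)
    same = sum(a == b for ra, rb in zip(labels, back) for a, b in zip(ra, rb))
    return same / total


# Toy 1x4 frame whose content shifts right by one pixel in the next frame.
labels_t = [[0, 1, 1, 2]]
flow_fwd = [[(-1, 0)] * 4]  # each next-frame pixel looks one pixel to the left
flow_bwd = [[(1, 0)] * 4]   # each current-frame pixel looks one pixel to the right
labels_next = warp_labels(labels_t, flow_fwd)
agreement = cycle_agreement(labels_t, flow_fwd, flow_bwd)
```

In this toy case the label that leaves the image cannot be recovered on the way back, so the cycle does not close for that pixel. Disagreements of this kind are the sort of signal a cycle-consistency objective can exploit without extra annotation.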


LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

Zhijian Liu, Simon Stent, Jie Li, John Gideon, Song Han

Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex, which takes advantage of low-cost localized textual annotations (i.e., captions and synchronized mouseover gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions, and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10× or the target dataset by 2× while achieving comparable or even improved performance on COCO instance segmentation. When provided with the same amount of annotations, LocTex achieves around 4% higher accuracy than the previous state-of-the-art “vision+language” pre-training approach on the task of PASCAL VOC image classification.

[project page] [paper]
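
For readers unfamiliar with contrastive pre-training between images and captions, the sketch below shows a generic symmetric InfoNCE loss over a batch of paired embeddings. It is an assumption-laden illustration rather than the LocTex loss: the abstract does not specify the exact formulation, the `temperature` value is arbitrary, and the attention supervision from mouse traces is not modeled.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of matched (image, caption) embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal (matching) entry.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / n

    cols = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))


# Toy 2-D embeddings: correctly paired versus deliberately swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image is most similar to its own caption and large when the pairing is wrong, which is the pressure that pulls matching image and text features together.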

The Way To My Heart Is Through Contrastive Learning: Remote Photoplethysmography From Unlabeled Video

John Gideon*, Simon Stent*

The ability to reliably estimate physiological signals from video is a powerful tool in low-cost, pre-clinical health monitoring. In this work we propose a new approach to remote photoplethysmography (rPPG) — the measurement of blood volume changes from observations of a person’s face or skin. Similar to current state-of-the-art methods for rPPG, we apply neural networks to learn deep representations with invariance to nuisance image variation. In contrast to such methods, we employ a fully self-supervised training approach, which has no reliance on expensive ground truth physiological training data. Our proposed method uses contrastive learning with a weak prior over the frequency and temporal smoothness of the target signal of interest. We evaluate our approach on four rPPG datasets, showing that comparable or better results can be achieved compared to recent supervised deep learning methods but without using any annotation. In addition, we incorporate a learned saliency resampling module into both our unsupervised approach and supervised baseline. We show that by allowing the model to learn where to sample the input image, we can reduce the need for hand-engineered features while providing some interpretability into the model’s behavior and possible failure modes. We release code for our complete training and evaluation pipeline to encourage reproducible progress in this exciting new direction. In addition, we used our proposed approach as the basis of our winning entry to the ICCV 2021 Vision 4 Vitals Workshop Challenge.

[project page with code/paper links]
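
One way to picture a weak prior over the frequency of the target signal is to restrict attention to a plausible heart-rate band when reading a pulse rate off a signal's spectrum. The sketch below does only that, with a plain discrete Fourier transform. It is not the paper's contrastive training procedure, and the function name, the 40 to 180 beats-per-minute band, and the synthetic trace are illustrative assumptions.

```python
import cmath
import math


def dominant_rate_bpm(signal, fps, low_bpm=40.0, high_bpm=180.0):
    """Return the strongest periodic component of `signal`, in beats per minute.

    signal: list of samples taken at `fps` frames per second.
    low_bpm, high_bpm: assumed plausible heart-rate band (illustrative values).
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC offset
    best_bpm, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        bpm = 60.0 * k * fps / n  # frequency of DFT bin k, in beats per minute
        if not (low_bpm <= bpm <= high_bpm):
            continue  # outside the assumed physiological band
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm


# Synthetic 10-second trace at 30 fps: a 1.2 Hz (72 bpm) pulse plus slow 0.1 Hz drift.
fps = 30
trace = [math.sin(2 * math.pi * 1.2 * t / fps)
         + 0.5 * math.sin(2 * math.pi * 0.1 * t / fps)
         for t in range(300)]
rate = dominant_rate_bpm(trace, fps)
```

The slow drift falls outside the assumed band and is ignored, so the estimate reflects the pulse-like component only.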

3D Object Detection from Images (3DODI) — Oct. 11


The first Workshop on 3D Object Detection from Images (3DODI) aims to gather researchers and engineers from academia and industry to discuss the latest advances in image-based 3D object detection.

Dr. Dennis Park & Dr. Adrien Gaidon will both present TRI’s research on monocular 3D object detection.

2nd Workshop on Benchmarking Trajectory Forecasting Models (BTFM) — Oct. 16


This workshop will discuss recent advancements in human motion prediction with researchers in computer vision, robotics, and cognitive neuroscience who work on safely enabling autonomous systems to act proactively in complex contexts involving humans and moving objects.

Dr. Adrien Gaidon will discuss TRI’s recent advances in multi-agent trajectory forecasting and uncertainty modeling.

The ROAD Challenge: Event Detection for Situation Awareness in Autonomous Driving — Oct. 16


The goal of this workshop is to bring situation awareness to the forefront of autonomous driving research, where situation awareness is understood as the ability to create semantically useful representations of dynamic road scenes in terms of “road events”.

Dr. Adrien Gaidon will present TRI’s recent work on dynamic scene understanding.

Multi-Agent Interaction and Relational Reasoning (MAIR2) — Oct. 17


This workshop promotes research in modelling relations and interactions between agents (e.g., objects, robots, and humans) in research areas such as autonomous driving, scene understanding, human-robot interaction, intuitive physics, and dynamics modeling.

Dr. Rowan McAllister is co-organizing the first workshop on Multi-Agent Interaction and Relational Reasoning.

Autonomous Vehicle Vision (AVVision) — Oct. 17


This workshop aims to bring together industry professionals and academics to exchange ideas on the advancement of computer vision techniques for autonomous driving, discussing the state of the art as well as existing challenges in autonomous driving.

Dr. Rowan McAllister is co-organizing the second workshop on Autonomous Vehicle Vision.



Toyota Research Institute

Applied and forward-looking research to create a new world of mobility that's safe, reliable, accessible and pervasive.