TRI At CVPR 2023

Toyota Research Institute
Jun 14, 2023


The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is the premier annual computer vision event, comprising the main conference and several co-located workshops and short courses. This year, CVPR will be single-track, so everyone with a full passport registration can attend everything. The focus will be on a few selected plenary talks, scientific discussions at poster sessions, and plenty of time for networking and socializing.

This year, Toyota Research Institute (TRI) is once again a Platinum sponsor and will be presenting new research findings and participating in a number of workshops. Check out the main conference and workshops below to learn where TRI researchers will be present. We look forward to talking to you at this year’s CVPR — you can find us at Booth 1130!

Note: Abstracts are pulled from papers and not all authors are TRI employees.


Synthetic Data for Autonomous Systems (SDAS)

Date: Sunday, June 18th, 2023

Location: West 302–305


Organizers: Omar Maher (Parallel Domain), Alex Zook (NVIDIA), Rares Ambrus (TRI), Dengxin Dai (MPI for Informatics)

Adrien Gaidon, Director of the Machine Learning Division at TRI, will give a talk about “Synthetic data for embodied foundations.”

Vision-Centric Autonomous Driving

Date: Monday, June 19th, 2023

Location: West 302–305


Organizers: Yue Wang (NVIDIA), Hang Zhao (Tsinghua University), Vitor Guizilini (TRI), Katie Driggs-Campbell (University of Illinois), Xin Wang (Microsoft Research)

Workshop Papers

Workshop: The Second Workshop on Structural and Compositional Learning on 3D Data

Workshop Paper: ROAD: Learning an Implicit Recursive Octree Auto-Decoder to Efficiently Encode 3D Shapes

Authors: Sergey Zakharov, Rares Ambrus, Katherine Liu, Adrien Gaidon

Details: Sunday, June 18th, 2023, 11:40 am — 12:30 pm PDT

Abstract: Compact and accurate representations of 3D shapes are central to many perception and robotics tasks. State-of-the-art learning-based methods can reconstruct single objects but scale poorly to large datasets. We present a novel recursive implicit representation to efficiently and accurately encode large datasets of complex 3D shapes by recursively traversing an implicit octree in latent space. Our implicit Recursive Octree Auto-Decoder (ROAD) learns a hierarchically structured latent space enabling state-of-the-art reconstruction results at a compression ratio above 99%. We also propose an efficient curriculum learning scheme that naturally exploits the coarse-to-fine properties of the underlying octree spatial representation. We explore the scaling law relating latent space dimension, dataset size, and reconstruction accuracy, showing that increasing the latent space dimension is enough to scale to large shape datasets. Finally, we show that our learned latent space encodes a coarse-to-fine hierarchical structure yielding reusable latents across different levels of details, and we provide qualitative evidence of generalization to novel shapes outside the training set.
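To make the coarse-to-fine octree idea in this abstract concrete, here is a minimal sketch of recursive octree subdivision. It is not ROAD itself: there is no latent space or learned decoder here, and the names `subdivide`, `traverse`, and `sphere_surface_test` are hypothetical. A hand-written sphere test stands in for whatever decides, per cell, whether to keep refining.

```python
import math
from typing import Callable, List, Tuple

# A cubic cell: (center_x, center_y, center_z, half_size).
Cell = Tuple[float, float, float, float]


def subdivide(cell: Cell) -> List[Cell]:
    """Split one cubic cell into its eight child octants."""
    cx, cy, cz, h = cell
    q = h / 2.0
    return [
        (cx + dx * q, cy + dy * q, cz + dz * q, q)
        for dx in (-1, 1)
        for dy in (-1, 1)
        for dz in (-1, 1)
    ]


def traverse(
    cell: Cell,
    near_surface: Callable[[Cell], bool],
    max_depth: int,
    depth: int = 0,
) -> List[Cell]:
    """Refine only cells that may contain the surface; return the leaf cells."""
    if not near_surface(cell):
        return []  # prune: nothing to refine in this cell
    if depth == max_depth:
        return [cell]
    leaves: List[Cell] = []
    for child in subdivide(cell):
        leaves.extend(traverse(child, near_surface, max_depth, depth + 1))
    return leaves


def sphere_surface_test(radius: float) -> Callable[[Cell], bool]:
    """Toy stand-in (assumption): conservative test for a sphere at the origin."""
    def near_surface(cell: Cell) -> bool:
        cx, cy, cz, h = cell
        dist = math.sqrt(cx * cx + cy * cy + cz * cz)
        # A cell can touch the sphere surface only if its center lies within
        # one half-diagonal (h * sqrt(3)) of that surface.
        return abs(dist - radius) <= h * math.sqrt(3.0)
    return near_surface


root: Cell = (0.0, 0.0, 0.0, 1.0)
leaves = traverse(root, sphere_surface_test(0.5), max_depth=3)
print(len(leaves), "leaf cells kept out of", 8 ** 3)
```

Because empty regions are pruned early, the number of cells kept at the finest level is smaller than the full grid, which is the property that makes octrees attractive for compact shape encoding.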

Workshop: 3DMV: Learning 3D with Multi-View Supervision

Workshop Paper: Depth Field Networks for Generalizable Multi-view Scene Representation

Authors: Vitor Guizilini*, Igor Vasiljevic*, Jiading Fang*, Rares Ambrus, Greg Shakhnarovich, Matthew Walter, Adrien Gaidon

Details: Monday, June 19th, 2023, 9:35 am — 10:30 am PDT

Abstract: Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.

Main Conference

Paper: Viewpoint Equivariance for Multi-View 3D Object Detection

Authors: Dian Chen, Jie Li, Vitor Guizilini, Rares Ambrus, Adrien Gaidon

Details: Wednesday, June 21st, 2023, 10:30 am PDT

Abstract: 3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input. In this work, we gain intuition from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting image features with positional encodings from their 3D perspective geometry. We design view-conditioned queries at the output level, which enables the generation of multiple virtual frames during training to learn viewpoint equivariance by enforcing multi-view consistency. The multi-view geometry injected at the input level as positional encodings and regularized at the loss level provides rich geometric cues for 3D object detection, leading to state-of-the-art performance on the nuScenes benchmark. The code and model are made available at

Paper: Multi-Object Manipulation via Object-Centric Neural Scattering Functions

Authors: Stephen Tian, Yancheng Cai, Hong-Xing Yu, Sergey Zakharov, Katherine Liu, Adrien Gaidon, Yunzhu Li, Jiajun Wu

Details: Wednesday, June 21st, 2023, 10:30 am PDT

Abstract: Learned visual dynamics models have proven effective for robotic manipulation tasks. Yet, it remains unclear how best to represent scenes involving multi-object interactions. Current methods decompose a scene into discrete objects, but they struggle with precise modeling and manipulation amid challenging lighting conditions as they only encode appearance tied with specific illuminations. In this work, we propose using object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework. OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting conditions. By combining this approach with inverse parameter estimation and graph-based neural dynamics models, we demonstrate improved model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and harsh lighting conditions.

Paper: Tracking Through Containers and Occluders in the Wild

Authors: Basile Van Hoorick, Pavel Tokmakov, Simon Stent, Jie Li, Carl Vondrick

Details: Wednesday, June 21st, 2023, 4:30 pm PDT

Abstract: Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence.

Paper: Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking

Authors: Ziqi Pang, Jie Li, Pavel Tokmakov, Dian Chen, Sergey Zagoruyko, Yu-Xiong Wang

Details: Thursday, June 22nd, 2023, 10:30 am PDT

Abstract: This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it “Past-and-Future reasoning for Tracking” (PF-Track). Specifically, our method adopts the “tracking by attention” framework and represents tracked instances coherently over time with object queries. To explicitly use historical cues, our “Past Reasoning” module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The “Future Reasoning” module digests historical information and predicts robust future trajectories. In the case of long-term occlusions, our method maintains the object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and remarkably reduces ID-Switches by 90% compared to prior approaches, which is an order of magnitude less. The code and models are made available at

Paper: Breaking the “Object” in Video Object Segmentation

Authors: Pavel Tokmakov, Jie Li, Adrien Gaidon

Details: Thursday, June 22nd, 2023, 4:30 pm PDT

Abstract: The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues. This motivates us to propose a few modifications for the top-performing baseline that improve its capabilities by better modeling spatio-temporal information. But more broadly, the hope is to stimulate discussion on learning more robust video object representations.

Paper: Object Discovery From Motion-Guided Tokens

Authors: Zhipeng Bao, Pavel Tokmakov, Yu-Xiong Wang, Adrien Gaidon, Martial Hebert

Details: Thursday, June 22nd, 2023, 4:30 pm PDT

Abstract: Object discovery — separating objects from the background without manual labels — is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders). In this work, we augment the auto-encoder representation learning framework with two key components: motion-guidance and mid-level feature tokenization. Although both have been separately investigated, we introduce a new transformer decoder showing that their benefits can compound thanks to motion-guided vector quantization. We show that our architecture effectively leverages the synergy between motion and tokenization, improving upon the state of the art on both synthetic and real datasets. Our approach enables the emergence of interpretable object-specific mid-level features, demonstrating the benefits of motion-guidance (no labeling) and quantization (interpretability, memory efficiency).
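For readers unfamiliar with the vector quantization mentioned in this abstract, the sketch below shows the basic operation: snapping each feature vector to its nearest entry in a codebook. This is plain nearest-neighbor quantization only. It does not include motion guidance or the paper's transformer decoder, and the function names and toy values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature to the index and value of its closest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feature in features:
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy data: two codebook entries ("tokens") and three 2D features.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
indices, quantized = quantize(features, codebook)
print(indices)
```

Storing the small integer index instead of the full feature vector is where the memory savings come from, and a discrete token per region is easier to inspect than a continuous feature.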

Paper: CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects

Authors: Nick Heppert, Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Rares Andrei Ambrus, Jeannette Bohg, Abhinav Valada, Thomas Kollar

Details: Thursday, June 22nd, 2023, 4:30 pm PDT

Abstract: We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves a comparable reconstruction accuracy to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder, we infer the 3D shape, 6D pose, size, joint type, and the joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference time is fast and can run on an NVIDIA TITAN XP GPU at 1 Hz for eight or fewer objects present. While only trained on simulated data, CARTO transfers to real-world object instances. Code and evaluation data is available at
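The "3D IOU50" in the metric above refers to counting a predicted box as correct when its 3D intersection-over-union with the ground truth is at least 0.5. The sketch below computes IoU for axis-aligned boxes and applies that threshold. It is a simplified illustration, not the paper's evaluation code: real benchmarks may use oriented boxes, and mAP additionally involves ranking detections by confidence. The names `iou_3d` and `is_match` are hypothetical.

```python
from typing import Tuple

# Axis-aligned box: (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Tuple[float, float, float, float, float, float]


def volume(box: Box) -> float:
    """Volume of a box; degenerate boxes have zero volume."""
    return (
        max(0.0, box[3] - box[0])
        * max(0.0, box[4] - box[1])
        * max(0.0, box[5] - box[2])
    )


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = volume(a) + volume(b) - intersection
    return intersection / union if union > 0.0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as correct at IoU50 when IoU >= 0.5."""
    return iou_3d(pred, truth) >= threshold


truth: Box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
pred: Box = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
print(iou_3d(pred, truth), is_match(pred, truth))
```

In this toy case the boxes share half of each box's volume, yet the IoU is only one third, so the prediction does not count as a match at the 0.5 threshold.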

TRI-Sponsored Papers

The following TRI-sponsored papers will also be presented during the conference:

  • HexPlane: A Fast Representation for Dynamic Scenes. Ang Cao, Justin Johnson. Tuesday, June 20th, 2023, 10:30 am PDT.
  • Humans as Light Bulbs: 3D Human Reconstruction from Thermal Reflection. Ruoshi Liu, Carl Vondrick. Wednesday, June 21st, 2023, 4:30 pm PDT.
  • Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation. Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Greg Shakhnarovich. Wednesday, June 21st, 2023, 4:30 pm PDT.
  • Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data. Nilesh Kulkarni, Linyi Jin, Justin Johnson, David F. Fouhey. Thursday, June 22nd, 2023, 10:30 am PDT.
  • The ObjectFolder Benchmark: Multisensory Object-Centric Learning with Neural and Real Objects. Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu. Thursday, June 22nd, 2023, 10:30 am PDT.


