3D Scene Understanding: Open3DSG’s Open-Vocabulary Approach to Point Clouds

A CVPR Paper Review and Cliff’s Notes

Harpreet Sahota
Voxel51
4 min read · Jun 13, 2024

--

Understanding 3D environments is a critical challenge in computer vision, particularly for robotics and indoor applications.

The paper, Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships, introduces a novel approach for predicting 3D scene graphs from point clouds in an open-world setting. The paper’s main contribution is a method that leverages features from powerful 2D vision language models (VLMs) and large language models (LLMs) to predict 3D scene graphs in a zero-shot manner. This allows for querying object classes from an open vocabulary and predicting inter-object relationships beyond a predefined label set.

This research moves beyond traditional, predefined class limitations by leveraging vision-language models to identify and describe arbitrary objects and their relationships, setting a new standard for machine perception and interaction in complex environments.

The Problem

Current 3D scene graph prediction methods depend heavily on labeled datasets, restricting them to a fixed set of object classes and relationship categories. This limitation reduces their effectiveness in real-world applications where a broader and more flexible vocabulary is necessary.

Insufficiencies of Current Methods

  • Fixed Label Set: Traditional methods are confined to a narrow scope of training data, hindering their ability to generalize to unseen object classes and relationships.
  • Lack of Compositional Understanding: Existing 2D VLMs struggle with modeling complex relationships between objects, which is crucial for accurate 3D scene graph predictions.
  • Inflexibility: Supervised training with fixed labels cannot adapt to new or rare object classes and relationships, limiting the practical utility of the models.

The Solution

The paper proposes Open3DSG, an approach that learns 3D scene graph prediction without relying on labeled scene graph data. The method co-embeds the features of a 3D scene graph prediction backbone with the feature space of open-world 2D VLMs.
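To make the co-embedding idea concrete, here is a minimal PyTorch-style sketch of a distillation objective in the spirit of the paper: the GNN's 3D node and edge features are pulled toward the 2D VLM features aggregated from the posed frames. The function name, feature dimensions, and the simple additive loss are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (PyTorch) of the feature-distillation idea: 3D graph features
# are pushed toward the 2D VLM features computed from the posed RGB-D frames.
# The loss form and dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F

def distill_loss(node_feats_3d: torch.Tensor,   # (N, D_obj) from the GNN nodes
                 clip_feats_2d: torch.Tensor,   # (N, D_obj) aggregated OpenSeg/CLIP features
                 edge_feats_3d: torch.Tensor,   # (E, D_rel) from the GNN edges
                 blip_feats_2d: torch.Tensor    # (E, D_rel) aggregated BLIP features
                 ) -> torch.Tensor:
    # Maximize cosine similarity between 3D and 2D features for nodes and edges.
    obj_loss = 1.0 - F.cosine_similarity(node_feats_3d, clip_feats_2d, dim=-1).mean()
    rel_loss = 1.0 - F.cosine_similarity(edge_feats_3d, blip_feats_2d, dim=-1).mean()
    return obj_loss + rel_loss
```

Once training has aligned the two spaces, the 2D encoders can be set aside at inference and the distilled 3D features queried directly.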

How the Solution Works

Open3DSG distills knowledge from vision-language models into a Graph Neural Network (GNN) that operates on point clouds, RGB-D images, and camera poses. Object classes are inferred by computing the cosine similarity between open-vocabulary object queries and the distilled 3D node features. The inferred object classes and edge embeddings then serve as context for the Q-Former and LLM from InstructBLIP, which predict the relationships between pairs of objects.
  1. Initial Graph Construction: The method begins by constructing an initial graph representation from a 3D point cloud using class-agnostic instance segmentation.
  2. Feature Extraction and Alignment: Features are extracted from the 3D scene using a Graph Neural Network (GNN) and aligned with 2D vision-language features.
  3. Object Class Prediction: At inference time, object classes are predicted by computing the cosine similarity between the distilled 3D features and open-vocabulary queries encoded by CLIP.
  4. Relationship Prediction: Inter-object relationships are predicted by providing the distilled edge features and the inferred object classes as context to a large language model.
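Step 3 is essentially open-vocabulary retrieval against the distilled node features. Below is a rough sketch of how such a query could look using the open_clip library; the model variant, prompt template, and the classify_nodes helper are assumptions for illustration, and the distilled features are assumed to already live in CLIP's embedding space.

```python
# Sketch of the open-vocabulary query step: distilled 3D node features are
# compared against CLIP text embeddings of arbitrary object prompts.
# Model variant, prompt template, and helper name are illustrative assumptions.
import torch
import torch.nn.functional as F
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

def classify_nodes(node_feats_3d: torch.Tensor, queries: list[str]) -> list[str]:
    """Assign each 3D node the query with the highest cosine similarity."""
    tokens = tokenizer([f"a {q} in a scene" for q in queries])
    with torch.no_grad():
        text_feats = model.encode_text(tokens).float()
    # Normalize both sides so the dot product equals cosine similarity.
    nodes = F.normalize(node_feats_3d, dim=-1)
    texts = F.normalize(text_feats, dim=-1)
    best = (nodes @ texts.T).argmax(dim=-1)
    return [queries[i] for i in best.tolist()]

# Any vocabulary can be queried at inference time, with no retraining:
# labels = classify_nodes(distilled_node_feats, ["office chair", "monitor", "plant"])
```

The relationship step (4) works analogously: the inferred class names for a subject-object pair are inserted into a text prompt that, together with the distilled edge features, conditions InstructBLIP's Q-Former and LLM to generate the predicate.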

Improvements Introduced

For feature extraction, the top-k frames for object and predicate supervision are selected from the posed RGB-D images. Object crops are encoded with OpenSeg, and their features are aggregated over the projected 3D points. For predicates, object pairs are identified, the images are cropped at multiple scales, and features are computed with the BLIP image encoder and then aggregated. Finally, object and predicate features are fused across multiple views, as sketched after the list below.
  • Open-Vocabulary Predictions: The method can predict arbitrary object classes and relationships, not limited to a predefined set.
  • Zero-Shot Learning: The approach makes zero-shot predictions, generalizing to new objects and relationships without additional training data.
  • Compositional Understanding: The method enhances the ability to model complex relationships between objects by combining VLMs with LLMs.
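The multi-view fusion described above can be pictured with a small sketch: for each instance (or instance pair), keep the k frames in which it is most visible and fuse the per-frame 2D features into a single supervision target. Ranking frames by the number of projected points and using a visibility-weighted mean are simplifying assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of multi-view feature fusion: 2D features from the top-k
# frames in which an instance (or pair) is visible are fused into one target.
# The visibility-weighted mean is a simplifying assumption.
import torch

def fuse_multiview_features(per_frame_feats: torch.Tensor,  # (num_frames, D) 2D features per view
                            visible_points: torch.Tensor,   # (num_frames,) points projected into each view
                            k: int = 5) -> torch.Tensor:
    """Pick the k most-visible frames and average their features by visibility."""
    k = min(k, per_frame_feats.shape[0])
    topk = torch.topk(visible_points, k).indices
    weights = visible_points[topk].float()
    weights = weights / weights.sum().clamp(min=1e-6)
    # Visibility-weighted mean over the selected views.
    return (per_frame_feats[topk] * weights.unsqueeze(-1)).sum(dim=0)
```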

Why It’s Better

  • Detail and Realism: The method provides fine-grained semantic descriptions of objects and relationships, capturing the complexity of real-world scenes.
  • Efficiency: By aligning 3D features with 2D VLMs, the method achieves effective scene graph predictions without requiring extensive labeled datasets.
  • Computational Power: The approach leverages powerful existing models (like CLIP and large language models), enhancing its ability to generalize and perform complex reasoning tasks.

Key Contributions

  1. First Open-Vocabulary 3D Scene Graph Prediction: This paper presents the first method for predicting 3D scene graphs with an open vocabulary for objects and relationships.
  2. Integration of VLMs and LLMs: This approach combines the strengths of vision-language models and large language models to improve compositional understanding.
  3. Interactive Graph Representation: The method allows for querying objects and relationships in a scene during inference time.

Results

  • Experimental Validation: The method was tested on the closed-set 3DSSG benchmark, showing promising results in modeling compositional concepts.
  • Comparison with State-of-the-Art Methods: Open3DSG demonstrated the ability to handle arbitrary object classes and complex inter-object relationships more effectively than existing methods.

Final Thoughts

As a forward-looking system, Open3DSG delivers two main benefits:

  1. It enhances the expressiveness and adaptability of 3D scene graphs.
  2. It paves the way for a more intuitive machine understanding of complex environments.

With applications ranging from robotics to indoor scene analysis, the potential is vast. The improvements introduced by Open3DSG are significant because they enable a more flexible and detailed understanding of 3D scenes.

This can be particularly important for computer vision and robotics applications, where understanding complex scenes is crucial.

Will you be at CVPR 2024? Come by the Voxel51 booth and say hi!
