Describing 3D objects using Natural Language

Loci
8 min read · Sep 12, 2024


Creating machines that understand the physical world requires teaching them to make sense of many types of sensory inputs. This means being able to understand not just text and images, but also 3D objects. In part one of this blog post series, we argued why it is important to develop artificial intelligence (AI) systems that understand 3D information, and discussed some of the challenges that come with building such systems.

One way we can reason about 3D objects is by automatically describing them using natural language. While this may sound like a simple task, it is surprisingly intricate and comes with a few conceptual and practical challenges. Tagging and captioning are two important ways in which we can describe 3D assets using natural language, and in this blog post we will describe how we at Loci are training AI systems that can accomplish these tasks.

Tagging and Captioning

Captioning means automatically creating a short description of an asset. Tagging means creating a list of keywords or keyphrases that summarise the salient features of an asset.
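
As a concrete (and purely hypothetical) illustration, the metadata we would like to produce for a single asset might look something like the following, with one free-text caption alongside tags at different levels of granularity:

```python
# Hypothetical example of the text metadata we want to generate for one 3D asset.
asset_metadata = {
    "asset_id": "chair_0042",  # illustrative identifier, not a real asset
    "caption": (
        "A mid-century wooden armchair with a curved backrest "
        "and a dark green velvet seat cushion."
    ),
    "tags": [
        "chair", "armchair", "furniture",             # object-level tags
        "wood", "velvet", "green",                    # material and colour tags
        "mid-century", "curved backrest", "cushion",  # style and part tags
    ],
}
```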

The ability to automatically tag and caption a 3D asset is important for enabling text-based search at scale. Major companies across sectors that work with 3D assets, such as game development, animation and architecture, share a critical problem: making their asset libraries easily searchable in order to improve discoverability and enable asset reuse. Augmenting 3D assets with text metadata like captions and tags is one way to supercharge search engines for 3D asset libraries.

Automatically creating this kind of text metadata for 3D assets requires training an AI system that understands both 3D information and natural language by representing them together in the same domain. Training such a system comes with a few practical problems we need to consider:

  • Evaluation: The quality of tags and captions is highly subjective and may depend on the specific use case. For instance, we may require tags that are very granular and describe all the different parts of the 3D asset, or we may require them to describe the asset more holistically. We may or may not want tags that describe the style, texture or materials present in the asset. We may require long, detailed captions or short, succinct ones. Requirements that vary like this make both qualitative and quantitative evaluation of these systems very challenging.
  • Runtime and cost: Since we are interested in labelling a large number of assets at scale, these systems need to run in a reasonable amount of time on hardware that is not too costly in order to be practically useful. While scaling up the size of an AI model can improve its qualitative performance, it may become impossible to run it in a cost-efficient way. There is therefore typically a tradeoff between model performance and efficiency, and optimising models while preserving as much of their performance as possible becomes an important problem to solve.

Language Models for 3D data

It is quite natural to view tagging and captioning as generative language modelling tasks, since we need to produce a natural language output given a 3D input. This means that we can leverage recent advances in Large Language Models (LLMs) to solve these tasks. However, there is an important caveat: until recently, most LLMs like GPT-4 [1] or Llama [2] could only understand and generate text. Over the past year, significant advances have been made in language models that can also understand images and generate text based on them. These models are typically known as Vision Language Models (VLMs) [3]. Examples of VLMs include proprietary models such as GPT-4o from OpenAI and Claude from Anthropic, and open-source models such as LLaVA [4] or PaliGemma [5].

These models are trained by taking an image encoder model such as CLIP [6] that has been pre-trained on images, along with an LLM such as Llama-3 [7] that has been pre-trained on text, and combining them. Typically this is done by treating the output of the image encoder as ‘image tokens’ and concatenating them with text tokens in order to form the input to the LLM. Before concatenation, the image tokens are cast to the same representation space as the text tokens using a small trainable projection layer. The resulting model is trained using paired image-text data. This training is usually performed in multiple stages, where in each successive stage we gradually unfreeze different parts of the overall model while also increasing the quality of the training data [8]. This multi-stage training is primarily needed because the availability of high-quality image-text data is quite limited.
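
To make the mechanism concrete, here is a minimal PyTorch sketch of the projection-and-concatenation step described above; the dimensions and module are made up for illustration and do not correspond to any specific VLM:

```python
import torch
import torch.nn as nn

# A minimal sketch of how image tokens are fused with text tokens.
class ImageProjector(nn.Module):
    def __init__(self, image_dim=1024, text_dim=4096):
        super().__init__()
        # Small trainable projection that casts image features into the
        # LLM's token embedding space.
        self.proj = nn.Linear(image_dim, text_dim)

    def forward(self, image_features):
        return self.proj(image_features)

# Pretend outputs: 256 'image tokens' from a frozen image encoder and
# 32 embedded text tokens from the LLM's embedding table.
image_features = torch.randn(1, 256, 1024)
text_embeddings = torch.randn(1, 32, 4096)

image_tokens = ImageProjector()(image_features)                # (1, 256, 4096)
llm_input = torch.cat([image_tokens, text_embeddings], dim=1)  # (1, 288, 4096)
# llm_input is then fed to the LLM in place of ordinary token embeddings.
```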

We can use a similar approach to train a VLM that understands 3D assets. Assuming we have a pre-trained encoder for 3D assets and a pre-trained LLM, we can combine and fine-tune them in a style analogous to image-text VLMs in order to perform tasks like captioning and tagging. Open-source 3D VLMs have been trained that can perform diverse 3D tasks such as object captioning (PointLLM [9]), scene captioning (Scene-LLM [10]), bounding box generation (LiDAR-LLM [11]), embodied interaction (ShapeLLM [12]) and navigation (Agent3D-Zero [13]).
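
In this setup, the only architectural piece that changes relative to the image-text sketch above is the encoder. The toy example below (again with made-up dimensions, and far simpler than the encoders used by models like PointLLM) shows a point-cloud encoder turning raw points into a sequence of '3D tokens' that can go through the same projection-and-concatenation step:

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained point-cloud encoder: it maps N points with
# xyz coordinates to a fixed-length sequence of '3D tokens'.
class ToyPointCloudEncoder(nn.Module):
    def __init__(self, out_dim=1024, num_tokens=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, out_dim)
        )
        self.num_tokens = num_tokens

    def forward(self, points):          # points: (B, N, 3)
        feats = self.point_mlp(points)  # per-point features (B, N, out_dim)
        # Crude grouping: split the points into num_tokens chunks and
        # average-pool each chunk into a single token.
        chunks = feats.chunk(self.num_tokens, dim=1)
        return torch.stack([c.mean(dim=1) for c in chunks], dim=1)  # (B, num_tokens, out_dim)

points = torch.randn(1, 8192, 3)            # one point cloud with 8192 points
tokens_3d = ToyPointCloudEncoder()(points)  # (1, 128, 1024)
# tokens_3d would then be projected and concatenated with text tokens exactly
# as in the image-text sketch above.
```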

Challenges in creating a 3D VLM

While there exist open-source 3D VLMs that can generate text from 3D models, the performance and practical usefulness of such VLMs are limited by a few crucial factors. Addressing these factors is key to creating a state-of-the-art, production-ready 3D VLM that generates accurate tags and captions for 3D assets.

The quality of the 3D encoder, as well as the choice of 3D representation (point cloud, voxel, mesh, etc.), is a major bottleneck for the accuracy of the VLM. Intuitively, if the 3D encoder does not capture the salient features of the 3D asset well, then we have no hope of tagging or captioning the asset correctly. An ideal 3D encoder should capture both the high-level characteristics of the asset, such as its shape, as well as finer characteristics such as its texture and material. We will discuss 3D encoders in much greater depth in the upcoming part 3 of our blog post series.

An equally important factor affecting the quality of the VLM is the quality and quantity of the paired 3D-text data that it is trained on. As we mentioned in our previous blog post, public 3D datasets are an order of magnitude smaller than image datasets, and high-quality 3D-text data is even scarcer. This data scarcity can make training these models even trickier than training their image-text counterparts. At Loci, we are collecting large, high-quality datasets of 3D assets with tags and captions in order to train our 3D VLMs.

To build a production-viable model, another aspect we need to pay close attention to is the size and cost of the model. The number of parameters of the LLM that we choose largely dictates the total size of our model. Even a relatively small LLM such as Llama-3-8B (which has 8 billion parameters) takes up 32 GB of memory at full 32-bit precision, while a large one like Llama-3-70B takes as much as 280 GB! Performing efficient inference with these LLMs would require a GPU with an equivalent amount of VRAM, which can be prohibitive. However, strong LLMs today are trained to perform a large variety of general text understanding tasks, while we are interested in performance on a small subset of tasks like tagging and captioning. This means we can usually get away with using smaller LLMs (even on the order of 1–4B parameters). We can also use tricks like quantisation to reduce memory requirements by a large fraction. We have observed that quantising models down to 4 bits results in almost no loss in performance, while reducing the memory requirement by 8 times compared to full 32-bit inference.
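
The memory figures above follow from a simple calculation: number of parameters multiplied by bytes per parameter. A quick back-of-the-envelope sketch (weights only; activations and the KV cache add more on top):

```python
# Rough memory footprint of model weights at different precisions.
def weight_memory_gb(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("Llama-3-8B", 8e9), ("Llama-3-70B", 70e9)]:
    estimates = {f"{bits}-bit": f"~{weight_memory_gb(params, bits):.0f} GB" for bits in (32, 16, 4)}
    print(name, estimates)

# Llama-3-8B  -> ~32 GB at 32-bit, ~16 GB at 16-bit, ~4 GB at 4-bit (an 8x reduction)
# Llama-3-70B -> ~280 GB at 32-bit, ~140 GB at 16-bit, ~35 GB at 4-bit
```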

There are further tricks we can use to optimise the runtime of these models, such as prompt caching. Depending on the use case, the text prompt to the VLM might be exactly the same in every inference call. We can pass this prompt through the VLM once and store the intermediate results or ‘activations’ in a cache. Further calls to the model then only need to process the new parts of the input, like the 3D asset, improving inference runtime.
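
The sketch below illustrates the idea using the Hugging Face transformers API and a small text-only model, since the same KV-caching mechanism applies regardless of modality; the model name and prompt are placeholders, and the exact cache interface may vary between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: gpt2 stands in for the LLM inside a 3D VLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

fixed_prompt = "Describe the following 3D asset with a short caption and a list of tags:"
prompt_ids = tokenizer(fixed_prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Run the fixed prompt once and keep the cached activations (the KV cache).
    prompt_cache = model(prompt_ids, use_cache=True).past_key_values

    # Later calls only need to process the new, asset-specific tokens.
    asset_ids = tokenizer(" A wooden mid-century armchair.", return_tensors="pt").input_ids
    out = model(asset_ids, past_key_values=prompt_cache, use_cache=True)
```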

As the examples above show, these tricks let us optimise both the inference time and the memory requirements of a model while retaining its quality. At Loci, we place a strong emphasis on building efficient algorithms that can process large quantities of assets quickly without sacrificing performance.

Conclusion

The ability to automatically tag and caption 3D assets is an important stepping stone towards the broader goal of creating AI systems that can use language to interact with 3D models as naturally as we do. Models that can accomplish these tasks accurately and efficiently can address challenges that many sectors consistently face today, such as performing scalable search on vast libraries of 3D assets.

3D VLMs are a natural blueprint for solving these types of problems, combining the power and versatility of modern LLMs with encoders that can accurately represent a 3D asset. However, performing these tasks well is still quite challenging because the available 3D training data is limited, generative language task evaluation is highly subjective, and compute requirements can be prohibitive.

At Loci, we are working on building efficient 3D VLMs that perform accurately and reliably across a range of tasks like tagging and captioning. We are overcoming some of these challenges by building large, high-quality and diverse datasets of 3D assets with their associated metadata, by utilising a range of practical tricks to reduce the memory and runtime cost of these models, and by developing better evaluation techniques to assess their accuracy.

References

[1] Achiam, Josh, et al. “GPT-4 Technical Report.” arXiv preprint arXiv:2303.08774 (2023).

[2] Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).

[3] Bordes, Florian, et al. “An Introduction to Vision-Language Modeling.” arXiv preprint arXiv:2405.17247 (2024).

[4] Liu, Haotian, et al. “Visual Instruction Tuning.” Advances in Neural Information Processing Systems 36 (2024).

[5] Beyer, Lucas, et al. “PaliGemma: A versatile 3B VLM for transfer.” arXiv preprint arXiv:2407.07726 (2024).

[6] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.

[7] Dubey, Abhimanyu, et al. “The Llama 3 herd of models.” arXiv preprint arXiv:2407.21783 (2024).

[8] Laurençon, Hugo, et al. “Building and better understanding vision-language models: insights and future directions.” arXiv preprint arXiv:2408.12637 (2024).

[9] Xu, Runsen, et al. “PointLLM: Empowering large language models to understand point clouds.” arXiv preprint arXiv:2308.16911 (2023).

[10] Fu, Rao, et al. “Scene-LLM: Extending language model for 3D visual understanding and reasoning.” arXiv preprint arXiv:2403.11401 (2024).

[11] Yang, Senqiao, et al. “LiDAR-LLM: Exploring the potential of large language models for 3D LiDAR understanding.” arXiv preprint arXiv:2312.14074 (2023).

[12] Qi, Zekun, et al. “ShapeLLM: Universal 3D object understanding for embodied interaction.” arXiv preprint arXiv:2402.17766 (2024).

[13] Zhang, Sha, et al. “Agent3D-Zero: An Agent for Zero-shot 3D Understanding.” arXiv preprint arXiv:2403.11835 (2024).

