4 Key Trends in CVPR 2024

VESSL AI
Jul 1, 2024 · 6 min read

Even in the era of LLMs, there has been steady demand for computer vision research, much of it built on diffusion models. In recent years, image/video generation models, 3D reconstruction with NeRF, and multimodal learning have grown rapidly. CVPR 2024 was held in Seattle from June 17th to 21st, bringing together many computer vision researchers and practitioners to share their knowledge and vision. Our team at VESSL AI also attended CVPR 2024 and experienced these trends firsthand.

Here are the highlights from CVPR that caught our attention.

1. Increased Conference Size

CVPR has grown noticeably since 2016, despite a dip during the COVID-19 pandemic. CVPR 2024 finally surpassed pre-pandemic in-person attendance, setting a new record in CVPR history.

Additionally, research paper submissions increased by 25.96% compared to 2023, to 11,532 in total. These statistics underscore the ever-increasing interest in AI.

2. Image/Video Synthesis

Image/video synthesis and generation were among the most popular submission topics at CVPR 2024.

Since diffusion models have shown impressive image generation results, many researchers have focused on enhancing them. This was evident at CVPR 2024, where several submissions specifically addressed the limitations of diffusion models and explored new possibilities for generative models.

InstanceDiffusion adds more controllability to diffusion models. Instead of simply generating images from prompts, the model allows precise control of each instance within an image: users can specify each instance's location with bounding boxes, masks, points, or scribbles. This is made possible by the UniFusion module, which maps each instance's location and text prompt into feature space and integrates them as visual tokens; ScaleU, which recalibrates the main features and the low-frequency components of the U-Net's skip connections to maintain layout integrity; and the Multi-instance Sampler module, which provides enhanced control when generating multiple instances at once.
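To make the idea concrete, here is a minimal, hypothetical sketch of per-instance conditioning in the spirit of UniFusion: each instance's bounding box and caption embedding are projected into a shared token space and appended to the visual tokens. The layer shapes and names below are illustrative placeholders, not the paper's actual architecture.

```python
# Hedged sketch of per-instance conditioning (not InstanceDiffusion's code).
import torch
import torch.nn as nn

d_model = 64
box_proj  = nn.Linear(4, d_model)        # (x1, y1, x2, y2) -> token
text_proj = nn.Linear(32, d_model)       # per-instance caption embedding -> token
fuse      = nn.Linear(2 * d_model, d_model)

visual_tokens = torch.randn(1, 16, d_model)   # stand-in for latent image tokens
instances = [
    {"box": torch.tensor([0.1, 0.1, 0.4, 0.5]), "text": torch.randn(32)},
    {"box": torch.tensor([0.5, 0.2, 0.9, 0.8]), "text": torch.randn(32)},
]

instance_tokens = []
for inst in instances:
    b = box_proj(inst["box"])
    t = text_proj(inst["text"])
    instance_tokens.append(fuse(torch.cat([b, t])))   # one token per instance

# The conditioned sequence would then feed the diffusion model's attention layers.
tokens = torch.cat([visual_tokens, torch.stack(instance_tokens).unsqueeze(0)], dim=1)
print(tokens.shape)   # (1, 18, 64)
```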

DeepCache is a method that speeds up diffusion models while producing nearly lossless results. It exploits the structure of the U-Net, which comprises two branches: a main branch that computes expensive high-level features and a skip branch that carries cheap low-level features. Because the high-level features of adjacent denoising steps are very similar, DeepCache caches the main branch's output at certain steps and reuses it in the following steps, skipping most of the computation. As a result, images can be generated 2.3 times faster with Stable Diffusion v1.5 and 4.1 times faster with LDM-4-G.
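Below is a rough sketch of the caching idea, assuming a toy U-Net with one cheap skip branch and one expensive main branch. The module names, sizes, refresh interval, and update rule are placeholders for illustration, not DeepCache's actual implementation.

```python
# Minimal sketch of DeepCache-style feature reuse in a denoising loop.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=8):
        super().__init__()
        self.skip_branch = nn.Conv2d(ch, ch, 3, padding=1)   # shallow, cheap
        self.main_branch = nn.Sequential(                    # deep, expensive
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.out = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, x, cached_deep=None):
        low = self.skip_branch(x)
        # Reuse cached high-level features when available; otherwise pay for
        # the full main-branch computation.
        deep = cached_deep if cached_deep is not None else self.main_branch(low)
        return self.out(torch.cat([low, deep], dim=1)), deep

unet = TinyUNet()
x = torch.randn(1, 8, 32, 32)     # stand-in for a noisy latent
cache, refresh_every = None, 5    # recompute deep features every 5 steps

with torch.no_grad():
    for step in range(50):        # simplified denoising loop
        if step % refresh_every == 0:
            cache = None          # force a full forward pass at this step
        eps, cache = unet(x, cached_deep=cache)
        x = x - 0.01 * eps        # placeholder update, not a real sampler
```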

BIVDiff is a training-free framework for general-purpose video synthesis. It bridges task-specific image diffusion models with general text-to-video diffusion models, enabling effective video creation without additional training.

First, BIVDiff uses an image diffusion model to generate the video frame by frame. It then applies Mixed Inversion to hand the frame latents to the video model for more consistent results. Finally, it performs temporal smoothing with the video model to ensure smooth transitions across frames.
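A schematic sketch of that three-stage flow is shown below. The functions are stubs standing in for the image diffusion model, the inversion step, and the video diffusion model; the tensors, mixing ratio, and smoothing rule are purely illustrative, not the paper's implementation.

```python
# Schematic, stubbed sketch of a BIVDiff-like pipeline.
import torch

def image_model_per_frame(prompt, num_frames, h=64, w=64):
    # Stand-in for frame-by-frame generation with an image diffusion model.
    return torch.rand(num_frames, 3, h, w)

def mixed_inversion(frames, mix_ratio=0.5):
    # Stand-in for Mixed Inversion: blend the frames (which a real pipeline
    # would first invert into the video model's latent space) with fresh noise.
    noise = torch.randn_like(frames)
    return mix_ratio * frames + (1 - mix_ratio) * noise

def video_model_smooth(latents):
    # Stand-in for temporal smoothing with a text-to-video diffusion model;
    # here we simply average each frame with its neighbours.
    smoothed = latents.clone()
    smoothed[1:-1] = (latents[:-2] + latents[1:-1] + latents[2:]) / 3
    return smoothed

frames = image_model_per_frame("a corgi surfing", num_frames=8)
latents = mixed_inversion(frames)
video = video_model_smooth(latents)
print(video.shape)   # (8, 3, 64, 64)
```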

This design allows the image model to be chosen selectively based on the task at hand, providing flexibility and high efficiency. BIVDiff can handle various video tasks, including generation, editing, inpainting, and outpainting, demonstrating its versatility and general applicability.

3. 3D Vision

Since the advent of Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, research on creating 3D views from 2D images has been active. However, reproducing physical movement (motion synthesis, for example) has typically required extracting a mesh from the view generated by Gaussian Splatting and then rendering it with the desired motion. PhysGaussian was proposed to solve this: it produces 3D views with physical motion directly from 3D Gaussians, applying continuum mechanics principles and a custom Material Point Method (MPM) without any mesh rendering. It also supports basic movements and flexible control of the dynamics through material parameters.

Notable studies have also emerged in 3D mesh generation, not just 3D view synthesis. Wonder3D addresses the problem of efficiently generating 3D content from a single image. Existing single-image methods tend to produce meshes with poor quality and limited geometric detail. Wonder3D instead uses cross-domain techniques to generate multi-view normal maps and their corresponding color images, then fuses them into a high-quality mesh with a geometry-aware normal fusion algorithm. As a result, it can generate high-fidelity meshes from a single image in just 2–3 minutes.

4. Multi-modal Models

As LLMs have become a major trend, multimodal language models are also attracting significant attention. In particular, the release of many vision-language models (VLMs) highlights the growing need for proper evaluation. In response, the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark was released. MMMU covers 30 subjects, spanning disciplines such as art & design, business, and engineering, and 183 subfields. It also incorporates 30 types of heterogeneous images, such as charts, tables, and chemical structures. Moreover, unlike existing benchmarks, MMMU is designed to evaluate more complex aspects of perception and reasoning.

As mentioned earlier, although vision-language models (VLMs) are being actively studied alongside the rapid growth of LLMs, their improvements have not been as substantial as those of LLMs. InternVL identifies a few possible reasons: the vision encoder has not been scaled up sufficiently, the representations of the LLM and the vision encoder are not well aligned, and the connection between the two is inefficient. To address these issues, InternVL scales the vision encoder up to 6 billion parameters, trains it with a contrastive loss against an existing LLM, and integrates it through a large language middleware, QLLaMA. The approach achieved state-of-the-art performance across 32 visual-linguistic benchmarks.
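For intuition, here is a minimal CLIP-style contrastive alignment loss of the kind used to pull matching image and text embeddings together; the batch size, embedding dimension, and temperature below are placeholders rather than InternVL's actual training setup.

```python
# Minimal sketch of a symmetric image-text contrastive loss.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize embeddings and compute the pairwise similarity matrix.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img))          # i-th image matches i-th text
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

img_emb = torch.randn(4, 256)   # batch of image embeddings (placeholder)
txt_emb = torch.randn(4, 256)   # matching text embeddings (placeholder)
print(contrastive_loss(img_emb, txt_emb).item())
```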

VESSL for Academic

Our free academic plan is dedicated to helping graduate students and faculty members set up a SLURM-alternative job scheduler with zero maintenance overhead. Apply now to get access.

  • Run GPU-backed training jobs and notebook servers instantly
  • Integrate lab-wide clouds and on-premise clusters with a single command
  • Monitor GPU usage down to each node

At VESSL AI, we understand the evolving challenges in the field of computer vision and machine learning. By providing powerful tools and a supportive environment, we aim to empower researchers to overcome these hurdles, accelerate their experiments, and advance the state of the art in AI research.

Sanghyk Lee, ML Engineer

Kelly Oh, Growth Manager

TJ Park, Growth Intern
