6 Days, 5 Key Takeaways: Computer Vision and Pattern Recognition Conference 2022

Nisan Chiprut

Lightricks
Lightricks Tech Blog
10 min read · Jan 17, 2023


This June, I attended CVPR, an annual event which gathers the best researchers and practitioners of computer vision from around the world. It was my second year there, and I wanted to summarize its many highlights for my colleagues at Lightricks, and our wider community. This article is my perspective on the big things happening in computer vision research, according to my experience at this year’s CVPR.

I’ll try to collate current trends and emphasize the big, promising advances in the field, while staying somewhat “zoomed out”, in order to give you the bigger picture. I’ve also linked to lots of detailed, more closely focused articles, so if you’re interested in a specific subject, there should still be plenty for you to dive into.

The article is divided into five key takeaways, based on a one-hour lecture I presented to the Lightricks research group. So, if you’d like to squeeze six days of learning into one concise read, here are my top five takeaways.

THE TAKEAWAYS:

James Perlman tweet

Takeaway #1: If it does 3D reconstruction, it probably uses Neural Fields

NeRF is an algorithm that takes multiple images of the same scene and renders that scene from arbitrary new viewpoints. NeRFs model a 3D scene as a radiance field represented by a neural network. Once the network has fit the scene, the rest of the process is straightforward and usually handled by classical volume rendering techniques.
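To make that concrete, here’s a minimal sketch of the per-ray volume rendering step, assuming an `mlp` that maps 3D sample points to a color and a density (real implementations add positional encoding, hierarchical sampling and much more):

```python
# A minimal sketch of NeRF-style rendering for a single ray (not the official code):
# sample points along the ray, query an MLP for color and density, then
# alpha-composite the samples with the classic volume-rendering equation.
import torch

def render_ray(mlp, ray_origin, ray_dir, near=2.0, far=6.0, n_samples=64):
    # Sample depths uniformly along the ray.
    t = torch.linspace(near, far, n_samples)
    points = ray_origin + t[:, None] * ray_dir        # (n_samples, 3)

    # The field: an MLP mapping (encoded) 3D points to (rgb, sigma). Assumed interface.
    rgb, sigma = mlp(points)                          # (n_samples, 3), (n_samples,)

    # Classic volume rendering: alpha-composite the samples along the ray.
    delta = t[1:] - t[:-1]
    delta = torch.cat([delta, torch.tensor([1e10])])  # distance between samples
    alpha = 1.0 - torch.exp(-sigma * delta)           # opacity of each segment
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                           # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)        # final pixel color
```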

Over the last two years, a great deal of research has followed the original paper. A family of algorithms has now emerged, to the extent that, with so many variations, “Neural Field” is now a more fitting reference than the name “NeRF.”

A Neural Field is built of four main components which can be mixed and matched. Each combination has its own strengths and weaknesses. For more details see this survey paper (accompanied by this tutorial), which tries to organize this new emerging field. The researchers also built an awesome website, which should be bookmarked as your go-to reference for this subject.

For those looking to dive into the topic with a quick start, I’d recommend trying NeRFing yourself with nerfstudio. For the latest improvements, this repo also really shines, containing a combined implementation of three impressive papers: Mip-NeRF 360 (Oral), Ref-NeRF (Best Student Paper), and RawNeRF (Oral).

Lastly I want to shine a spotlight on two particularly outstanding papers:

1. Mind Blowing (but awfully slow)

Given plain videos as input, BANMo reconstructs a fully articulated 3D model. Learning 3D deformable models without a domain-specific prior (like faces) was science fiction just a few years ago. Today, this general-purpose, prior-free model delivers far more than one would expect. The downside is the training time: it takes about 3 days on a GPU to fit one model.

2. Pretrained and real time(-ish)

Compared to vanilla NeRF, EG3D is more efficient thanks to its feature extractor, and it doesn’t require multi-view data thanks to its generative-adversarial training. In this space of real-time-ish neural fields, the concept of tri-planar features, where a 3D point is queried by sampling three axis-aligned feature planes, keeps showing up.
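Here’s a rough sketch of the tri-plane query as I read it (not EG3D’s actual code): project each 3D point onto the three axis-aligned planes, sample each feature plane, and aggregate the results before decoding them into color and density with a small MLP.

```python
# Hypothetical tri-plane lookup: project a point onto xy, xz and yz feature planes,
# sample each plane, and sum the sampled features.
import torch
import torch.nn.functional as F

def query_triplane(planes, xyz):
    """planes: (3, C, H, W) feature planes for the xy, xz and yz planes.
       xyz: (N, 3) query points in [-1, 1]^3."""
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]  # the three projections
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                                # (1, N, 1, 2) sample grid
        f = F.grid_sample(plane[None], grid, align_corners=True)   # (1, C, N, 1)
        feats.append(f[0, :, :, 0].T)                              # (N, C)
    return sum(feats)  # aggregated per-point feature, fed to a small MLP decoder
```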

Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut

Takeaway #2: Pretrained self-supervised transformers are awesome

Last year, a line of work on very impressive self-supervised vision models based on the transformer architecture got us one step closer to the unification of vision tasks. Specifically, we saw an advancement in the ability to use expensive pre-trained networks for cheap fine-tuning on downstream tasks. One important example is DINO, which was trained on unlabeled data and showed an impressive ability to robustly and consistently segment objects in video. More importantly, its extracted features achieved SOTA results at the time of publishing when fine-tuned on various tasks.
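As a concrete example of “expensive pre-training, cheap fine-tuning”, here’s a hedged sketch of a linear probe on top of the released DINO backbone; the 10-class head is just a placeholder for whatever downstream task you have in mind.

```python
# A minimal linear-probe sketch on frozen DINO features.
import torch
import torch.nn as nn

# Pre-trained self-supervised DINO ViT-S/16 backbone released by the authors.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False          # the expensive part stays frozen

head = nn.Linear(384, 10)            # 384 = ViT-S embedding dim; 10 classes is an assumption

def forward(images):
    with torch.no_grad():
        feats = backbone(images)     # (B, 384) global image features
    return head(feats)               # only the cheap head is trained
```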

Last year at CVPR, I was very interested in style transfer, but realized that most of these works still used VGG features to compute the loss. We were talking about this in 2021, and VGG had been out since 2014. Didn’t we have any better feature extractors? A few tried to replace VGG with ResNets [1], but the main motivation was the efficiency of the feature extractor, not its quality. This year, I finally saw a few papers with promising titles: Image Style Transfer with Transformers and CLIPstyler: Image Style Transfer With a Single Text Condition. However, if you look closely, you’ll see that the loss is still computed on VGG features.

Losses based on VGG features have two main issues. The first is locality: due to the CNN architecture, the features carry no global information, which causes structure loss or “style content leakage”. Second, realistic image style transfer doesn’t work so well…
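For reference, this is roughly what such a VGG feature loss looks like (the layer indices are one common choice, not canonical). Note that every feature only sees a local receptive field, which is exactly the locality issue above.

```python
# A generic VGG perceptual loss sketch: compare intermediate VGG activations
# of the output and target images.
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

def vgg_feature_loss(x, y, layers=(3, 8, 17, 26)):
    """x, y: (B, 3, H, W) ImageNet-normalized images."""
    loss = 0.0
    for i, block in enumerate(vgg):
        x, y = block(x), block(y)
        if i in layers:
            loss = loss + torch.nn.functional.mse_loss(x, y)
        if i >= max(layers):
            break
    return loss
```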

In Splicing ViT, the authors showed how to use DINO-ViT alone as a loss to tackle these two issues. Most impressively, they finally use a SOTA network architecture for semantically transferring the visual appearance of one natural image to another. I recognise that this is not style transfer per se, but it feels close enough to frame them together.

The bigger picture here is the thorough analysis that’s been performed on the DINO-ViT features. The authors show how image structure and style can be reconstructed from the features. These properties are not only impressive but also straightforward and easy to understand. A follow-up paper by the same group showed how to further exploit DINO-ViT features for co-segmentation, point correspondence and more.

Text2Mesh: Text-Driven Neural Stylization for Meshes

Takeaway #3: We’re only scratching the surface with Multimodal contrastive learning

With the rise of the transformer, various modalities are processed similarly, via small tweaks on top of the transformer. As @russelljkaplan said, “If you can tokenize it, you can train a large language model for it.” One important breakthrough came from the ability to combine multimodal data in the same model. CLIP was probably the first to show a tremendous improvement on zero shot vision tasks.

Simply plugging in CLIP instead of other feature extractors achieves state-of-the-art results across various domains (Visual Question Answering, Image Captioning, and Vision-Language Navigation). Its huge success is also owed to the huge amount of open-source data available, but it appears that CLIP training itself still leaves plenty of room for improvement, and new CLIP-like models keep popping up thanks to the versatile open_clip repo.

In CLIP, we train two encoders: an image encoder and a text encoder, responsible for outputting the image feature vector and the text feature vector respectively. These encoders don’t share parameters or anything fancy; they’re only connected through the dataset and the loss. One datapoint in a CLIP training set is a tuple containing an image and its corresponding caption, and the loss maximizes the similarity between the features of each corresponding image-text pair while minimizing its similarity with every non-corresponding pair in the batch.
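A minimal sketch of that objective, under my reading of the CLIP paper: a symmetric cross-entropy over the batch’s image-text similarity matrix.

```python
# Contrastive CLIP-style loss over a batch of paired image/text features.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # Normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # logits[i, j] = similarity between image i and caption j in the batch.
    logits = image_features @ text_features.T / temperature
    targets = torch.arange(len(logits))  # matching pairs sit on the diagonal

    # Maximize similarity for matching pairs, minimize it for all other pairs.
    loss_i = F.cross_entropy(logits, targets)    # image -> text
    loss_t = F.cross_entropy(logits.T, targets)  # text -> image
    return (loss_i + loss_t) / 2
```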

How is CLIP performance measured? One common way of doing it is by measuring Imagenet zero-shot accuracy. To understand this metric we need to unpack two terms:

  1. What is zero-shot accuracy on a dataset? It’s the accuracy of a model that hasn’t seen a single example from that dataset during training.
  2. How do we zero-shot classify images using CLIP? Let’s say we want to classify images of dogs versus cats. We compute the features of the texts “cat” and “dog” separately using the text encoder, and pick whichever is more similar to the image features extracted by the image encoder (see the sketch after this list).
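Here’s a sketch of that cat-vs-dog setup using the open_clip repo mentioned above; the model name, pretrained tag and image path are just example placeholders.

```python
# Zero-shot classification with a CLIP-like model from open_clip.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # the class whose text embedding is most similar to the image wins
```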

In LiT (blog) they found a recipe for training CLIP-like models that surpasses the ImageNet zero-shot accuracy of the original CLIP by a large margin. They achieved this simply by starting from a pre-trained image encoder and freezing (“locking”) it during training. They also delivered a pretty big improvement over models trained on billions of text-image pairs. So, even though increasing the amount of training data improves performance, that data is still not clean enough to beat starting from a pre-trained, supervised image encoder.
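A hedged sketch of that recipe as I understand it, reusing the clip_loss sketch above; image_encoder, text_encoder and loader are placeholders for your own pre-trained vision backbone, text tower and image-caption dataset.

```python
# LiT-style training sketch: lock the image tower, contrastively train the text tower.
import torch

for p in image_encoder.parameters():
    p.requires_grad = False                      # "locked" image tower

optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

for images, captions in loader:                  # loader yields image-caption pairs
    with torch.no_grad():
        img_feat = image_encoder(images)         # frozen, pre-trained features
    txt_feat = text_encoder(captions)            # only this side is learned
    loss = clip_loss(img_feat, txt_feat)         # same contrastive loss as above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```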

Arbitrary-Scale Image Synthesis

Takeaway #4: There are a bunch of creative ways to use positional embedding

Let’s look at some of the most exciting topics emerging in the last year. Neural Fields, diffusion models and large language models: what do they all have in common? You might say “transformers”, but that is only true most of the time. There’s another small but very important concept that lies at the core of each of these new advancements: positional embedding.

Positional embedding is a solution to the permutation invariance of the attention layer (without it, a transformer is about as good as a bag of words!), but it’s also a solution to a fundamental issue with vanilla coordinate-based vision tasks.

There are many variants of positional embedding, but the essential idea is to inject position information into a token: a single word’s position in a sentence, for example, or a tuple of coordinates representing an image patch’s spatial position. The embedding is usually done by lifting the position to a higher dimension using sinusoidal functions. See this video for more details.
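As a minimal sketch, here’s one common variant (the NeRF-style frequency encoding; transformer implementations use the same trick with differently spaced frequencies):

```python
# Lift a scalar position (a token index, or a normalized coordinate)
# into a higher-dimensional sinusoidal feature vector.
import torch

def positional_embedding(pos, dim=16):
    """pos: (N,) positions; returns (N, 2*dim) sinusoidal features."""
    freqs = 2.0 ** torch.arange(dim)          # geometrically spaced frequencies
    angles = pos[:, None] * freqs[None, :]    # (N, dim)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

emb = positional_embedding(torch.linspace(0.0, 1.0, 128))  # e.g. 128 coordinates
```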

Another cool application of positional embedding is described in the following line of work, but in order to understand it, let’s first focus on StyleGAN (a generative model which, given a random input, generates a random image from a specific domain, e.g. faces, bedrooms, etc.).

One important property of StyleGAN is its controllability. Its input is composed of three parts: (1) a style vector, (2) a structure noise vector and (3) a constant matrix (which is upsampled using convolutions into the output image). If you haven’t seen it yet, this video shows how these controls interpolate smoothly between faces.

Another of StyleGAN’s interesting properties is that it is structure aware. It doesn’t output faces with a single eye or three ears, even though it has no global information: it is just a CNN with local connections. In Positional Encoding as Spatial Inductive Bias in GANs, the authors hypothesize that this property stems from StyleGAN’s constant input matrix. They changed the training of StyleGAN to use explicit positional embedding instead of a learnable constant matrix, and unlocked new controllability over the generated images: arbitrary scaling.
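A rough sketch of the idea (not the paper’s code), reusing the positional_embedding helper from above: encode an explicit coordinate grid instead of learning a constant, so the same generator can be queried on grids of other sizes and scales.

```python
# Replace a learned constant input with a sinusoidal encoding of a coordinate grid.
import torch

def coord_grid_embedding(h, w, dim=16):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    emb_y = positional_embedding(ys.flatten(), dim)   # reuses the helper above
    emb_x = positional_embedding(xs.flatten(), dim)
    return torch.cat([emb_y, emb_x], dim=-1).view(h, w, 4 * dim).permute(2, 0, 1)

start = coord_grid_embedding(4, 4)    # plays the role of StyleGAN's constant input
bigger = coord_grid_embedding(8, 8)   # the same generator can now be fed a larger grid
```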

Arbitrary-Scale Image Synthesis extended their work and used positional embedding to train across multiple scales and translations. They explicitly constrained the generator output to match the input positional embedding.

The result is a scale and translation invariant generative model which is also generalizable to other non-linear warps of the positional embeddings.

Dataset Distillation by Matching Training Trajectories

Takeaway #5: Scale, scale, scale, distill, distill, distill

There’s no doubt that scaling still hasn’t reached its peak in machine learning, whether in data, compute or model parameters. Tom Goldstein’s tweet is a great summary of how things got out of control there.

So the bad news, at least for the non-giants, is that large models are here to stay for the near future, i.e. the best models will cost a lot of money — both during training and for inference.

It has long been believed that this big redundancy isn’t needed; we just don’t know how to train the smaller models effectively and efficiently from scratch. To get them, we have to train a large model first and then shrink it in some way.

Among the three leading methods for decreasing model size (distillation, quantization and pruning) distillation is arguably the most mysterious approach, where a small student network learns from a larger teacher model.

In Knowledge distillation: A good teacher is patient and consistent, they show that it’s not only possible to shrink the model size significantly, but also to improve the student’s performance over the teacher. The approach is super straightforward and provides a framework where overfitting doesn’t seem like a possibility. The bad news here is that distillation takes a lot of time, and by extension, lots of money. Make sure to check out this repo, which also bundles many other big vision models.
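As a generic illustration (not the paper’s exact recipe), a distillation loop boils down to matching the teacher’s soft predictions; teacher, student, loader and optimizer are placeholders here.

```python
# Generic knowledge-distillation sketch: the small student is trained to match
# the big teacher's soft predictions, with no ground-truth labels needed.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between the teacher and student distributions.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

teacher.eval()
for images in loader:                  # "consistent": both models see the exact
    with torch.no_grad():              # same (heavily augmented) crops
        t_logits = teacher(images)
    s_logits = student(images)
    loss = distillation_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()                    # "patient": this runs for a very long time
    optimizer.step()
```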

In conclusion

The field of 3D reconstruction is dominated by the use of neural fields, which have proven to be superior to classic methods in many use cases, and have been applied to a variety of tasks that classic solutions could not solve naturally. This field is still in its early stages and is likely to continue to grow and evolve, as new ideas are developed and implemented. The good ideas are still scattered, but as the field matures, we’ll see much better and more consistent results.

Meanwhile, large language models continue to have a significant impact on the vision community, with the two dominant approaches being self-supervised learning on unlabeled image data and multimodal learning that combines different modalities. It’s amazing to see how easily ideas from LLMs and foundation models are applied to vision, following the lead of natural language processing. This suggests that the need for scale in vision tasks is only going to increase, and it is unlikely to go away anytime soon as the field continues to advance.

Create magic with us
We’re always on the lookout for promising new talent. If you’re excited about developing groundbreaking new tools for creators, we want to hear from you. From writing code to researching new features, you’ll be surrounded by a supportive team who lives and breathes technology.
Sounds like you? Apply here.
