The Best of Vision
#CVPR2022 “unofficial” poster awards
After a whirlwind of a week at IEEE’s Computer Vision and Pattern Recognition (CVPR) conference, I wanted to use this essay to quickly take stock of all of the amazing research that was shared by the computer vision community. So many state-of-the-art thresholds have been leapfrogged into new territories of capability in the last year that it has become difficult to keep up. That is an understatement.
Only a short decade ago, neural networks had just begun breaching the boundary beyond shallow architectures into the realm of deep learning, enabled by parallelized matrix multiplications on GPU hardware and proven beyond any doubt on what was then considered a challenging ImageNet classification benchmark. The dominant architecture for vision applications was (and to some extent remains) the “convolutional neural network”, built specially for evaluating pixel-based image representations.
In the time since, vision-based learning began branching into all kinds of subdomains. For example, beyond the most basic classification applications, researchers began seeking to:
- segment images by interpretable characteristics
- locate boundaries of detected objects
- identify specific people or objects (like facial recognition)
- find a navigation policy for an image-based sensory environment by reinforcement learning
- generate new images that resemble characteristics of a training corpus
- translate images between styles (like the difference between a photograph, a Monet, or a Picasso)
- interpolate between images or extrapolate along latent vectors (like a “smile vector”)
- generate figures that resemble real people (like actors)
- extend static image learning to time series (video) applications
- extend two dimensional algorithms to 3D applications
- use image interpretations to operate machinery (like self driving cars)
In more recent channels of inquiry, a noticeable trend has been breaching the boundaries between modalities and shrinking the scales of training data requiring annotation. For example, new capabilities include:
- “transformer” architectures applied to vision tasks (resembling attention modules originally applied to natural language applications)
- self-supervised learning for extracting signal from an unlabeled training corpus (by way of contrastive objectives, or masking and predicting infill to entries)
- leveraging foundation models trained on large corpora (either for few-shot learning applications or fine-tuned to a micro domain)
- using input in one modality to generate content in another (e.g. text to image, or image to text)
- leveraging denoising diffusion models to generate images at enhanced resolutions and without “mode collapse”
- estimating 3D representations from two-dimensional imagery
I attended the CVPR conference as somewhat of an outsider, without a PhD or enrollment, without a trained convolutional network on my resume, and without any real contacts established in the visual learning domain. I am surprised they let me in. Part of the reason for my interest was associated with my channels of “research” (blogging) targeting domains like deep learning theory and tabular modalities. (Although common architecture conventions applied in image and tabular modalities are quite different, at their core images are just tensors, and so are dataframes, just tensors of different dimensions.) Plus, since I had gotten so much value from attending research conferences online throughout the pandemic, I didn’t want to get left behind as everything went back to face to face. Throw in a soft spot for the New Orleans music scene and yeah, it just felt right, added expense be damned. (The aggregate price difference between virtual and in-person attendance was the hardest part to justify; I decided to follow my gut.)
In fact it had been so long since I attended an in-person research conference that I had kind of forgotten how. From all of the online events I had grown accustomed to pre-recorded keynotes and presentations, basically empty vendor expos, an occasional “social” (more presentations), and otherwise the chance to Zoom chat and ask questions with the various researchers presenting their work (the best part). Although the online social aspects were certainly lacking, the ability to swim through oceans of state-of-the-art research was without comparison. So many papers and posters, each relaying independently verified novelty and significance. It inspired me to follow this “career path” (if one could call it that) and attempt to document some research of my own.
Having already published 3(!) essays inspired by the conference, it struck me that I had omitted the best part in my write-ups: the vast field of unique research, offered to the crowds succinctly summarized in poster-length snapshots and presented through an original author’s summary and answered questions, basically the whole reason for the show. So yeah, I hereby present the “unofficial” poster awards, selected based on the simple criterion that they happened to catch my eye as I was walking by, for whatever reason. My tastes tend towards the universal, and as such I expect very important research specific to the image modality may have been omitted herein because I didn’t see the relevance to my unique tastes. In other words, these highlights, however “prestigious” they may be, merely represent a snapshot of this blogger’s attention at various times of the week. Many of these works I have yet to study in depth, and I am basing my takeaways on the displayed posters or presentations thereof. I offer them to the reader in the expectation that the published researchers, many of them enrolled graduate students, are usually thrilled to receive increased visibility of any kind for what in some cases represents the culmination of months or even years of hard work. If any original authors are unhappy with their representation, please feel free to contact the blogger as you see fit and I will be happy to strike an entry. Yeah, so without further ado, I present to you the “unofficial” CVPR poster awards.
Day 1: The Sunday Awards
Posters on Sunday were kind of out in the main hallway; I think these may have been affinity group workshop posters? I just saw a bunch of posters and took a quick walking tour, so yeah, I am assuming they were affiliated with CVPR in some fashion. This being my first day at the conference after a fun weekend of sightseeing (and photographing several of my favorite tourist haunts), I was kind of just playing it by ear. Long story short (too late), there was one paper that caught my eye, mostly due to relevance, or at least adjacency, to my tangent interests in the tabular modality.
AnoDDPM: Anomaly Detection with Denoising Diffusion Probabilistic Models using Simplex Noise by Julian Wyatt, Adam Leach, Sebastian M. Schmon, and Chris Willcocks
The discussion here might give you an idea of what to expect from the rest of this essay. I did not spend enough time with the poster or the paper to authoritatively summarize the possibly important contributions related to vision anomaly detection utilizing denoising diffusion models (in my defense, I had just learned about such models that same day). The reason the poster caught my eye involved reference to aspects of work dating back to the 1980s, specifically what is known as Perlin noise and Simplex noise, which were new concepts to me at least. Apparently Perlin noise was developed for visual effects and originally served the purpose of adding more realistic textures to computer-generated graphics (Perlin even won an Oscar for that use case). Simplex noise was developed several years later to improve on this same algorithm. As kind of a hand-wavy sketch, these noise profiles utilize randomized gradients at distinct frequency and amplitude ranges. It appears this poster found that they could adapt conventions of Perlin noise to denoising diffusion models (which traditionally are built on foundations of Gaussian noise profiles) by applying multiple frequencies of Perlin or Simplex noise, which results in a Gaussian approximation. (I was mostly trying to imagine what it might mean to adapt noise injection use cases from my prior work to the image modality; I am still kind of fuzzy on the details of what that might look like.) Cool, well congratulations on your award Wyatt et al.
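For the curious, here is a toy sketch (my own hand-wavy construction, not the paper’s code) of that multi-frequency idea: sum several frequency-doubled, amplitude-scaled layers of a cheap value noise, and the summed profile trends toward a Gaussian-looking shape by the central limit theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_noise_1d(n_points, frequency, rng):
    """Cheap stand-in for Perlin-style noise: random values on a coarse
    lattice, linearly interpolated up to n_points samples."""
    lattice = rng.uniform(-1, 1, frequency + 1)
    x = np.linspace(0, frequency, n_points)
    i = np.minimum(np.floor(x).astype(int), frequency - 1)
    t = x - i
    return (1 - t) * lattice[i] + t * lattice[i + 1]

def octave_noise_1d(n_points, octaves=6):
    """Sum frequency-doubled, amplitude-halved octaves; adding several
    independent layers pushes the distribution toward a Gaussian shape,
    which is the bridge back to standard diffusion model assumptions."""
    total = np.zeros(n_points)
    for k in range(octaves):
        total += 0.5 ** k * value_noise_1d(n_points, 2 ** (k + 2), rng)
    return total

noise = octave_noise_1d(4096)
```

A real implementation would use 2D gradient (Perlin) or simplex lattices rather than this 1D value noise, but the octave-summing structure is the same.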
Day 2: The Monday Awards
On day 2 I didn’t get around to viewing any posters at all, so hereby award all Monday posters collectively. Good job guys.
Day 3: The Tuesday Awards
Tuesday was the first day of the main proceedings, in which the “official” paper awards were presented. (These unofficial awards are non-overlapping with the official ones.) Additionally, selected authors from the rest of the field were given oral presentations in blocks grouped by theme, a recurring agenda item through the remainder of the conference in concurrent sessions (choose-your-own-adventure style) as the program worked through a large portion of the ~2,000 accepted papers from a competitive field. (Only about 1 in 4 submitted papers was accepted after a prolonged peer review process.) The field included wide international representation, although the specter of multi-week quarantine periods after returning from international travel meant that some countries’ participants mostly opted for a virtual attendance track. (The roughly 10,000 registered attendees were close to equally split between in-person and virtual attendance; read into that what you will.) From a diverse field I found two posters with adjacency to my interests, so yeah, congrats on the awards guys:
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score Matching by Kwanyoung Kim, Taesung Kwon, and Jong Chul Ye
So it’s kind of funny how I ended up at this poster. I’ve been building software to inject stochasticity in the context of the tabular modality, and I kind of just perk up my ears whenever I see papers with “noise” in the title, basically a little bit of keyword bias. It turns out “denoising” as used here is becoming of mainstream interest in the vision modality for applications like denoising diffusion generative models (or so I presume, since the poster didn’t mention it explicitly). In common practice, denoising diffusion models first learn a noise model by additively perturbing images, and then repurpose that same model for purposes of denoising, in most cases leveraging properties of the Gaussian distribution. One of the highlights of this poster was extending support of denoising models to other types of univariate distributions by leveraging a meta distribution that is parameterized to potentially translate between each. Specifically I am referring to their use of the “Tweedie distribution”, which can be parameterized to represent multiple forms like Gaussian, Poisson, Gamma, etc. Although traditionally Tweedie can only be represented in a closed form for special cases, this poster demonstrated what they referred to as the “saddle point approximation”, which achieves blind noise model estimation with only one extra inference step, and yeah, long story short: abra cadabra (paraphrasing). Nice work guys.
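As a quick illustration of why Tweedie makes such a convenient meta distribution (this is textbook material, not the poster’s method): its variance function Var(Y) = φ·μ^p collapses to the familiar special cases as the power parameter p varies.

```python
import numpy as np

def tweedie_variance(mu, p, phi=1.0):
    """Tweedie family variance function: Var(Y) = phi * mu**p.
    The power parameter p selects the family member:
    p=0 Gaussian, p=1 Poisson, p=2 Gamma, p=3 inverse Gaussian."""
    return phi * np.asarray(mu, dtype=float) ** p

mu = np.array([0.5, 1.0, 2.0, 4.0])
gaussian_var = tweedie_variance(mu, p=0)  # constant, independent of mean
poisson_var = tweedie_variance(mu, p=1)   # equal to the mean
gamma_var = tweedie_variance(mu, p=2)     # grows as the mean squared
```

Estimating p and φ from data is the hard part, and that is where the saddle point approximation the authors describe comes in.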
Delving into the Estimation Shift of Batch Normalization in a Network by Lei Huang, Yi Zhou, Tian Wang, Jie Luo, and Xianglong Liu
As a student of probability I found this work quite interesting because of its close inspection of a common component of neural architectures across basically all modalities. I am referring to the use of batch normalization, a normalization applied to the activations across a neural network layer during training and inference. An often unstated assumption takes for granted that the test data passed to inference will match the distribution properties of the training data; however, in cases of real-world distribution drift, batch norm characteristics realized during training may impact inference performance. One of the key findings is that this mismatch of population characteristics can accumulate through layers, and be particularly impactful in the deeper layers of a network. The poster suggested that alternating with a form of batch norm applied on a basis independent of batch characteristics can help to offset this degradation through layers. Kind of reminds me of some related work, but I can’t remember what. Yeah, good job Huang et al.
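A toy numpy sketch (mine, not the authors’) of the basic estimation shift: freeze batch norm statistics on training data, then feed shifted test data, and the “normalized” activations come out biased.

```python
import numpy as np

rng = np.random.default_rng(1)

def batchnorm_inference(x, running_mean, running_var, eps=1e-5):
    """Inference-time batch norm: normalize with statistics frozen
    from training, not from the current batch."""
    return (x - running_mean) / np.sqrt(running_var + eps)

# running statistics estimated on training data drawn from N(0, 1)
train = rng.normal(0.0, 1.0, 100_000)
running_mean, running_var = train.mean(), train.var()

# shifted test data from N(0.5, 1): the frozen statistics no longer
# match, so the "normalized" activations come out biased, and this
# estimation shift can compound layer by layer in a deep network
test = rng.normal(0.5, 1.0, 100_000)
out = batchnorm_inference(test, running_mean, running_var)
```

In a deep network each layer feeds its shifted output to the next layer’s frozen statistics, which is how the mismatch accumulates with depth.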
Day 4: The Wednesday Awards
Things started to pick up a little on the poster front as I got into this second day of the main conference. I think I was finding my stride, so to speak (that, and starting to fall into a better routine of breakfast portions and coffee servings at the generously catered venue; yeah, the food was great, my compliments to the conference center, now if someone could just teach them a new gluten-free recipe :). The awesome day of browsing culminated in a festival of sorts, with a brass band, a dance floor, and no shortage of Cajun foods to choose from. Kind of like Mardi Gras, only without the stifling crowds and bad decisions (I’m more of a Jazz Fest guy tbh ;). Many more poster highlights to choose from, let’s get to it:
Input-Level Inductive Biases for 3D Reconstruction — Towards 3D inference with general purpose models by Wang Yifan, Carl Doersch, Relja Arandjelovic, João Carreira, and Andrew Zisserman
One of the reasons that a specialty like computer vision can have thousands of paper submissions year after year is that researchers have a tendency to overfit to an application, or particularly to a benchmark. I use the word overfit in the abstract sense as opposed to the formal definition, referring to creating unique architectures for each benchmark, with a tweaked set of inductive biases to align with that specialty. This poster sought to identify a more common set of inductive biases that can be integrated as part of the data input, realizing a more generalizable form of data encoding for real-world applications, particularly those involving 2D cameras recording 3D space. By formalizing a data encoding that appends onto pixel values commonly available measurements like camera position, direction, and angular coordinates, it makes all of those nifty use cases needing 3D interpretations that much easier, without needing to reinvent the wheel every time you want to drive down a new road, so to speak. My advice to the presenter is that she abstain from clicking together those sparkling ruby slippers three times, lest she miss out on all of the gold bricks to come. Congrats on the award.
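Here is a hypothetical sketch of what appending geometric measurements onto pixel values might look like (my own guess at the flavor of the encoding, not the authors’ implementation): per-pixel camera ray directions derived from assumed pinhole intrinsics, where `focal`, `cx`, and `cy` are made-up parameters for illustration.

```python
import numpy as np

def append_ray_directions(image, focal, cx, cy):
    """Hypothetical input-level encoding: append each pixel's normalized
    camera-ray direction to its RGB channels, so a general-purpose model
    receives geometry as extra input channels instead of needing a
    bespoke 3D architecture."""
    h, w, _ = image.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)  # pixel row/column grids
    rays = np.stack([(u - cx) / focal,       # pinhole camera model
                     (v - cy) / focal,
                     np.ones((h, w))], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    return np.concatenate([image, rays], axis=-1)  # (h, w, 3+3) channels

img = np.zeros((4, 6, 3))  # tiny placeholder RGB image
encoded = append_ray_directions(img, focal=5.0, cx=3.0, cy=2.0)
```

The appeal is that the downstream model stays generic: the geometry travels with the input rather than being baked into the architecture.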
Fairness-aware Adversarial Perturbation Towards Bias Mitigation for Deployed Deep Models by Zhibo Wang, Xiaowei Dong, Henry Xue, Zhifei Zhang, Weifeng Chiu, Tao Wei, and Kui Ren
Ok, sheepish grin, here we are back at another case where I saw a keyword from my work, “perturbations”, and wanted to check out an interpretation for the image modality. So yeah, this one is actually kind of interesting. When vision models are evaluating scenes with sensitive attributes that may be bias magnets (you know, the whole grab bag of gender, ethnicity, race, etc.), well, the idea here is that we have all of these nifty generative tools to shift image samples along latent vectors, so why not, as a form of preprocessing, neutralize all of those sensitive attributes that we may not want a model to account for? Sure, it may be a little cumbersome for real-time applications, but in the context of some sensitive domains with static images it could be a valid way to mitigate bias. And that is something that we all want. Thanks for your work Wang et al.
FedCorr: Multi-Stage Federated Learning for Label Noise Correction by Jingyi Xu, Zihan Chen, Tony Quek, and Kai Fong Ernest Chong
Federated learning is one of those subdomains that traditionally is of most interest to, how should I say this, tech industry behemoths. It refers to models trained in a distributed manner, where data could be sourced from edge devices in a piecemeal fashion and models trained in a privacy-preserving, distributed manner. Kind of like if you had over a billion smartphone users and wanted to find a way to update a model from the lot of ’em. I mean, there are probably smaller applications as well; I expect these are the driving use cases attracting research though. This poster sought to find ways to mitigate label noise, which, if you’ve been following the “data-centric AI” space, you may have seen can sizably increase the scale of training data required for comparable performance. One of the interesting pieces here, and sort of validating to some of this blogger’s related work, was the discussion surrounding the impact of noisy labels on intrinsic dimensions of the data set, or particularly what they refer to as “local intrinsic dimension”, which I expect is associated with locally derived measures corresponding to the federated learning segments in application. Basically, label noise was found to increase the local intrinsic dimensions. Yeah, I’ll keep chasing alternate keyword interpretations as long as you keep throwing ’em at me. Thanks for your poster Xu et al.
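For a feel of what an intrinsic dimension estimate even is, here is a sketch using the classic Levina-Bickel maximum likelihood estimator from k-nearest-neighbor distances (one common estimator; the poster’s exact local recipe may well differ).

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension from k-nearest-neighbor
    distances (a common estimator; not necessarily the paper's)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)
    knn = D[:, 1:k + 1]  # nearest neighbors, skipping the self-distance 0
    # per-point estimate: inverse mean log distance ratio to the k-th neighbor
    local = (k - 1) / np.log(knn[:, -1:] / knn[:, :-1]).sum(axis=1)
    return local.mean()

rng = np.random.default_rng(5)
# 300 points on a 2-D plane randomly embedded in 10-D ambient space
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10))
dim = mle_intrinsic_dim(X)  # should land near 2 despite the 10-D embedding
```

The intuition carries over: noisy labels smear samples off their clean manifold, so locally estimated dimensions like this one go up.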
Non-isotropy Regularization for Proxy-based Deep Metric Learning by Karsten Roth, Oriol Vinyals, and Zeynep Akata
This poster worked on two levels. From a state-of-the-art standpoint, it introduced non-isotropy regularization, an efficient normalizing-flow-based form of data translation that converges noisy sample groupings to a more uniform embedding space with more diverse clusterings, all while preserving local structures. And what might you say is this useful for? Well, that’s the other thing I learned from the poster, which is sort of more basic but also kind of eye opening. “Deep metric learning” refers to learning an embedding space in which distances serve as a similarity measure, which among other uses can quantify domain shift. Examples of domain shift could be, for self-driving cars, being deployed in a new city, or for medical applications, the shift from a patient micro-demographic that wasn’t seen in training. Yes, on its own that sounds nifty, but what’s really cool is that the conventions of deep metric learning can be deployed even in cases where the shifted domain is novel to the algorithm, like cases where we’re faced with samples that fall outside of the training distribution. Apparently deep metric learning is a simple solution for deriving an expected inference validity in the presence of domain shift. And non-isotropy regularization makes it work better and converge faster with small overhead. Good work, thanks for sharing Roth et al. “Deep metric learning”, perhaps you should write that down somewhere.
Day 5: The Thursday Awards
I was really getting into the swing of things by day 5. I had a newly stocked streaming playlist to enjoy in the hours between (jazz, blues, traditional, and new wave to name a few; the party never stopped). I started to branch out to a few live music venues (Frenchmen Street is top notch), and yeah, just began to feel a little less like a tourist and more like a local (just a little, anyway). Every Uber ride was an adventure, but you know, in a kind of kooky, colorful way. My landlord was an artist in his own right; every time I looked I saw something new. Very subtle. And it was with some dismay that I started to realize the week was past midway. Fortunately there was no shortage of posters to keep me occupied.
The Neurally-Guided Shape Parser: Grammar-based Labeling of 3D Shape Regions with Approximate Inference by R. Kenny Jones, Aalia Habib, Rana Hanocka, and Daniel Ritchie
Ok, let me just get this comment out of the way. It was almost weird just how many posters detailed their craft by way of demonstrating semantic segmentation of chairs. You know, like here are the legs, here is the back, seat, and so on. Apparently chairs are the new MNIST. This poster was no exception. What stuck with me, though, was the presenter’s discussion surrounding hierarchical labels, which in the case of chairs could be like: a foot is a part of a leg is a part of a base is a part of a chair. Something like that anyway. And I learned that apparently in mainstream practice, training is often conducted to treat such hierarchies without any internal representation, simply labeling each segment individually and deriving label-specific models. Strikes me as wasteful. This poster at least tried to improve on that convention by probabilistically inferring semantic segmentations along a hierarchy. Keep at it guys, some day we shall jointly train chairs -> furniture -> antiques -> etc. Just kidding. Thanks for the poster Jones et al.
Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be Trusted for Neural Architecture Search without training? by Jisoo Mok, Byunggook Na, Ji-Hoon Kim, Dongyoon Han, and Sungroh Yoon
Ok, it took me a few minutes to figure out what was going on here. Neural architecture search refers to iterating through neural network architectures to optimize for an application, such as to make a model run faster on specific hardware, to lower the memory requirements of inference, all of that kind of stuff. The really interesting insight of this work was that by adapting evaluated architectures to an infinite-width equivalent form, they could leverage the neural tangent kernel approximation to rank architectures without training (ok, that’s an oversimplification, I think they did a small number of epochs, but still nearly equivalent). Was their conclusion that the kernel could be trusted? Why yes, it was. Thanks for the suggestion Mok and team.
Towards Robust Vision Transformer by Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue
So this was a very practical study. I’m probably oversimplifying here, but in brief the authors conducted a survey of various components of common transformer architectures and sought to evaluate their sensitivity to targeted deviations, or in other words, how robust architecture components were to variations in configuration. (For whatever reason, this kind of reminded me of work I had seen from the fastai crew for evaluating endurable parameter sensitivities across applications.) Yeah, just a solid work that could become useful to practitioners. Thanks for the contribution Mao and friends.
The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization by M. Jehanzeb Mirza, Jakub Micorek, Horst Possegger, and Horst Bischof
Apparently this paper was so interesting I had to blog about it twice. I mean, sort of; I really just came across it when it was being presented at the Vision for All Seasons workshop and then again here on the main floor. It is interesting work, addressing a new way to adapt models in cases of domain shift, specifically by tuning just the batch norm parameters in lieu of a full set of weights. By adapting batch norm instead of weights, the model can adapt without that whole “catastrophic forgetting” thing that challenges most model tuning conventions. Which may be important. Further, they even demonstrate adapting in an online mode by updating on a single epoch, for which the poster benchmarks showed positive impact. Yeah, so any time you see a new way of doing things that is so immediately intuitive and probably really simple to implement, it probably was worth a second writeup after all. Good work Mirza and crew.
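A minimal sketch of the flavor of the idea (my own simplification, not the authors’ exact update rule): nudge only the batch norm running statistics toward an unlabeled, shifted target domain while every learned weight stays frozen.

```python
import numpy as np

def adapt_bn_stats(running_mean, running_var, batch, momentum=0.1):
    """One online adaptation step: move only the batch norm running
    statistics toward the target domain; all learned weights stay
    frozen, so nothing is catastrophically forgotten."""
    new_mean = (1 - momentum) * running_mean + momentum * batch.mean(axis=0)
    new_var = (1 - momentum) * running_var + momentum * batch.var(axis=0)
    return new_mean, new_var

rng = np.random.default_rng(2)
mean, var = np.zeros(8), np.ones(8)  # statistics from the source domain
for _ in range(200):  # stream of unlabeled target batches, shifted by +2
    mean, var = adapt_bn_stats(mean, var, rng.normal(2.0, 1.0, (64, 8)))
```

After enough batches the running mean tracks the shifted domain, so normalization at inference matches the data the model actually sees.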
Scaling Vision Transformers by Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer
Ok, shameless self-promotion time; I hope the authors may forgive this indulgence. Partly inspired by seeing how much progress was being made in empirical studies surrounding phenomena in the overparameterization regime, I collected a few thoughts in a related matter that basically attempted to lay out the foundations of a new way of thinking about the mechanics of overparameterization. True story. It is available on my blog as Dimensional Framings of Overparameterization, and I owe a thank you to Neil Houlsby and crew for lighting a match to get that one started. Regarding this poster, it rigorously demonstrated that the benefit of balancing the scale of data against model parameterization toward an optimal training cost vs. performance tradeoff, previously demonstrated by others for large language models, is now verified for the image modality. Which is intuitive. I wish I had the resources to conduct empirical experiments of this nature. Party on Zhai and company.
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation by Qingyan Meng, Mingqing Xiao, Shen Yan, Yisen Wang, Zhouchen Lin, and Zhi-Quan Luo
Will keep this brief. Meng and collaborators have identified a new means to translate spiking neural networks to a differentiable form for training. Spiking networks, which resemble aspects of organic neuron firing patterns, may someday become more prevalent on edge devices due to efficiency and speed. File this under: probably worth checking back in down the road. Thanks for your work guys.
Towards Multimodal Depth Estimation From Light Fields by Titus Leistner, Radek Mackowiak, Lynton Ardizzone, Ullrich Köthe, and Carsten Rother
Ok, truth be told, my notes weren’t very good on this one; I am just going to paraphrase aspects of the abstract, I suppose (this must be the “lazy training” phenomenon I read about for overparameterized models :). Vision applications in real-world use cases like self-driving cars often will try to adapt a 2D image to a depth map, like to estimate, say, how many car lengths are between you and that pedestrian walking down the street. That sort of thing. And in practice these types of depth maps can be impaired by various kinds of interference, like reflections on glass surfaces or puddles, semi-transparent objects, that sort of thing. This poster sought to demonstrate that by lifting the requirement of estimating only one depth value per pixel, a secondary depth estimate could provide a posterior distribution that in turn makes the primary depth estimation more accurate. Oh, and they’ve got a shiny new dataset to go with it. Pretty neat stuff Leistner and friends. Keep up the good work.
Knowledge distillation: A good teacher is consistent and patient by Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larissa Markeeva, Rohan Anil, and Alexander Kolesnikov
Noting the amusingly passive-aggressive paper title, it was funny that this poster was not only belittling some ornery PhD adviser, but also their own peers. Apparently machine learning practitioners don’t know what is good for them. We build them these hugely parameterized models that perform better; no, let me restate that, that perform a lot better. And the only tradeoff is memory overhead and latency. A small price to pay. But somehow usage statistics suggest that even with these benefits, ML practitioners, when given a choice, will often accept the tradeoff and opt to download low-memory solutions. Tsk tsk tsk. Fortunately Beyer’s team is hard at work improving model distillation techniques, making those big models more manageable, even if they themselves are unmanageable to their PhD adviser. Good luck to ’em.
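For reference, here is the classic soft-target distillation loss that this line of work builds on (the standard Hinton-style recipe, not this paper’s specific “consistent and patient” training schedule).

```python
import numpy as np

def softened(z, T):
    """Temperature-softened softmax over logits."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's softened predictions and the
    student's; the T**2 factor keeps the gradient scale comparable
    across temperature choices."""
    p_teacher = softened(teacher_logits, T)
    log_p_student = np.log(softened(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean() * T ** 2

teacher = np.array([[2.0, 0.0, -1.0]])
aligned = distillation_loss(teacher, teacher)      # student matches teacher
misaligned = distillation_loss(-teacher, teacher)  # student disagrees
```

The paper’s contribution is largely about how this loss is applied: aggressive augmentation shared between teacher and student, and very long (“patient”) training runs.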
Day 6: The Friday Awards
It is hard to overstate just how rapid the progress in the field of artificial intelligence has been in the last few years. Every benchmark is being left in the dust. It is only accelerating. These conferences almost serve as historic records; by the time a paper makes it to publication through months of peer review it is often outdated, already superseded by more advanced work. I have heard it jokingly suggested that at the current exponential growth rate of paper submissions to machine learning research conferences, we can extrapolate that within the next fifteen years every human on earth will be submitting their own conference write-ups. You know what? Perhaps that would not be such a bad thing. If there is one lesson that this blogger has learned from the practice of writing papers, it is that the rigor of a formal writeup is the single most productive channel for addressing problems and building new solutions. Writing papers should not just be the currency of academics; every business and institution could learn from these practices. And guess what, these PhD candidates are very good at it (perhaps you should hire a few). Forget the citation counts; they are partly a winner-take-all metric. It is not just what you teach others when writing a paper, it is what you teach yourself in the process. Just consider what these presenters learned along the way:
Reversible Vision Transformers by Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, and Jitendra Malik
Although I am admittedly an outsider to applications in computer vision, this work strikes me as potentially important. Reversible vision transformers rework the architecture so that each layer’s activations can be recomputed from the following layer’s outputs during backpropagation, meaning intermediate activations don’t need to be cached in memory during training. In other words, a little extra computation is traded for a large memory saving, which allows training deeper and larger models on the same hardware. I don’t know if it has been done before, but you know what, either way it appears to be a good idea. Thanks for your poster Mangalam and company.
SelfD: Self-Learning Large-Scale Driving Policies From the Web by Jimuyang Zhang, Ruizhao Zhu, and Eshed Ohn-Bar
So much of machine learning is contingent on suitable training data. Sourcing data is hard, and sourcing annotated data is considerably harder. The latest paradigms of language models have succeeded in bypassing the whole labeling convention altogether; they take raw data and derive models in a self-supervised manner. I don’t know if there is yet any clean solution for such a massively scaled approach in time-series vision applications (video). That being said, for specialty applications, like that of automobile navigation, there are working methods even today. This poster demonstrated a means to take aggregated YouTube content of recorded driving videos, which are becoming ever more prevalent as more cars carry cameras, and model driving policies from the feeds. This blogger speculates such policies could eventually be adapted to rank drivers by their demonstrated prowess, and in a few years, who knows, perhaps such YouTube-derived models can be adapted within mainstream self-driving platforms. We shall see. Thanks for your work Zhang et al.
DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection by Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Yifeng Lu, Denny Zhou, Quoc V. Le, Alan Yuille, and Mingxing Tan
Every car manufacturer is pursuing self-driving vehicles in some form or another, and nearly all of them have a unique sensor stack. Commonly, sensors like cameras and LiDAR are tracked separately, to be fused within a learning algorithm, which in the process must overcome positional discrepancies of sensors, time lags of acquisition, and other imprecisions. The DeepFusion work adapts to such obstacles by way of a cross-attention aggregation between modalities. Rule of thumb: if you haven’t yet tried solving an application with the integration of attention modules, hurry up and do so. You may be surprised. Thanks for the work Li and company.
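A toy sketch of cross-attention fusion between the two sensor streams (my own minimal version, not the DeepFusion architecture): camera tokens form the queries, LiDAR tokens the keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(camera_feats, lidar_feats, W_q, W_k, W_v):
    """Camera tokens query the LiDAR tokens: each fused camera feature
    becomes a softly weighted aggregate of the LiDAR evidence most
    relevant to it, sidestepping exact sensor alignment."""
    Q = camera_feats @ W_q
    K = lidar_feats @ W_k
    V = lidar_feats @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    return softmax(scores) @ V

rng = np.random.default_rng(3)
d = 16
cam = rng.normal(size=(10, d))    # 10 camera feature tokens
lidar = rng.normal(size=(32, d))  # 32 LiDAR feature tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = cross_attention(cam, lidar, Wq, Wk, Wv)
```

The soft weighting is the point: small calibration or timing misalignments between sensors get absorbed by the learned attention rather than needing exact geometric correspondence.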
RIDDLE: Lidar Data Compression With Range Image Deep Delta Encoding by Xuanyu Zhou, Charles R. Qi, Yin Zhou, and Dragomir Anguelov
Ok, I didn’t follow along with everything, but the gist I got was that LiDAR sensors, which are often used in self-driving vehicles, carry a dense memory overhead; after all, they are in effect collecting real-time 3D models of the surrounding environments. Which weighs more, a painting or a sculpture? It is a riddle, and RIDDLE seeks to compress LiDAR modalities down to the mere weight of fine-grained sand. In other words, LiDAR images are translated to quantized images, are translated to residual maps, are translated to entropy-encoded bit streams. So light that they may fit in the glovebox of your self-driving vehicle. Good job Zhou and team.
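A toy sketch of the delta-encoding flavor (mine, not the paper’s exact pipeline, which predicts deltas with a learned model): quantize range values to a fixed precision and store only per-pixel differences, which on smooth surfaces are tiny and would entropy-code cheaply.

```python
import numpy as np

def delta_encode(ranges, precision=0.01):
    """Quantize range values to a fixed precision, then keep only the
    difference from the previous pixel; smooth surfaces yield small
    residuals that an entropy coder compresses cheaply."""
    q = np.round(np.asarray(ranges) / precision).astype(np.int64)
    return np.concatenate([q[:1], np.diff(q)])

def delta_decode(deltas, precision=0.01):
    """Invert the encoding: cumulative sum, then rescale."""
    return np.cumsum(deltas) * precision

row = np.array([10.00, 10.01, 10.03, 10.02, 25.40])  # one range-image row
encoded = delta_encode(row)   # [1000, 1, 2, -1, 1538]
decoded = delta_decode(encoded)
```

Note the one large residual at the depth discontinuity (the jump to 25.40): real scenes have few of those relative to smooth surface runs, which is what makes the scheme pay off.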
Eigenlanes: Data-Driven Lane Descriptors for Structurally Diverse Lanes by Dongkwon Jin, Wonhui Park, Seong-Gyun Jeong, Heeyeon Kwon, and Chang-Su Kim
If you want to think like an AI, you kind of need to set aside all of your organic biases and consider that every aspect of your surroundings, from the air you breathe to the chair in which you sit, has a potentially equivalent representation as a set of numbers in a high-dimensional tensor. As a silly example, if it helps to imagine, picture that scene in the Matrix when Neo starts to see all of the glowing green numbers in the subway: a subway tensor. That is closer to what the algorithm sees anyway; after all, training and inference at their most fundamental level basically amount to exotic forms of ginormous matrix multiplications. Linear algebra. How can we leverage such insights in the high-value application of lane detection for self-driven vehicles, as considered by this poster? The work of Jin et al seeks to tame the diverse and at times faint signals of where exactly the lane boundaries lie on novel roadways. After all, the paint may be faded, there could be puddles or mud on the road, Kramer may even have decided to widen the lanes by painting over the lines. There are endless edge cases. This poster demonstrated that by performing a type of linear algebra translation, inferred lane tensor representations can be projected into an eigenspace more interpretable by ML. What’s that, you can’t cross Matrix and Seinfeld jokes in the same passage? Perhaps someone should tell this blogger to stay in his lane.
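To make the eigenspace idea concrete, here is a toy NumPy sketch that builds an "eigenlane" basis via SVD from lanes sampled as lateral offsets at fixed image rows. The sampling scheme and shapes are this blogger's invention for illustration, not the authors' pipeline:

```python
import numpy as np

# Toy lane matrix: each column is a lane, sampled as lateral offsets at
# fixed image rows. Real-world lanes are structurally diverse but tend to
# be well approximated by a low-rank basis.
rows = np.linspace(0, 1, 20)
lanes = np.stack([a * rows + b * rows**2
                  for a, b in [(1.0, 0.0), (0.5, 0.3), (-0.8, 0.1), (0.2, -0.4)]],
                 axis=1)                      # shape (20 samples, 4 lanes)

# SVD yields an "eigenlane" basis; keep only the top-k components.
U, S, Vt = np.linalg.svd(lanes, full_matrices=False)
k = 2
basis = U[:, :k]                              # (20, k) eigenlane descriptors

# Any lane is now summarized by just k coefficients in that eigenspace.
coeffs = basis.T @ lanes[:, 0]
reconstruction = basis @ coeffs
error = np.linalg.norm(reconstruction - lanes[:, 0])  # effectively zero here
```

Compressing each candidate lane to a handful of eigenspace coefficients is what makes the downstream detection and clustering tractable even for oddly shaped lanes.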
Coopernaut: End-to-End Driving With Cooperative Perception for Networked Vehicles by Jiaxun Cui, Hang Qiu, Dian Chen, Peter Stone, and Yuke Zhu
Ok first as a quick acronym clarification, I think the cooper as used here isn’t pronounced in the “Hanging with Mr Cooper” sense (sorry very obscure reference), it is more about cooperative learning. As in how do we teach self driving cars to talk to each other. Can we just let them honk the horn like people do? I mean we could, and that will likely be the way self driving vehicles will interact with people while we still share the road, but once an AI can speak to another AI, why limit ourselves to one obnoxious beeping sound in our vocabulary? The researchers of Coopernaut are trying to teach cars how to converse. In real time. Which means we need a formalized vocabulary, wireless standards, the whole nine yards. After all wouldn’t you want your robot to know exactly what the robot in the next lane intends to do next? Good job Cui and team.
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation by Ziad Al-Halah, Santhosh Kumar Ramakrishnan, and Kristen Grauman
The claims being presented in this poster are on their surface kind of subtle, but if you read between the lines, potentially quite significant. Reinforcement learning applications commonly require training models unique to each subdomain, e.g. the difference between navigating a kitchen, navigating a roadway, picking up an object, etc. Al-Halah and collaborators have apparently demonstrated potential for zero-shot knowledge transfer between domains for such reinforcement-learning-led policies. One model that can be applied to any domain. That is huge. How, you might ask, do they accomplish this? I don’t know. They mention a form of semantic search that can be generalized across image-goal navigation tasks. The specifics are probably extremely valuable, but you’d have to dig into the paper’s fine print; please consider this a homework assignment. Thanks to the researchers for sharing their work, and apologies if I did not do justice to the details with this introduction.
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks by Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai
Ok, in case you couldn’t tell from the last few write-ups, this was the part of the conference when the blogger started to get out of his depth. Students of machine learning are first trained on the fundamentals of supervised learning. You have a training corpus specific to some application and modality: say tabular data, computer vision, time series, all that kind of jazz. The future generations of learning will overcome such simplified conventions by way of integrated assessment across modalities. The same model may perceive and interpret visual, audio, or other cues. Kind of makes you wonder what we will do with a conference dedicated to computer vision. Zhu et al present here a multi-modal framework for few-shot or even zero-shot learning, which refers to models that can adapt to new applications or domains with very few or even zero examples demonstrating how to do so. This is the kind of capability that has already been demonstrated by large language models in the natural language domain; now it is being extended both here and by others across modalities. This is what the future of AI may look like. I hope you are paying attention.
AlignMixup: Improving Representations by Interpolating Aligned Features by Shashanka Venkataramanan, Ewa Kijak, Laurent Amsaleg, and Yannis Avrithis
In traditional supervised learning workflows, data augmentation refers to enlarging a data set by duplicating training samples with representational deviations that increase the diversity of sample representations. For example, in the image modality, additional images may be generated by cropping, mirroring, tinting, noising, or other deviations, which in some cases may be leveraged for purposes of semi-supervised learning. Venkataramanan and collaborators demonstrate here a new type of data augmentation that leverages forms of generative modeling to add image representations that interpolate between image classes. Let’s say your data includes lions and tigers; that’s right, this approach will augment your data with depictions of ligers. The benefits include flattened class representations with smoother decision boundaries in the latent space. Thanks for the poster.
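For reference, the plain (unaligned) mixup interpolation step can be sketched as follows; AlignMixup's actual contribution of aligning feature maps before interpolating is not shown here, and the "lion"/"tiger" arrays are placeholder data:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, seed=0):
    """Vanilla mixup: interpolate a pair of samples and their one-hot labels.

    The mixing weight lam is drawn from a Beta(alpha, alpha) distribution,
    so most mixes stay close to one of the two originals.
    """
    lam = np.random.default_rng(seed).beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

lion = np.full((4, 4), 0.9)            # stand-in "lion" image
tiger = np.full((4, 4), 0.1)           # stand-in "tiger" image
y_lion, y_tiger = np.array([1.0, 0.0]), np.array([0.0, 1.0])

x_mix, y_mix = mixup(lion, y_lion, tiger, y_tiger)
assert np.isclose(y_mix.sum(), 1.0)    # the soft "liger" label still sums to one
```

The soft label is the important part: the model is trained to output intermediate confidences on intermediate images, which is where the smoother decision boundaries come from.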
Globetrotter: Connecting Languages by Connecting Images by Dídac Surís, Dave Epstein, and Carl Vondrick
In this era of massively scaled natural language models trained on basically an entire internet’s worth of data, one emerging convention is that such pre-trained foundation models may effectively serve as a bridge between modalities. The work of Surís et al actually tackles this question in the other direction. Consider the wide variations in structure and form of natural languages spoken on different continents. I can’t even remember how many unique characters there are in written Chinese languages, and then you go to Europe and every word has a gender. The point is that perhaps the focus of this conference, computer vision, may be considered an even more universal language than language itself. After all, a rose by any other name would still be a rose. Thanks for the poster.
Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning by Richard J. Chen, Chengkuan Chen, Yicong Li, Tiffany Y. Chen, Andrew D. Trister, Rahul G. Krishnan, and Faisal Mahmood
The blogger is probably going to demonstrate some domain naivety with this paragraph, but I would offer that the hierarchical vision transformers being discussed here have some resemblance to the type of tiered granular aggregations that take place in the convolutional architecture. After all, attention modules have many benefits, but scaling across extended vector spaces is not one of them. Chen and collaborators note the relevance to applications in medical imaging, where images may have just as much diagnostic information content at the tiniest scales of features as at the largest, and thus attention modules may have benefits over convolutions for aggregating across scales since they don’t have pooling between layers. (Consider that the overall shape of a tissue sample may be just as interesting as the cellular structure to support a diagnosis.) Today medical imaging, tomorrow who knows, perhaps the Webb space telescope. After all, the differences between the large and the small mainly depend on a frame of reference.
Integrating Language Guidance Into Vision-Based Deep Metric Learning by Karsten Roth, Oriol Vinyals, and Zeynep Akata
I noted in an earlier award the potential usefulness of deep metric learning for identifying cases of domain shift in the image modality, even under conditions not found in a training corpus. Deep metric learning. Man, we should all be using this. Roth and company are looking to build on these metrics by leveraging large language models for semantic interpretations. Deep metric learning. Look it up. Thanks for your poster guys. Deep metric learning.
Omnivore: A Single Model for Many Visual Modalities by Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, and Ishan Misra
The term modality is kind of overloaded. In a broad sense, modalities may refer to the distinction between images, text, audio, etc. In the context of the conversations at a computer vision venue like CVPR, modality boundaries may actually be more fine grained, like say the difference between cameras, LiDAR, radar, all of that kind of stuff. Each has its own conventions for data representation, fidelity, and sensing envelope. A common challenge for computer vision tasks is how to integrate or adapt between modalities in some application. The researchers of Omnivore, in addition to eating a lot of steaks apparently, offer a solution that pre-trains on multimodal data sets to enable downstream adaptations that only need fine-tuning in a single modality. Well done Girdhar et al, this poster was rare to medium rare.
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration by Ramya Hebbalaguppe, Jatin Prakash, Neelabh Madan, and Chetan Arora
If you’re like me and super interested in methods to improve probabilistic calibration properties, you may also be interested in this work. Although presented at a computer vision conference, I expect this approach could potentially adapt to all sorts of classification tasks. Hebbalaguppe et al offer a new approach that integrates an added regularizer into the loss function, representing a combination of the average softmax activation per mini-batch and a binary 0/1 activation, with a corresponding tunable parameter. The integration appears to improve probabilistic calibration at inference with only a small performance tradeoff. I didn’t follow along with the entire discussion, but the presenter noted that she had conducted benchmarks against label smoothing, another common convention for such use cases. I would suggest that the authors perhaps consider extending those label smoothing benchmarks to include the Automunge library’s “fitted smoothing” option (sorry, shameless self plug). Yeah, it was a neat poster, congrats on the award.
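As I understood the poster, the regularizer compares per-class average softmax activations against per-class average 0/1 label indicators within a mini-batch. Here is a hedged sketch of that idea in NumPy (my rendering of the gist, not the paper's exact formulation; the weight `beta` is the tunable parameter mentioned above):

```python
import numpy as np

def calibration_aux_loss(probs, labels, num_classes):
    """Mini-batch calibration penalty: per class, compare the mean softmax
    activation against the mean 0/1 label indicator, then average the
    absolute gaps. A well-calibrated model should show small gaps.
    """
    onehot = np.eye(num_classes)[labels]                  # (N, K) 0/1 indicators
    gap = np.abs(probs.mean(axis=0) - onehot.mean(axis=0))
    return gap.mean()

probs = np.array([[0.7, 0.3],    # softmax outputs for a mini-batch of 3
                  [0.6, 0.4],
                  [0.2, 0.8]])
labels = np.array([0, 0, 1])
aux = calibration_aux_loss(probs, labels, 2)
# total_loss = cross_entropy + beta * aux, with beta a tunable hyperparameter
print(round(aux, 4))  # 0.1667
```

Because the penalty operates on batch averages rather than individual predictions, it nudges confidences toward observed accuracy without forcing any single prediction to change class.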
EMOCA: Emotion Driven Monocular Face Capture and Animation by Radek Daněček, Michael J. Black, and Timo Bolkart
Translating from real human expressions and emotions to virtual 3D avatars is a challenging task. That being said, this poster was uninteresting and I was not impressed with the presenters.
AP-BSN: Self-Supervised Denoising for Real-World Images via Asymmetric PD and Blind-Spot Network by Wooseok Lee, Sanghyun Son, and Kyoung Mu Lee
In applications of image denoising, mainstream practice often applies a self-supervised blind-spot network, which has an inductive bias towards spatially uncorrelated, zero-mean noise (as would be expected when dealing with synthetic noise). In real-world channels, sources of noise may be spatially correlated. This paper offers a solution via pixel-shuffled downsampling, which effectively erases the spatial correlations of real-world noise channels to make denoising more approachable. Good work on the poster Lee et al.
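The pixel-shuffled downsampling (PD) rearrangement itself is simple to sketch: split the image into sub-images of every s-th pixel, so that noise which was correlated between neighboring pixels becomes approximately independent within each sub-image. This shows only the rearrangement, not the paper's full asymmetric train/test scheme:

```python
import numpy as np

def pixel_shuffle_downsample(img, stride=2):
    """Split an (H, W) image into stride**2 sub-images of every stride-th
    pixel. Pixels that were `stride` apart in the original become adjacent
    here, so short-range noise correlations are broken up and a blind-spot
    denoiser's independence assumption approximately holds again.
    """
    h, w = img.shape
    assert h % stride == 0 and w % stride == 0
    subs = [img[i::stride, j::stride]
            for i in range(stride) for j in range(stride)]
    return np.stack(subs)        # (stride**2, H/stride, W/stride)

img = np.arange(16).reshape(4, 4)
subs = pixel_shuffle_downsample(img, stride=2)
print(subs[0])  # [[ 0  2]
                #  [ 8 10]]
```

The denoiser is then run on the sub-images, and the results are shuffled back into the original layout; the larger the stride, the more thoroughly the correlations are broken, at the cost of spatial resolution seen by the network.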
EMOCA: Emotion Driven Monocular Face Capture and Animation by Radek Daněček, Michael J. Black, and Timo Bolkart
Ok, this is the second referral to this poster. FYI, the first write-up was an experiment to see if we could generate some new training data of unique facial expressions to support the authors Daněček et al in their interesting research on emotion driven face capture and animation. As a favor to them, if you are able to introduce them to this write-up, can you please stealthily record their facial expressions as they read the first, in order to grant them some authentic training data of unique facial expressions? Good job on the poster guys, I hope you enjoyed the practical joke.
Day 7: The Saturday Awards
Saying goodbye is hard to do. Despite the fact that there were 5,000 people in attendance, over the week you started to recognize familiar faces. Each of these posters was a work of art in its own right. Every chart and graph encapsulating months of dedication. Every punctuation mark considered with care, every word carrying significance. These researchers were not just doing this for fun. There were jobs on the line. Grants being considered. Postdoc and government appointments. Tenure decisions. Every person walking by their poster could be a decision maker. You just don’t know who you’re trying to impress, so the smart ones try to impress everyone. They try to convey matters of the highest complexity, with towering barriers to entry, in a manner that someone with no domain knowledge could spend five minutes looking over a few square feet of posterboard and walk away with the insights earned from months of dedicated investigation. Because make no mistake, academic publishing is an international competitive environment of the highest order. It is like the Olympics, only if you win you may create a new industry, or solve one of the riddles of humanity, or save millions of lives. These are the questions being answered, little by little, piece by piece; every paper pushes the frontiers of humanity forward, some in small increments and others in great leaps. We owe a thank you to each and all of them. Even the ones that didn’t make it into this “prestigious” blog awards program.
Oh yeah and also, next time you’re in town, please remember to tip the musicians.
Al-Halah, Z., Ramakrishnan, S. K., and Grauman, K. Zero experience required: Plug & play modular transfer learning for semantic visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17031–17041, June 2022.
Beyer, L., Zhai, X., Royer, A., Markeeva, L., Anil, R., and Kolesnikov, A. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10925–10934, June 2022.
Chen, R. J., Chen, C., Li, Y., Chen, T. Y., Trister, A. D., Krishnan, R. G., and Mahmood, F. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16144–16155, June 2022.
Cui, J., Qiu, H., Chen, D., Stone, P., and Zhu, Y. Coopernaut: End-to-end driving with cooperative perception for networked vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17252–17262, June 2022.
Daněček, R., Black, M. J., and Bolkart, T. Emoca: Emotion driven monocular face capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20311–20322, June 2022.
Girdhar, R., Singh, M., Ravi, N., van der Maaten, L., Joulin, A., and Misra, I. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16102–16112, June 2022.
Hebbalaguppe, R., Prakash, J., Madan, N., and Arora, C. A stitch in time saves nine: A train-time regularizing loss for improved neural network calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16081–16090, June 2022.
Huang, L., Zhou, Y., Wang, T., Luo, J., and Liu, X. Delving into the estimation shift of batch normalization in a network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 763–772, June 2022.
Jin, D., Park, W., Jeong, S.-G., Kwon, H., and Kim, C.-S. Eigenlanes: Data-driven lane descriptors for structurally diverse lanes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17163–17171, June 2022.
Jones, R. K., Habib, A., Hanocka, R., and Ritchie, D. The neurally-guided shape parser: Grammar-based labeling of 3d shape regions with approximate inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11614–11623, June 2022.
Kim, K., Kwon, T., and Ye, J. C. Noise distribution adaptive self-supervised image denoising using tweedie distribution and score matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2008–2016, June 2022.
Lee, W., Son, S., and Lee, K. M. Ap-bsn: Self-supervised denoising for real-world images via asymmetric pd and blind-spot network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17725–17734, June 2022.
Leistner, T., Mackowiak, R., Ardizzone, L., Köthe, U., and Rother, C. Towards multimodal depth estimation from light fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12953–12961, June 2022.
Li, Y., Yu, A. W., Meng, T., Caine, B., Ngiam, J., Peng, D., Shen, J., Lu, Y., Zhou, D., Le, Q. V., Yuille, A., and Tan, M. Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17182–17191, June 2022.
Mangalam, K., Fan, H., Li, Y., Wu, C.-Y., Xiong, B., Feichtenhofer, C., and Malik, J. Reversible vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10830–10840, June 2022.
Mao, X., Qi, G., Chen, Y., Li, X., Duan, R., Ye, S., He, Y., and Xue, H. Towards robust vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12042–12051, June 2022.
Meng, Q., Xiao, M., Yan, S., Wang, Y., Lin, Z., and Luo, Z.-Q. Training high-performance low-latency spiking neural networks by differentiation on spike representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12444–12453, June 2022.
Mirza, M. J., Micorek, J., Possegger, H., and Bischof, H. The norm must go on: Dynamic unsupervised domain adaptation by normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14765–14775, June 2022.
Mok, J., Na, B., Kim, J.-H., Han, D., and Yoon, S. Demystifying the neural tangent kernel from a practical perspective: Can it be trusted for neural architecture search without training? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11861–11870, June 2022.
Roth, K., Vinyals, O., and Akata, Z. Non-isotropy regularization for proxy-based deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7420–7430, June 2022.
Surís, D., Epstein, D., and Vondrick, C. Globetrotter: Connecting languages by connecting images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16474–16484, June 2022.
Venkataramanan, S., Kijak, E., Amsaleg, L., and Avrithis, Y. Alignmixup: Improving representations by interpolating aligned features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19174–19183, June 2022.
Wang, Z., Dong, X., Xue, H., Zhang, Z., Chiu, W., Wei, T., and Ren, K. Fairness-aware adversarial perturbation towards bias mitigation for deployed deep models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10379–10388, June 2022.
Wyatt, J., Leach, A., Schmon, S. M., and Willcocks, C. G. Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 650–656, June 2022.
Xu, J., Chen, Z., Quek, T. Q., and Chong, K. F. E. Fedcorr: Multi-stage federated learning for label noise correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10184–10193, June 2022.
Yifan, W., Doersch, C., Arandjelović, R., Carreira, J., and Zisserman, A. Input-level inductive biases for 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6176–6186, June 2022.
Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12104–12113, June 2022.
Zhang, J., Zhu, R., and Ohn-Bar, E. Selfd: Self-learning large-scale driving policies from the web. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17316–17326, June 2022.
Zhou, X., Qi, C. R., Zhou, Y., and Anguelov, D. Riddle: Lidar data compression with range image deep delta encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17212–17221, June 2022.
Zhu, X., Zhu, J., Li, H., Wu, X., Li, H., Wang, X., and Dai, J. Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16804–16815, June 2022.