Top Deep Learning Papers of 2021

Diego Bonilla
10 min read · Dec 19, 2021


We all hate long, meaningless introductions to articles, so I'll go straight to the point. Here are the papers I consider some of the most interesting and promising deep learning papers of 2021.

The idea is to explain them briefly, mixing very-easy and very-hard wording, so the article can be somewhat useful both to beginners and to more knowledgeable readers. Having said that, and as every Italian plumber would say, here we go!

A present from me to you

⚠️ Caution!

The choice of topics is a personal and very biased one, so unfortunately it leans more toward Computer Vision than NLP, with fewer GANs, etc… Are you still interested? Nice!

CLIP (https://arxiv.org/pdf/2103.00020.pdf)

Visual + Language Learning is trendy 📈! And the main culprit is an OpenAI paper that makes it easier to scale image-recognition tasks because it does not require time-consuming, ImageNet-style manual labeling. It learns from raw text instead of manually defined labels, achieving State Of The Art results on several famous datasets.

Is it a new learning concept? Nope, but it is the most ambitious one so far. They collected a dataset of 400 million image+text pairs to train State Of The Art models: a modified Transformer architecture for text encoding, and several image encoders: ResNet-50, ResNet-101, EfficientNet-style scaled ResNets and Vision Transformers (all modified). The best-performing one is the Vision Transformer ViT-L/14.

How does it work? Easy: contrastive learning, a well-known technique for Zero-Shot and Self-Supervised Learning. Given an image paired with its correct text description, pull them closer together. Given an image paired with a wrong text description, push them apart. This way, when querying an image with a set of sentences, the closest one is the "most correct" one.

The N images and their N text descriptions are encoded with the image and text encoders respectively, mapping them into lower-dimensional feature spaces. Next, a simple linear projection maps those feature spaces into a shared feature space called the multi-modal embedding space, where images and texts are compared via cosine similarity (the closer, the more similar) using contrastive learning with positive+negative pairs.

CLIP approach
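To make the contrastive objective concrete, here is a minimal PyTorch-style sketch of the symmetric loss over a batch of N matching image-text pairs. The encoders and the temperature value are placeholders of mine; this follows the paper's pseudocode only loosely.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss for a batch of N matching image-text pairs.

    image_features, text_features: (N, d) encoder outputs already projected
    into the shared multi-modal embedding space.
    `temperature` stands in for CLIP's learnable temperature parameter.
    """
    # L2-normalize so the dot product becomes cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of cosine similarities between every image and every text
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text: the diagonal holds the positives
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image)
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```

Every off-diagonal entry of the similarity matrix acts as a negative pair, which is why a single batch of N pairs already gives the model plenty of contrast to learn from.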

CLIP can handle multiple text descriptions of the same image as well as polysemy, and it outperforms the State Of The Art on some of the most famous datasets like ImageNet, CIFAR and Pascal VOC (while underperforming the SOTA on others like MNIST, Flowers102 and KITTI Distance). Also, since it uses contrastive learning, it is a Zero-Shot learner and can generalize to unseen object categories better than previous Zero-Shot models.

Visualization of CLIP Zero-Shot classification

Diffusion Models (many implementations)

Let’s be real here, we all hate GANs. Their training is very unstable, needs hours and hours of fine-tuning, and NVIDIA’s StyleGAN implementation on GitHub is some infuriating bulls**t to use. Now that we’ve all confessed our secret, we can say almost for sure that no one is going to cry when they hear that GANs are no longer the State Of The Art for image generation and translation.

Are you talking about VQ-VAEs? Nope. Generative flows? Nein. I’m talking about Dr. Diffusion or: How I Learned to Stop Worrying and Love the Noise.

Samples of OpenAI’s Denoising Diffusion model https://arxiv.org/pdf/2105.05233.pdf

We can take an image of a cute dog and add some noise to it; we can still perfectly see the dog, so let’s add a little more, and more, and more, until the initial dog image is unrecognizable and all you see is random noise. Well, if a very artistic person were witnessing the whole noise-adding process step by step, the artist would be able to revert the process at each timestep, so that the initial dog could be recovered again. Yay, doggo is back! 🐶

Given a data distribution, we can define a forward Markovian diffusion process that adds Gaussian noise at each timestep t until t is large enough that the image is nearly an isotropic Gaussian distribution; then we can reverse the process step by step with the help of a neural network that approximates the distribution of the initial data (https://arxiv.org/pdf/2102.09672.pdf). At each timestep, a slightly less noisy image is predicted. In the case of OpenAI’s DDM, this is done with a UNet architecture with global attention and a projection of the timestep embedding into each residual block.
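As a rough illustration of the forward (noising) half, here is a minimal sketch that jumps straight to timestep t using the closed form x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε. The linear schedule values are illustrative defaults, not the exact ones used by OpenAI.

```python
import torch

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Illustrative linear variance schedule beta_1..beta_T
    betas = torch.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t
    return alphas_cumprod

def q_sample(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) for a batch of images x0 and timesteps t."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)          # broadcast over (B, C, H, W)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise  # the denoising network is trained to predict `noise` from (xt, t)

# Usage sketch: x0 scaled to [-1, 1], shape (B, 3, H, W); t is a LongTensor of shape (B,)
```

Training then boils down to picking random timesteps, noising the clean images with `q_sample`, and regressing the network's output towards the noise that was added.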

Yeah, high-quality image generation is cool and all, but can the output be conditioned? Well, can a Python programmer learn Java without screaming? Well, no…, but yeah, it can be conditioned… Google’s SR3 model converts a very low-resolution image into a crisp HD one by learning to transform a standard normal distribution into an empirical data distribution using this sequence of refinement steps. The idea is similar to the process explained above, except that the initial low-resolution image is also taken into account during denoising, merged channel-wise with the current timestep’s noisy image. The process is repeated 2000 times and it’s also trained with a UNet architecture with some fancy modifications.

SR3 Output
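A hedged sketch of that conditioning trick as I read it: the low-resolution input is upsampled to the target size and concatenated channel-wise with the current noisy image before going into the denoiser. The `unet` here is a placeholder module, not Google's actual model.

```python
import torch
import torch.nn.functional as F

def sr3_denoise_step(unet, x_noisy, x_lowres, t):
    """One conditional denoising step, SR3-style (simplified sketch).

    x_noisy:  (B, 3, H, W) current noisy high-resolution estimate at timestep t
    x_lowres: (B, 3, h, w) the low-resolution image we are super-resolving
    unet:     placeholder denoiser taking a 6-channel input plus the timestep
    """
    # Upsample the low-res image to the target resolution...
    cond = F.interpolate(x_lowres, size=x_noisy.shape[-2:],
                         mode="bicubic", align_corners=False)
    # ...and merge it with the noisy image along the channel dimension
    unet_input = torch.cat([x_noisy, cond], dim=1)
    return unet(unet_input, t)   # predicts the noise (or the denoised image)
```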

To finish with the icing on the cake, the latest Google work on this matter: Palette. Not only does it produce State Of The Art results on several image-to-image translation tasks, it also needs no task-specific hyperparameter tuning, no architecture customization and no auxiliary loss (suck it, GANs!). The main changes from the previous work are further modifications to the UNet architecture and no class conditioning (only image conditioning).

Palette Output
Palette. Panorama made from extrapolating just the center 256x256px image

____Mixers (many implementations)

Computer Vision people hate NLP people like Englishmen and Scots! Or Welshmen and Scots! Or Japanese and Scots! Or Scots and other Scots! Damn NLPs! They ruined NeurIPS!

Transformers with Self-Attention came to the NLP field to stay. They performed extremely well on every language task and scaled easily to big datasets. But the peace was broken when someone came up with the idea of bringing that concept to Computer Vision. We all said “It’s impossible to perform per-pixel attention!”, “It won’t work!”, “It’s too memory-intensive!”, until that same someone performed attention over 16x16 patches and outperformed several image-classification SOTAs. We Computer Vision people were devastated by this invasion: “Noam Chomsky was right… intelligence comes from language…”. It all seemed lost; every CV paper used some self-attention mechanism, from self-supervised to image-to-image (even denoising! I never thought we’d lose denoising…). Suddenly, if you didn’t have a NASA computer and the whole of Google as a dataset, you were out of the game.

But then it came. Like a summer’s breeze that whispers in your ear: “MLP-Mixers…”. The salvation for the NLP haters, in the form of the least expected individual, the pawn in this game: the Perceptron. Because no one in vision forgot about the importance and power of the Perceptron, RIGHT?? Suddenly it all made sense: the performance of Vision Transformers came solely from mixing the patches! Nice! Some per-patch linear embeddings, mixing layers, global average pooling… et voilà, outstanding results that can compete with (though not yet overtake) Vision Transformers. Only using Multi-Layer Perceptrons.

MLP-Mixers mix tokens with weights that don’t depend on the input (unlike self-attention), are easier to train, and don’t need positional encodings (the token-mixing layers are already sensitive to the order of the patches).
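For reference, here's a minimal sketch of one Mixer layer: a token-mixing MLP applied across patches and a channel-mixing MLP applied per patch, each with LayerNorm and a residual connection. The hidden sizes are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

def mlp(in_dim, hidden_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, in_dim))

class MixerBlock(nn.Module):
    """One MLP-Mixer layer operating on an input of shape (B, num_patches, channels)."""
    def __init__(self, num_patches, channels, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mix = mlp(num_patches, token_hidden)     # mixes information across patches
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mix = mlp(channels, channel_hidden)    # mixes information across channels

    def forward(self, x):                                   # x: (B, P, C)
        # Token mixing: transpose so the MLP acts over the patch dimension
        y = self.norm1(x).transpose(1, 2)                   # (B, C, P)
        x = x + self.token_mix(y).transpose(1, 2)
        # Channel mixing: plain per-patch MLP
        x = x + self.channel_mix(self.norm2(x))
        return x
```

Stack a bunch of these after a per-patch linear embedding, finish with global average pooling and a classifier head, and that's essentially the whole model.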

The Computer Vision people were satisfied. Well, almost. The MLP-Mixers were nice and all, but they were lacking something… the only thing that could truly satisfy a Computer Vision person: convolutions! And thus, the ConvMixers were born. And what a neonate… it’s still under double-blind review and already outperforms ResNets, Vision Transformers and MLP-Mixers using only standard convolutions.

ConvMixer architecture

The architecture builds on the MLP-Mixer idea that the real performance of Vision Transformers comes from the patch-based representation rather than from the Transformer architecture itself. ConvMixers operate on patches, maintain resolution and size throughout all layers, produce no bottlenecks, do channel-wise mixing, and the entire architecture fits in a Tweet. Suddenly, people with normal Deep Learning PCs can use SOTA techniques again. Power to the people!
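Here's a slightly more readable sketch of that tweet-sized idea, assuming PyTorch: a strided conv as patch embedding, then repeated blocks of a large-kernel depthwise conv (spatial mixing, with a residual) followed by a 1x1 conv (channel mixing). Default sizes are illustrative.

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim=256, depth=8, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: a strided conv turns the image into dim-channel "patches"
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Spatial mixing: depthwise conv with a large kernel, plus a residual
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Channel mixing: pointwise (1x1) conv
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```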

Self-Supervised Learning without Contrastive Pairs (https://arxiv.org/pdf/2102.06810.pdf)

In the CLIP section we talked about contrastive learning and how it learns an embedding by minimizing/maximizing distances between pairs. CLIP uses positive and negative pairs to learn that embedding, but approaches like BYOL or SimSiam don’t require positive+negative data pairs: only two augmented views of the same image, fed into a Siamese neural network (a model for comparing entities), with BYOL adding a momentum encoder and SimSiam using a stop-gradient operation in one of its branches. The idea behind both is that one branch (the predictor branch) learns to match the other branch (the online branch), and a balancing emerges that ensures any matching between the online and target representations is not attributable solely to the predictor weights. Weight decay and the stop-gradient help with this balancing. Both methods are more efficient, simpler and need smaller batch sizes, while still maintaining SOTA.

Two-layer setting with a linear, bias-free predictor
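A minimal SimSiam-style sketch (the encoder and predictor modules are placeholders of mine): two augmented views of the same images, a shared encoder, a small predictor on one branch and a stop-gradient on the other, trained with a symmetrized negative cosine similarity.

```python
import torch.nn.functional as F

def simsiam_loss(encoder, predictor, view1, view2):
    """Negative-cosine-similarity loss with stop-gradient, SimSiam-style sketch.

    encoder:   backbone + projection MLP (placeholder module)
    predictor: small prediction MLP (placeholder module)
    view1, view2: two augmentations of the same batch of images
    """
    z1, z2 = encoder(view1), encoder(view2)      # projections
    p1, p2 = predictor(z1), predictor(z2)        # predictions

    def neg_cos(p, z):
        # the stop-gradient on the target branch is what prevents collapse
        z = z.detach()
        return -F.cosine_similarity(p, z, dim=-1).mean()

    # symmetrized loss: each branch predicts the other's (detached) projection
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```

No negative pairs anywhere: the asymmetry between predictor and stop-gradient does the job the negatives used to do.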

The paper linked in the section title explains the magic beneath these methods. It turns out it’s math, the boring part of math. They dissected the model by simplifying its training dynamics, which leads to some observations: larger weight decay helps avoid collapse, because the online network needs to grow along with the predictor, so weight decay slows the predictor down while still modeling invariance to the augmentations correctly; and a larger predictor learning rate can play the same role as larger weight decay.

DirectPred is also introduced: a predictor that avoids gradient descent entirely by estimating a correlation matrix of the predictor’s inputs and setting the predictor’s weights directly as a function of it. The justification comes from the observed eigenspace alignment between the predictor’s weights and that correlation matrix, and from the convergence towards an invariant parabola driven by weight decay.
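My rough reading of that idea in code form; a sketch, not the paper's exact recipe, since the moving-average estimate of the correlation matrix and the small boost for weak eigen-directions are only approximated here.

```python
import torch

def direct_pred_weights(corr, eps=0.3):
    """Set a linear predictor's weight matrix directly from the correlation
    matrix of its inputs, instead of learning it by gradient descent.

    corr: (d, d) (moving-average) correlation matrix of the predictor inputs.
    eps:  boost for weak eigen-directions (hyperparameter; the value here is
          illustrative, not necessarily the paper's).
    """
    s, U = torch.linalg.eigh(corr)            # eigenvalues s (ascending), eigenvectors U
    s = s.clamp(min=0.0)
    p = (s / s.max()).sqrt() + eps            # rescaled square-root spectrum + boost
    return U @ torch.diag(p) @ U.t()          # W_p shares the eigenspace of `corr`
```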

Phew… That was dense.

Honorable Mentions

Before finishing, I wanted to mention some of what I think are amazing ideas that gained real momentum this year (though not necessarily born this year) and that are going to shape the near future of AI:

  • How to represent part-whole hierarchies in a neural network. I’m going to quote a very good description made by Yannic Kilcher that I think describes this paper better than I ever could: “Geoffrey Hinton describes GLOM, a Computer Vision model that combines transformers, neural fields, contrastive learning, capsule networks, denoising autoencoders and RNNs. GLOM decomposes an image into a parse tree of objects and their parts. However, unlike previous systems, the parse tree is constructed dynamically and differently for each input, without changing the underlying neural network. This is done by a multi-step consensus algorithm that runs over different levels of abstraction at each location of an image simultaneously. GLOM is just an idea for now but suggests a radically new approach to AI visual scene understanding.”
  • Knowledge Distillation. Neural networks have grown bigger and bigger and need more computational resources each year. One way to transfer their knowledge to a smaller network while maintaining accuracy is the so-called Knowledge Distillation. First defined by Hinton (he’s everywhere), it is basically a Student-Teacher learning methodology that extracts the most important information from a huge network into a smaller one (see the sketch after this list). This paper, I think, explains very broadly the SOTA and new outlooks of KD.
  • Self/Zero/Un-Supervised Learning. The Deep Learning community has developed amazing architectures that really benefit from being trained on huge amounts of data. The bottleneck now lies in gathering and labeling that data, which requires hours upon hours of human work and is of course highly inefficient. This paper (focused on Self-Supervised Learning) explains very neatly the advantages and disadvantages of letting the network generate its own labels and how that changes the network’s internal representation of the data.
  • Capsule Networks. Hinton? Hinton! We mentioned them in GLOM, and the concept is nowhere close to being from 2021, but something tells me it is going to grow in magnitude over the coming years. The main idea is adding more structure to a standard CNN in the form of the probability of an observation together with its pose. This way, image recognition gains additional spatial robustness, e.g. to transformations of the image. It also rejects the idea of pooling, which Hinton compares to the Antichrist (biologically speaking).
  • Generative Flows. Unsupervised learning, reinforcement learning, image generation… you name it! Normalizing-flow-based distribution modeling is coming to your city this summer, and it’s going to stay for a while. Amazon Alexa’s voice is generated using these. Is it an easy concept to understand? No. Really? Should I give up on my dreams of learning normalizing flows? Never give up! It is an amazing way to model the data likelihood directly and it has produced amazing results compared to SOTA image and audio generation, but the math is strong with this one. Think of it this way: it’s just going to take a little more time to get the general idea than other concepts. You’ve got this!
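As promised in the Knowledge Distillation bullet above, here's a minimal sketch of Hinton-style distillation: the student matches the teacher's temperature-softened outputs with a KL term, blended with the usual cross-entropy on the hard labels. The temperature and weighting values are illustrative, not canonical.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style knowledge distillation loss (sketch).

    student_logits, teacher_logits: (B, num_classes) raw outputs
    labels: (B,) ground-truth class indices
    T:      temperature that softens both distributions
    alpha:  weight of the distillation term vs. the hard-label term
    """
    # Soft targets from the teacher, soft predictions from the student
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL term, scaled by T^2 to keep gradient magnitudes comparable
    kd = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * (T * T)

    # Standard cross-entropy on the true labels
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce
```

The softened probabilities carry the "dark knowledge" (how the teacher ranks the wrong classes), which is exactly the information the hard labels throw away.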

There is more stuff I think is going to grow next year that I wanted to include, like Lambda Networks or Deep Riemannian Manifold Learning, but there’s gotta be an end to everything, and we have arrived at ours.

Soooo… this was pretty much my experience with Deep Learning this year. I hope I did well enough to withstand the Deep Learning nerds’ criticism. I’m not good with goodbyes, so I’ll leave you with a video of a grandma using a voice-to-text recognition keyboard. Happy new year!

Follow me on LinkedIn and see my projects in GitHub! Did you like the story? Leave a comment below and share it in your socials!

Funny gram using tech

PS: Sorry for the NLP vs CV joke, I know we get along pretty well ❤️. Also, any comment on anything missing or not right is very welcome. Long live the Deep Learning community 🤖!
