Facebook AI and UC Berkeley pick a fight with Transformers
With all the insane hype around GPT-3, DALL-E, PaLM, and many more, now is the perfect time to cover this paper.
Go through the Machine Learning news these days, and you will see Transformers everywhere (watch this video by IBM Technology for a quick overview of the idea). And for good reason. Since their introduction, Transformers have taken the world of Deep Learning by storm. While they were traditionally associated with Natural Language Processing, Transformers are now being used in Computer Vision pipelines too. Just in the last few weeks, we have seen Transformers power some insane applications in Computer Vision. It had started to seem like Transformers would replace Convolutional Neural Networks (CNNs) for generic Computer Vision tasks.
Researchers at Facebook AI and UC Berkeley, however, have something to add. In their paper, “A ConvNet for the 2020s”, the authors posit that a large part of the reason Transformers have been outperforming CNNs on vision-related tasks is the superior training protocols used with Transformers (which are a newer architecture). Thus, they argue, by improving the pipeline around the models we can close the performance gap between Transformers and CNNs. In their words,
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way.
The results are quite interesting, and they show that CNNs can even outperform Transformers on certain tasks. This is more proof that your Deep Learning pipelines can be improved with better training, rather than simply going for bigger models. In this article, I will cover some interesting findings from their paper. But first, some context on Transformers and CNNs, and on the advantages each kind of architecture brings to Computer Vision tasks.
CNNs: The OG Computer Vision Networks
Convolutional Neural Networks have been the OG Computer Vision Architecture since their inception. In fact, the foundations of CNNs are older than I am. CNNs were literally built for vision.
So what’s so good about CNNs? The main idea behind Convolutional Neural Nets is that they go through the image, segment by segment, and extract its main features. The earlier layers of a CNN typically extract the coarser features, such as edges and colors, while adding more layers allows features to be extracted at a much finer level of detail and abstraction.
This article goes into CNNs in more detail. For our purposes, one thing is important: CNNs have been the go-to for Computer Vision primarily due to their ability to build feature maps.
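If you prefer code to prose, here is a minimal PyTorch sketch of that idea (a toy example of my own, not code from the paper; the layer sizes are arbitrary): stacking convolutions means each successive layer summarizes a larger region of the image.

```python
# A toy feature extractor: early layers see small neighborhoods (edges, colors),
# deeper layers see larger regions and build more abstract feature maps.
import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layer: low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # downsample, widening the receptive field
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: more abstract features
    nn.ReLU(),
    nn.MaxPool2d(2),
)

image = torch.randn(1, 3, 224, 224)               # a dummy RGB image
feature_maps = tiny_cnn(image)
print(feature_maps.shape)                         # torch.Size([1, 32, 56, 56])
```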
Transformers: The New Generation?
Transformers came to vision from a different lineage: they are improvements on traditional Recurrent Neural Networks (RNNs). RNNs were created to handle sequential data, where what comes next depends on what came before. Let’s take a simple example. Imagine the sentence, “Don’t eat my _”. To fill in the _, we need to take context from the previous words in the sentence. Filling in random words wouldn’t do us any good. Unlike traditional feed-forward networks, RNNs carry information from prior inputs. This video by codebasics is a great introduction to the idea, for those of you who want to learn more.
However, RNNs have a flaw. Since they work on sequential data, order matters, which makes them very hard to parallelize: we have to feed in the inputs one at a time, in order. Transformers were created to solve this issue. Instead of stepping through the sequence, Transformers use attention to identify the important parts of the input and relate every position to every other position directly.
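To make “attention” a little less hand-wavy, here is a minimal sketch of scaled dot-product self-attention (my own illustration of the standard formulation, not code from the paper): every token is compared with every other token in a single matrix multiplication, which is exactly why there is no sequential bottleneck.

```python
# Scaled dot-product self-attention in a few lines (illustrative sketch).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # every token scored against every other
    weights = F.softmax(scores, dim=-1)                      # how much each token attends to the rest
    return weights @ v                                       # weighted mix: the important parts dominate

tokens = torch.randn(1, 10, 64)                              # 10 tokens, processed in parallel
out = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention
print(out.shape)                                             # torch.Size([1, 10, 64])
```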
Since Transformers can be parallelized, we have seen enormous models trained on enormous datasets. Google’s BERT and OpenAI’s GPT-3 are notable examples, and they have achieved some insane functionality. However, this leaves one question: Transformers were built for NLP, so why are they good for Computer Vision? Is it just blind luck plus lots of training?
Part of the answer lies in the way Transformers handle inputs. Transformers leverage encoders and decoders. The encoder takes your input and maps it into a latent space. The decoder takes vectors from that latent space and transforms them back into the output domain.
This can be used in a variety of ways in Computer Vision. Adversarial learning, reconstruction, image storage, and generation are some notable examples. It also plays a crucial role in DALL-E: we take the text input and encode it into the latent space, then a decoder turns the latent vectors into an image. This is how we are able to generate images from text descriptions. A similar idea powers Facebook AI’s Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors.
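As a rough mental model (a toy sketch of my own, nowhere near the scale or architecture of DALL-E or Make-A-Scene), an encoder-decoder pair looks like this:

```python
# Compress an input into a latent vector, then decode the latent vector back out.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))                   # input -> latent space
decoder = nn.Sequential(nn.Linear(128, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))   # latent -> image

image = torch.randn(1, 3, 32, 32)
latent = encoder(image)                    # a 128-dimensional point in the latent space
reconstruction = decoder(latent)           # generation = decoding latent vectors into images
print(latent.shape, reconstruction.shape)  # torch.Size([1, 128]) torch.Size([1, 3, 32, 32])
```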
While I was doing the research for this article, I actually learned of another reason why Vision Transformers have been so effective. It ties into their attention mechanism. It’s interesting enough to warrant its own article, but here is a summary.
The attention mechanism in Transformers allows them to identify the parts of a sentence that are important. Attention allows Transformers to filter out the noise and capture relationships between words that are even far apart.
Everyone already knows this in the context of NLP. What I didn’t know is that it holds true for Computer Vision too. The attention mechanism allows Transformers to keep a “global view of the image”, letting them extract features very different from those of ConvNets. Remember, CNNs use kernels to extract features, which restricts them to local features. Attention allows Transformers to bypass this limitation.
The picture above is taken from the very interesting paper, Do Vision Transformers See Like Convolutional Neural Networks? It’s interesting enough that I will do a breakdown of it later. The important aspect is the following quote, also from that paper.
…demonstrating that access to more global information also leads to quantitatively different features than computed by the local receptive fields in the lower layers of the ResNet
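For intuition into how a vision Transformer gets that global view from the very first layer, here is a minimal sketch of the standard ViT-style recipe (my own illustration, not code from either paper; the patch size and dimensions are arbitrary): the image is cut into patches, each patch becomes a token, and self-attention can then relate any patch to any other patch, no matter how far apart.

```python
# Patchify an image into tokens, then let every patch attend to every other patch.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=16, stride=16)   # non-overlapping 16x16 patches

image = torch.randn(1, 3, 224, 224)
tokens = patch_embed(image).flatten(2).transpose(1, 2)       # (1, 196, 96): 14x14 patch tokens
attention = nn.MultiheadAttention(embed_dim=96, num_heads=4, batch_first=True)
out, weights = attention(tokens, tokens, tokens)             # global mixing across all patches
print(out.shape, weights.shape)                              # (1, 196, 96) and (1, 196, 196)
```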
Clearly, Transformers are very powerful. The attention mechanism, large-scale training, and a modern architecture have seen them become the backbones for a lot of vision tasks. So the question is, do pure CNNs stand a chance? Here is how the modernized CNNs stack up against vision Transformers, taken from the paper.
Clearly, this is very exciting stuff. By improving the training pipelines around the CNN architecture, we can match SOTA transformers. This goes to show the power of setting up good Machine Learning training pipelines. They can compensate for using weaker models and are often more cost-effective.
Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
Modernizing The CNN Infrastructure
Clearly, we can improve our standard models and achieve amazing performance. Let’s talk about some of the interesting design tweaks the authors made to make this possible (to see them all, make sure you read the paper). I will also go over some insights I had while reading the paper, which I would love to discuss with y’all. Share your thoughts in the comments/with me through IG/LinkedIn/Twitter (links at the end of this article).
Increasing Kernel Size
Vision Transformers attend over larger windows than the small kernels typical of ConvNets. The authors point out, “Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7x7, significantly larger than the ResNe(X)t kernel size of 3x3. Here we revisit the use of large kernel-sized convolutions for ConvNets.” The authors experiment with kernel sizes and find that larger kernels improve performance. However, they find that the benefits stop accumulating once they reach 7x7 kernels.
With all of these preparations, the benefit of adopting larger kernel-sized convolutions is significant. We experimented with several kernel sizes, including 3, 5, 7, 9, and 11. The network’s performance increases from 79.9% (3x3) to 80.6% (7x7), while the network’s FLOPs stay roughly the same.
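In code, the core of this change is simply widening the depthwise convolution that does the spatial mixing (a hedged sketch; the real ConvNeXt block has more pieces around it, and the channel count here is illustrative):

```python
# Depthwise convolutions (groups == channels) keep FLOPs low even with large kernels.
import torch
import torch.nn as nn

small_kernel = nn.Conv2d(96, 96, kernel_size=3, padding=1, groups=96)  # ResNe(X)t-style 3x3
large_kernel = nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96)  # ConvNeXt-style 7x7

x = torch.randn(1, 96, 56, 56)
print(small_kernel(x).shape, large_kernel(x).shape)  # both torch.Size([1, 96, 56, 56])
```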
Inverted Bottleneck Design
The bottleneck design is common in neural networks: blocks shrink the representation down, do their heavy computation in the narrow middle, and then expand back out, much like encoder-decoder pairs downsample into a latent space and upsample out of it. However, the authors found an inverted bottleneck design, where the block expands first and then projects back down, to be superior (it is also something Transformers use: the MLP in a Transformer block is four times wider than its input).
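To make the distinction concrete, here is a rough sketch of a classic bottleneck block versus an inverted one (channel counts are illustrative, and I have left out the normalization, activations, and residual connections of the real blocks):

```python
# Classic bottleneck: wide -> narrow -> wide (heavy computation in the narrow middle).
import torch
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(384, 96, kernel_size=1),                           # reduce channels
    nn.Conv2d(96, 96, kernel_size=3, padding=1),                 # spatial mixing at low width
    nn.Conv2d(96, 384, kernel_size=1),                           # expand back
)

# Inverted bottleneck: narrow -> wide -> narrow (like the 4x-wide MLP in a Transformer block).
inverted = nn.Sequential(
    nn.Conv2d(96, 384, kernel_size=1),                           # expand channels 4x
    nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=384),   # depthwise spatial mixing at high width
    nn.Conv2d(384, 96, kernel_size=1),                           # project back down
)

print(bottleneck(torch.randn(1, 384, 56, 56)).shape)  # torch.Size([1, 384, 56, 56])
print(inverted(torch.randn(1, 96, 56, 56)).shape)     # torch.Size([1, 96, 56, 56])
```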
The benefit of this design pattern is shown below-
The increased performance in larger networks is an interesting occurrence. I would be interested in seeing whether this improvement continues to scale with even bigger models. This would have a lot of potential for the extremely large-scale language models that have become the trend today. If somebody has an idea of why it happens, please do share. I’d love to learn.
Tweaking Activation Functions
As I’ve covered before, activation functions are a big deal. The authors made some changes to the ResNet blocks so they more closely resemble Transformer blocks. The first was swapping the activation function from ReLU to GELU (Gaussian Error Linear Unit).
The authors also changed the number of activation functions in a block. In their words, “As depicted in Figure 4, we eliminate all GELU layers from the residual block except for one between two 1x1 layers, replicating the style of a Transformer block.” This resulted in a 0.7% improvement.
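Putting the activation changes together with the earlier tweaks, a block in this style looks roughly like the sketch below (my own simplified reading of the block layout; I use GroupNorm as a stand-in for the channels-first LayerNorm the paper actually uses):

```python
# One depthwise 7x7 conv, one normalization, and a single GELU between the two 1x1 layers.
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # large depthwise kernel
        self.norm = nn.GroupNorm(1, dim)            # stand-in for the paper's channels-first LayerNorm
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)   # expand (inverted bottleneck)
        self.act = nn.GELU()                        # the single activation, between the two 1x1 layers
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)   # project back down

    def forward(self, x):
        return x + self.pwconv2(self.act(self.pwconv1(self.norm(self.dwconv(x)))))

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtStyleBlock()(x).shape)  # torch.Size([1, 96, 56, 56])
```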
Separating Downsampling Layers
Taking another cue from Transformers, the authors decided to separate out downsampling. Instead of doing it inside the residual blocks at the start of each stage (as a traditional ResNet does), they use standalone downsampling layers between stages (similar to how vision Transformers such as Swin merge patches between stages). The ConvNeXt block thus looks like this-
This has quite an impact, with the authors pointing out that by following this strategy “We can improve the accuracy to 82.0%, significantly exceeding Swin-T’s 81.3%.”
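A hedged sketch of what such a standalone downsampling step can look like (dimensions are illustrative, and again GroupNorm stands in for the channels-first normalization used to keep training stable):

```python
# Between stages: normalize, then halve the resolution and widen the channels
# with a strided 2x2 convolution, instead of downsampling inside a residual block.
import torch
import torch.nn as nn

downsample = nn.Sequential(
    nn.GroupNorm(1, 96),                           # stand-in for channels-first LayerNorm
    nn.Conv2d(96, 192, kernel_size=2, stride=2),   # 56x56 -> 28x28, 96 -> 192 channels
)

x = torch.randn(1, 96, 56, 56)
print(downsample(x).shape)  # torch.Size([1, 192, 28, 28])
```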
Note on AGI
The changes I covered are by no means comprehensive. There were a ton of changes the authors implemented to make their networks more “modern”. Fortunately, they list the changes in the appendix of the paper, along with the benefit each change brought with it.
As I was reading this paper, it got me thinking about Artificial General Intelligence (AGI). Many of these changes were inspired by success in other models/networks (the authors took inspiration from both Transformers and other CNNs). Given the results of research into data imputation, learning rates, batch sizes, architectures, etc., it leads me to a question: is the key to AGI "a perfect training protocol"? Is there a perfect configuration for training that works across tasks? I cover this in slightly more detail in this 6-minute video but would love to hear your thoughts.
That’s it for this article. Interesting AGI hypotheticals aside, this paper once again shows us the importance of constantly learning about the foundational research going on in Machine Learning. To develop your foundational ML skills to the point where you can engage with papers, check out my article, How to learn Machine Learning in 2022. It goes over the free online resources you can leverage to level up your Machine Learning.
To help me write better articles and understand you, fill out this survey (anonymous). It will take 3 minutes at most and will allow me to improve the quality of my work. Please do use my social media links to reach out with any additional feedback. All feedback helps me improve.
For Machine Learning, a base in Software Engineering is crucial. It will help you conceptualize, build, and optimize your ML. My daily newsletter, Coding Interviews Made Simple covers topics in Algorithm Design, Math, Recent Events in Tech, Software Engineering, and much more to make you a better developer. I am currently running a 20% discount for a WHOLE YEAR, so make sure to check it out.
I created Coding Interviews Made Simple using new techniques discovered through tutoring multiple people into top tech firms. The newsletter is designed to help you succeed, saving you from hours wasted on the Leetcode grind. You can read the FAQs and find out more here
Feel free to reach out if you have any interesting jobs/projects/ideas for me as well. Always happy to hear you out.
For monetary support of my work, the following are my Venmo and PayPal. Any amount is appreciated and helps a lot. Donations unlock exclusive content such as paper analysis, special code, consultations, and specific coaching:
Venmo: https://account.venmo.com/u/FNU-Devansh
Paypal: paypal.me/ISeeThings
Reach out to me
Use the links below to check out my other content, learn more about tutoring, or just to say hi. Also, check out the free Robinhood referral link. We both get a free stock (you don’t have to put in any money), and there is no risk to you. So not using it is just losing free money.
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
If you’re preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/
Get a free stock on Robinhood: https://join.robinhood.com/fnud75