3 Overlooked Things About DeepMind's Flamingo: A Large Model for Computer Vision

After Google’s Pathways, this is a further indication of the potential for Multi-Modal Training. And much more

Devansh
Geek Culture
8 min read · May 26, 2022


In late April, DeepMind posted Tackling multiple tasks with a single visual language model. Their results showed that their proposed model, Flamingo, outperformed the previously SOTA models on a variety of tasks.

These results are exciting for a variety of reasons, and they carry interesting implications that we will discuss further on.

In this post, I will go over some of the interesting takeaways from this super exciting publication (and their detailed preprint, Flamingo: a Visual Language Model for Few-Shot Learning). Their model setup, in particular, is very interesting because of the design choices the team implemented and how those choices might influence the future of AI/Large Scale Machine Learning. However, people are overlooking some important details, which I want to focus on. To see why they matter, let's first cover the context behind the field.

This level of recognition is staggering. Certainly worth getting excited about

The History of LLMs

We have seen a lot of potential for Large Language Models (LLMs) in NLP over the last few months. By training Neural Networks with over 100 Billion Parameters, we can apply these LLMs to a variety of tasks. This has allowed us to create models with a much deeper representation of knowledge and a plethora of capabilities.

How adding parameters to the PaLM model adds to the capabilities of the model. Source: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

However, all these models have been focused mostly on NLP tasks. GPT, PaLM, and Facebook's OPT are all primarily Language Models, although they have seen some success with vision tasks. They have been exceptional at tasks combining NLP and Vision; we've seen a lot of this recently with generating images from captions.

Taken from Make-A-Scene. The quality of images that Meta’s AI generates is stunning. Read about how this ties into their MetaVerse aspiration

The DeepMind team took a similar approach, training with both vision and language models. However, they directed their focus toward vision tasks. Below is an example of their model working on the very interesting Soup Monster test.

With all the background out of the way, let's dive into some of the things that stood out to me.

Multi-Modal Training continues to live up to the hype. But…

Those of you who have been following my content recently know I've been more and more interested in this idea. Traditionally, we developed different models for different tasks: each dataset/task had its own custom-trained model. We also realized that we could custom-wire our networks for different kinds of tasks. This led to the development of CNNs for Vision and RNNs for NLP. These were specialized for their domains.

Obviously, Random Forests are the greatest technique known to man. All other ideas are fighting for second place. For more wisdom, check out my Twitter.

However, transformers flipped this quite a bit. As I’ve covered in more detail here, the attention mechanism has great utility in both NLP and CV. This has made them ideal for the large-scale multi-modal training that is gaining more traction in Deep Learning.

The guiding property is interesting. As we scale into more abilities, we might just see one final model do everything. This would align well with Google’s Pathways vision

Multi-Modal training simply refers to training your models with multiple kinds of input. For example, Flamingo uses both text and images to produce its final results. The advantage of such a system is that the models have more information to learn from. This allows them to develop a much deeper knowledge of the domain.

Taken from Introducing Pathways: A next-generation AI architecture
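To make the idea concrete, here is a minimal sketch of what "multiple kinds of input" can mean mechanically: each modality gets its own encoder, and the resulting embeddings are fused into one feature vector for a downstream model. This is a toy illustration with random projections standing in for real encoders, not Flamingo's actual architecture (which interleaves a vision encoder with a frozen language model).

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_image(image: np.ndarray) -> np.ndarray:
    # Stand-in for a vision encoder: flatten the image and project it
    # to a 64-dimensional embedding with a (random) linear map.
    proj = rng.standard_normal((image.size, 64))
    return image.flatten() @ proj

def embed_text(token_ids: list) -> np.ndarray:
    # Stand-in for a language encoder: average per-token embeddings
    # looked up from a (random) embedding table.
    table = rng.standard_normal((1000, 64))
    return np.mean([table[t] for t in token_ids], axis=0)

def multimodal_features(image: np.ndarray, token_ids: list) -> np.ndarray:
    # Fuse the two modalities into one vector, so the downstream model
    # sees text and image information jointly.
    return np.concatenate([embed_image(image), embed_text(token_ids)])

features = multimodal_features(np.ones((8, 8)), [3, 17, 42])
print(features.shape)  # (128,)
```

Real systems use far more sophisticated fusion (e.g. cross-attention between modalities), but the core benefit is the same: the model conditions on both inputs at once.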

What is interesting to me is that multi-modal training seems to lead to a very different result than what we see in humans. Unlike humans, Flamingo is not affected by the Stroop Test. If I had to guess, this is because humans don't use "frozen" models, which makes us more susceptible to interference. A more in-depth analysis of this difference and its possible consequences would be important.

Taken from here

The Model Freezing Conundrum

The results shared by the team have without a doubt been exciting. Flamingo is achieving things that 5 years ago would have been considered unreal. And this is quite literally just the beginning.

Think of how many use-cases just the examples in this image have.

However, the authors of this paper brought up 2 interesting points. To understand how they are related and why they matter, let's understand the concept of Model Freezing. Freezing a model simply stops new training data from changing its weights. This is used in domains like transfer learning, where we take a pretrained baseline model and finetune it for a specific case. This has had great results, and Flamingo is no exception.

If trained from scratch, the performance decreases by a large margin in both cases (−11.8% for the Vision Encoder and −10.2% for the LM), highlighting again the importance of pretraining. Interestingly, starting from our good initialization while also allowing unfreezing the weights also leads to a drop in performance (−3.9% when unfreezing the Vision Encoder and −5.5% when unfreezing the LM). This is an instance of “catastrophic forgetting” (McCloskey and Cohen, 1989), in which the model progressively forgets its pretraining while training on a new objective.
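Mechanically, freezing is simple: the frozen parameters are just excluded from the optimizer's update step, so only the new components learn. Here is a minimal NumPy sketch of that idea (a toy two-parameter "model", not DeepMind's training code):

```python
import numpy as np

# Toy model: a "pretrained" layer we freeze, and a new head we train.
params = {
    "pretrained_w": np.ones((4, 4)),  # frozen: keeps its pretrained values
    "head_w": np.zeros((4, 2)),       # trainable: fine-tuned on the new task
}
frozen = {"pretrained_w"}

def sgd_step(params: dict, grads: dict, lr: float = 0.1) -> dict:
    # Apply the gradient update only to parameters that are NOT frozen.
    # Frozen parameters are skipped entirely, so no forgetting can occur there.
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g
    return params

grads = {"pretrained_w": np.full((4, 4), 5.0), "head_w": np.full((4, 2), 1.0)}
sgd_step(params, grads)
print(params["pretrained_w"][0, 0])  # 1.0 -> unchanged, the layer is frozen
print(params["head_w"][0, 0])        # -0.1 -> updated by the step
```

This is exactly why catastrophic forgetting can't touch the frozen weights: their gradients are simply never applied.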

So freezing pretrained models boosts performance? Nothing shocking. However, on page 33 (no, that's not a typo) of their preprint, the authors bring up a pertinent point: “Our models build on powerful pretrained causal language models, and as a side effect, directly inherit their weaknesses”. This is obviously something that we need to investigate, especially as we get to biased datasets or other problems inherited from older models. Navigating this will be crucial.

On another note, while our ablations demonstrate the importance of the language model priors inherited from frozen language models, we suspect that they may play a role in occasional hallucinations and ungrounded guesses observed in open-ended dialogue settings

ML Models need an “I don’t Know”

The authors did a phenomenal job covering all the angles of their work. One of the things I really appreciate is how they covered Flamingo's performance under adversarial testing. Here we can see that Flamingo comes up with some interesting answers when asked purposefully misleading questions.

This is likely due to the fact that during training the model is forced to answer something

Diagnosing these failures will be important for integrating the solution into more sensitive use cases. The middle “hallucination” is likely the result of the model always assuming the prompts are truthful; adding a degree of “skepticism” might help address it. The other 2 are probably caused by the fact that ML models generally aren't given an “I don't know” option. This would need to be integrated. The current go-to, Model Confidence, is not good enough.
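For context, the go-to approach mentioned above usually looks something like this: compare the model's top predicted probability against a threshold and abstain below it. A minimal sketch (the threshold value here is arbitrary, purely for illustration):

```python
import numpy as np

def predict_or_abstain(probs, threshold: float = 0.8):
    # Answer only when the model's top class probability clears the
    # threshold; otherwise return an explicit "I don't know".
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return "I don't know"
    return int(probs.argmax())

print(predict_or_abstain([0.95, 0.03, 0.02]))  # 0
print(predict_or_abstain([0.40, 0.35, 0.25]))  # I don't know
```

The well-known weakness of this scheme is that large neural networks are frequently confidently wrong, so a high softmax score is not a trustworthy signal, which is why I said it is not good enough.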

In my pipelines, I typically use non-deterministic models and/or an ensemble of estimators to quantify how reliable a prediction is. This was inspired by the idea behind Label Dispersion, which I discuss here. However, the scale at which Big Tech ML is tested is very different, so I'm not sure how feasible this will be. If you have any thoughts on how to tackle this issue, share them in the comments below.
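The ensemble idea above can be sketched in a few lines: run several estimators, and treat disagreement among their votes as a reliability signal. This is a simplified illustration of the dispersion concept, not the exact metric from the Label Dispersion work:

```python
import numpy as np

def dispersion(predictions) -> float:
    # Fraction of ensemble members that disagree with the majority vote.
    # 0.0 = full agreement (reliable); higher values = less reliable.
    _, counts = np.unique(np.asarray(predictions), return_counts=True)
    majority = counts.max()
    return 1.0 - majority / len(predictions)

# Five hypothetical estimators voting on a class label for one input.
print(dispersion([1, 1, 1, 1, 1]))  # 0.0 -> trust the prediction
print(dispersion([1, 2, 1, 3, 2]))  # 0.6 -> flag for review / "I don't know"
```

High-dispersion inputs can then be routed to a human or answered with an explicit abstention, instead of relying on a single model's confidence score.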

Maybe I’m making too much of it. These results are insane.

Don’t take this the wrong way, though. The results achieved by the authors of Flamingo are nothing short of stunning. But everyone and their mom can tell you that, which is why I wanted to focus my article on these areas that I believe need more attention.

That’s it for this article. If you’re looking to get into ML, this article gives you a step-by-step plan to develop proficiency in Machine Learning. It uses FREE resources. Unlike the other boot camps/courses, this plan will help you develop your foundational skills and set yourself up for long-term success in the field.

Thank you all for the love.

For Machine Learning, a base in Software Engineering, Math, and Computer Science is crucial. It will help you conceptualize, build, and optimize your ML. My daily newsletter, Coding Interviews Made Simple, covers topics in Algorithm Design, Math, Recent Events in Tech, Software Engineering, and much more to make you a better developer. I am currently running a 20% discount for a WHOLE YEAR, so make sure to check it out.

I created Coding Interviews Made Simple using new techniques discovered through tutoring multiple people into top tech firms. The newsletter is designed to help you succeed, saving you from hours wasted on the Leetcode grind. I have a 100% satisfaction policy, so you can try it out at no risk to you. You can read the FAQs and find out more here

Feel free to reach out if you have any interesting jobs/projects/ideas for me as well. Always happy to hear you out.

For monetary support of my work following are my Venmo and Paypal. Any amount is appreciated and helps a lot. Donations unlock exclusive content such as paper analysis, special code, consultations, and specific coaching:

Venmo: https://account.venmo.com/u/FNU-Devansh

Paypal: paypal.me/ISeeThings

Reach out to me

Use the links below to check out my other content, learn more about tutoring, or just to say hi. Also, check out the free Robinhood referral link. We both get a free stock (you don’t have to put any money), and there is no risk to you. So not using it is just losing free money.

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

If you’re preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/

Get a free stock on Robinhood: https://join.robinhood.com/fnud75
