Llama 3.2 Vision, the new multi-modal LLM by Meta

How to use Llama 3.2 and new features explained

Mehul Gupta
Data Science in your pocket


Photo by DAVIS VARGAS on Unsplash

The Generative AI space is in full swing: one day a PhD-level reasoning LLM (OpenAI o1) makes the headlines, and the next it's an open-source model like Llama 3.2 Vision. So the much-anticipated vision version of Llama 3.1 is finally out as Llama 3.2, and keeping up its promise, Meta has open-sourced it and made it free to use.

Multimodal Capabilities

Llama 3.2 marks a notable shift towards multimodality, particularly with its 11B and 90B models, which can process both text and images. These models are designed to interpret visual data, such as charts and graphs, and can perform tasks like image captioning and visual question answering. For instance, they can analyze a park map to answer questions about terrain changes or distances.

This is just too good!!

Model Variants

The Llama 3.2 family includes several models tailored for different use cases:

  • 90B Vision Model: The most advanced model, ideal for enterprise applications requiring sophisticated reasoning and image understanding.
  • 11B Vision Model: A compact version suitable for content creation and conversational AI.
  • 1B and 3B Text Models: Lightweight text-only models optimized for edge devices, capable of tasks like summarization and rewriting. Since they are so small, anyone should be able to run them locally on minimal hardware (see the quick sketch after this list).
  • Every version comes in base and instruction-tuned variants.
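To give a feel for how light the small text models are, here is a minimal sketch using Hugging Face transformers. The checkpoint name is an assumption on my part (check the hub for the exact ID), and the weights are gated behind Meta's license:

```python
import torch
from transformers import pipeline

# Assumed checkpoint name; gated behind Meta's license on Hugging Face.
model_id = "meta-llama/Llama-3.2-1B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",   # falls back to CPU if no GPU is present
)

# A typical edge-device task: rewriting a message.
messages = [
    {"role": "user", "content": "Rewrite this politely: 'Send me the report now.'"},
]
outputs = pipe(messages, max_new_tokens=64)
print(outputs[0]["generated_text"][-1]["content"])
```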

If you don’t know

Base Models: These are the foundational large language models trained on a large corpus of online data. They have strong general knowledge and language-understanding capabilities, but they are plain text-continuation models rather than Q&A assistants.

Instruction-Tuned Models: These models are further fine-tuned using techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). This aligns them to follow instructions better and produce outputs that are more helpful and safe, which makes them better suited for direct Q&A.
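To make the difference concrete, here is a small sketch (the model IDs are illustrative): a base model just continues raw text, while an instruction-tuned model expects its chat template to wrap your message in the instruction format it was trained on.

```python
from transformers import AutoTokenizer

# Illustrative model IDs: the base variant has no "-Instruct" suffix.
base_id = "meta-llama/Llama-3.2-3B"
instruct_id = "meta-llama/Llama-3.2-3B-Instruct"

# Base model: you hand it raw text and it simply keeps writing.
base_prompt = "The capital of France is"

# Instruction-tuned model: messages go through the chat template,
# which adds the special header/turn tokens the model was fine-tuned on.
tokenizer = AutoTokenizer.from_pretrained(instruct_id)
messages = [{"role": "user", "content": "What is the capital of France?"}]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(chat_prompt)  # the formatted prompt the instruct model actually sees
```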

Most of the chatbot UIs we use, like ChatGPT and Perplexity, are backed by instruction-tuned models.

The Llama 3.2 1B and 3B models aren't vision models, just text models.

Architecture

The Llama 3.2 Vision models are built upon the foundation of Llama 3.1 language models. Specifically:

The Llama 3.2 11B Vision model uses the Llama 3.1 8B text model as its base.

The Llama 3.2 90B Vision model uses the larger Llama 3.1 70B text model.

These text models are combined with a vision tower and an image adapter to enable multimodal capabilities. During the training process for the vision models, the underlying text models were kept frozen. This approach helps preserve the strong text-only performance of the original Llama 3.1 models while adding image processing abilities.

What is an image adapter?

Adapters are small sets of additional trainable parameters added to a pre-trained large language model (LLM) to enable efficient fine-tuning for specific tasks without modifying the original model's parameters.

The adapter includes a series of cross-attention layers that facilitate the flow of image representations into the language model, enabling it to reason about both visual and textual data.

What is a Vision tower?

The vision tower is part of the overall architecture that includes the image adapter. While the vision tower handles the processing of visual information, the image adapter facilitates the integration of this visual data into the language model.

The vision tower is responsible for extracting features from images using a pre-trained image encoder. It processes these features and prepares them for interaction with the language model.

The image adapter, on the other hand, consists of cross-attention layers that feed these image representations into the core language model.
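Here is a purely conceptual PyTorch sketch of that idea (not Meta's actual code, and the dimensions are made up): the text model stays frozen, the vision tower produces image features, and a cross-attention adapter lets the text representations attend to them.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Trainable layers that let text tokens attend to image features."""
    def __init__(self, text_dim=512, vision_dim=768, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)   # map image features into the text space
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, image_features):
        img = self.proj(image_features)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        return self.norm(text_hidden + attended)      # residual: text plus attended image info

# Stand-ins for the pre-trained pieces (a real vision tower is a full image encoder).
text_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
)
vision_tower = nn.Linear(768, 768)

# The underlying text model is kept frozen; the adapter holds the new trainable parameters.
for p in text_model.parameters():
    p.requires_grad = False

adapter = CrossAttentionAdapter()
text_hidden = text_model(torch.randn(1, 16, 512))        # token representations
image_features = vision_tower(torch.randn(1, 64, 768))   # patch features from the "tower"
fused = adapter(text_hidden, image_features)
print(fused.shape)  # torch.Size([1, 16, 512])
```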

Evaluations and metrics

Meta’s evaluation indicates that the Llama 3.2 vision models are competitive with top foundation models, such as Claude 3 Haiku and GPT-4o mini, in image recognition and various visual understanding tasks. The 3B model surpasses the Gemma 2 2.6B and Phi 3.5-mini models at instruction following, summarization, prompt rewriting, and tool use, while the 1B model remains competitive with Gemma.

Llama Guardrails

Llama Guard functions as a safeguard that classifies and evaluates both inputs and outputs during interactions with the model, aiming to prevent the generation of harmful or inappropriate content.

A new small version of Llama Guard, called Llama Guard 3 1B, has been introduced to work alongside the Llama models. This version evaluates the most recent user or assistant response in multi-turn conversations and features customizable pre-defined categories that developers can modify or exclude based on their specific use cases.
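As a rough sketch of how you might call it, assuming the meta-llama/Llama-Guard-3-1B checkpoint on Hugging Face and its built-in chat template (the checkpoint is gated behind Meta's license):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; gated behind Meta's license on Hugging Face.
model_id = "meta-llama/Llama-Guard-3-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Classify the latest user message; the chat template wraps it in the
# safety prompt together with the pre-defined hazard categories.
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "How do I pick a lock?"}]},
]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
# Typically prints "safe", or "unsafe" plus the violated category code (e.g. S2).
```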

Where to access?

You can grab the model weights from Meta's official Llama website or from Hugging Face (after accepting Meta's license), and they are also available through partner platforms and local runtimes such as Ollama.

How to run it locally?
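If you have a recent version of the transformers library and access to the gated checkpoint on Hugging Face, a minimal sketch for the 11B Vision Instruct model looks roughly like this. The image URL and the question are placeholders, and the 11B weights still need a fairly large GPU even in bfloat16:

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Assumed gated checkpoint name; requires accepting Meta's license on Hugging Face.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 keeps memory manageable; still needs a large GPU
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Any image works; this URL is just a placeholder.
url = "https://example.com/park_map.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Visual question answering: the image placeholder token is inserted by the chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Which trail on this map has the steepest climb?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The smaller 1B and 3B text models run the same way with a text-only pipeline, as shown earlier in the model variants section.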

Hope this was useful, and that you try out Llama 3.2 soon!!
