Introducing LLaVA: The Fusion of Visual and Linguistic Intelligence in AI, with Code

azhar
azhar labs
6 min read · Jan 27, 2024


In the rapidly evolving landscape of artificial intelligence (AI), a groundbreaking development is reshaping our interaction with technology. The LLaVA project, a fusion of Large Language Models (LLMs) like GPT-4 and vision encoders like CLIP, is pioneering a new era of AI assistants that understand and act upon both visual and language instructions. This article delves into the innovative approach of LLaVA, its capabilities, and its potential impact on various domains.

Before we proceed, let’s stay connected! Please consider following me on Medium, and don’t forget to connect with me on LinkedIn for a regular dose of data science and deep learning insights. 🚀📊🤖

LLaVA

LLaVA stands for Large Language and Vision Assistant, a cutting-edge AI model designed to integrate the capabilities of language understanding and visual perception. This integration addresses a significant gap in current AI technology, where models are often limited to either visual or textual comprehension.

The Vision-Language Gap in AI

Traditional AI models often operate either on language or visual inputs but rarely intertwine the two effectively. Vision models excel in interpreting images but lack the nuanced understanding of language, while LLMs like GPT-4 are adept at language processing but don’t “see” images. LLaVA aims to bridge this gap, creating an AI assistant that comprehends and interacts through both modalities.

The LLaVA project combines the linguistic prowess of an open LLM (Vicuna, built on LLaMA) with the visual acuity of CLIP’s vision encoder, and leans on GPT-4 to generate its instruction-tuning data. This synergy allows LLaVA to understand complex instructions involving both text and images, significantly expanding the range of tasks it can perform.

The Mechanics of LLaVA

Visual Instruction Tuning

A core innovation in LLaVA is “visual instruction-tuning,” which translates image-text pairs into actionable data for the AI. This process involves generating instruction-following data from images, enriched by GPT-4’s language capabilities.

GPT-Assisted Data Generation

LLaVA leverages GPT-4 to create diverse and context-rich instruction sets from images. This approach not only enhances data quality but also infuses the model with deep, nuanced understanding.
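
For a concrete sense of what this data looks like, here is a sketch of a single instruction-following sample in the style of the released LLaVA training JSON; the field names and paths are illustrative rather than authoritative:

example_sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        # The human turn contains the image placeholder plus a GPT-4-generated question.
        {"from": "human", "value": "<image>\nWhat is the person in the photo doing?"},
        # The assistant turn is the GPT-4-generated answer the model learns to reproduce.
        {"from": "gpt", "value": "The person is sitting on a bench and bottle-feeding a baby goat."},
    ],
}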

Integrating Vision and Language

At the heart of LLaVA is the integration of visual features from CLIP with the token embeddings of the underlying LLM (Vicuna in the released checkpoints). This integration involves a projection matrix, replaced by a small MLP in later versions, that maps visual features into the language embedding space, allowing the model to process and understand multimodal inputs seamlessly.
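
Conceptually, this projection is just a learned map from CLIP’s image-feature space into the LLM’s token-embedding space. The following minimal PyTorch sketch illustrates the idea; the dimensions are typical for CLIP ViT-L/14 and a 13B LLaMA-family model, but they are stated here as assumptions, not read from the LLaVA code:

import torch
import torch.nn as nn

# Assumed dimensions: CLIP ViT-L/14 patch features are 1024-d,
# a 13B LLaMA-style LLM uses 5120-d token embeddings.
visual_dim, llm_dim = 1024, 5120

# The original LLaVA used a single linear projection; LLaVA-1.5 swaps in a small MLP.
projector = nn.Linear(visual_dim, llm_dim)

# Stand-in for CLIP output on one image: 256 patch tokens of size 1024.
clip_patch_features = torch.randn(1, 256, visual_dim)

# After projection, the image becomes a sequence of "visual tokens" that can be
# interleaved with ordinary text-token embeddings before being fed to the LLM.
visual_tokens = projector(clip_patch_features)
print(visual_tokens.shape)  # torch.Size([1, 256, 5120])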

Training LLaVA

LLaVA undergoes a two-stage training process. Initially, it focuses on aligning features between the vision and language models. Subsequently, it undergoes fine-tuning for specific applications like visual chat and science question-answering, leveraging multimodal data.
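
In code terms, the two stages differ mainly in which parameters are trainable: stage one freezes the vision encoder and the LLM and updates only the projection, while stage two also unfreezes the LLM for instruction tuning. The sketch below uses illustrative placeholder modules, not the actual LLaVA training code:

import torch.nn as nn

# Placeholder stand-ins for the three components; in the real codebase these are
# the CLIP tower, the visual projector, and the LLaMA/Vicuna language model.
vision_encoder, projector, llm = nn.Identity(), nn.Linear(1024, 5120), nn.Identity()

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature alignment. Only the projector is trained, on image-caption pairs.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)
# ... run next-token-prediction training on image/caption data ...

# Stage 2: visual instruction tuning. The LLM is unfrozen as well (the vision
# encoder typically stays frozen) and training continues on instruction data.
set_trainable(llm, True)
# ... continue training on the GPT-generated instruction-following conversations ...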

Expanding Horizons: Features and Upgrades

LLaVA isn’t static; it’s continuously evolving with new features:

  1. Quantization Support: Enhances performance and efficiency.
  2. Reinforcement Learning from Human Feedback (RLHF): Improves factual accuracy.
  3. Higher Resolution Support: Allows processing of detailed visual information.
  4. Benchmarking Tools: Provides platforms for performance evaluation.

The following walkthrough examines a Python script built around LLaVA (Large Language and Vision Assistant). As discussed above, the model interprets and responds to both textual and visual inputs, and the code shows what that looks like in practice: loading an image, pairing it with a prompt, and generating answers that draw on the visual content as well as the surrounding text. This kind of multimodal handling matters wherever an AI’s understanding has to go beyond text alone.

Let’s examine the key components and functionalities of this Python code to better understand how LLaVA integrates visual and language processing.

Import Statements

import textwrap
from io import BytesIO

import requests
import torch
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import SeparatorStyle, conv_templates
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    process_images,
    tokenizer_image_token,
)
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from PIL import Image

The code begins by importing the necessary modules. textwrap is used for formatting output text, BytesIO for handling byte streams, requests for making HTTP requests, and PIL.Image for image handling. PyTorch (torch) is the primary library for running the model, while the llava package supplies the pretrained-model loader, the conversation templates, and the helpers for processing images and tokenizing prompts that contain image tokens.

Initializing and Loading the Model

disable_torch_init()

MODEL = "4bit/llava-v1.5-13b-3GB"
model_name = get_model_name_from_path(MODEL)

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=MODEL, model_base=None, model_name=model_name, load_4bit=True
)

Here, disable_torch_init() skips PyTorch’s default weight initialization, which speeds up loading since the weights are immediately replaced by the pretrained checkpoint. The tokenizer, model, and image processor are then loaded from 4bit/llava-v1.5-13b-3GB, a 4-bit quantized build of LLaVA-1.5 13B, with load_4bit=True keeping memory usage low.

Image Loading and Processing

def load_image(image_file):
    if image_file.startswith("http://") or image_file.startswith("https://"):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image


def process_image(image):
    args = {"image_aspect_ratio": "pad"}
    image_tensor = process_images([image], image_processor, args)
    return image_tensor.to(model.device, dtype=torch.float16)

These functions handle the loading and processing of images. load_image fetches an image from a URL or file path and converts it to RGB format. process_image processes the loaded image into a tensor format suitable for input into the LLaVA model.

image = load_image("feeding-to-goat.jpeg")
processed_image = process_image(image)
[Image: feeding-to-goat.jpeg]

Creating a Prompt

CONV_MODE = "llava_v0"

prompt, _ = create_prompt("Describe the image")
print(prompt)

>>> """
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>
Describe the image###Assistant:
"""

create_prompt prepares the input for the model by prepending the default image token to the question and wrapping the result in the conversation template selected by CONV_MODE.
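
The create_prompt helper itself is not defined in the original listing, although both this snippet and ask_image below depend on it. A minimal version consistent with the printed prompt, built on the conv_templates imported earlier, might look like this:

def create_prompt(prompt: str):
    # Prepend the image placeholder so the model knows where the visual tokens go,
    # then wrap the text in the conversation template chosen by CONV_MODE.
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt
    conv = conv_templates[CONV_MODE].copy()
    conv.append_message(conv.roles[0], prompt)
    conv.append_message(conv.roles[1], None)  # placeholder for the assistant's reply
    return conv.get_prompt(), conv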

The ask_image Function

def ask_image(image: Image, prompt: str):
    # Turn the PIL image into a model-ready tensor and build the chat prompt.
    image_tensor = process_image(image)
    prompt, conv = create_prompt(prompt)
    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .to(model.device)
    )

    # Stop generating once the conversation separator is produced.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria(
        keywords=[stop_str], tokenizer=tokenizer, input_ids=input_ids
    )

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.01,
            max_new_tokens=512,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )
    # Decode only the newly generated tokens (everything after the prompt).
    return tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=True
    ).strip()

This is the core function. It takes an image and a text prompt, processes them, and generates a response using the LLaVA model. The function converts the prompt into tokens, processes the image, and then feeds both into the model to generate a response. The output is decoded into human-readable text.

The code demonstrates how to use the ask_image function with various prompts. For each prompt, the model generates a response based on the combined understanding of the text and the image.

Describing an Image: When asked to describe an image, the model generates a detailed description based on the visual content.

result = ask_image(image, "Describe the image")
print(textwrap.fill(result, width=110))

"""
>>> The image features a man sitting on a bench, holding a baby goat in his arms. The man is feeding the baby goat
with a bottle, providing nourishment and care. The scene appears to be set in a comfortable environment,
possibly a home or a cozy outdoor space. The man's attention is focused on the baby goat, ensuring its well-
being and growth.
"""

Answering Specific Questions: The model can respond to more direct questions about the image, such as whether the man in the image is handsome.

result = ask_image(image, "Is the man handsome?")
""">>>Yes, the man is described as handsome in the image."""

Interpreting Charts: The model can also read data out of charts, as in the Bitcoin price example below (the snippet assumes a chart image has been loaded into image in place of the goat photo).

result = ask_image(
image,
"This is a chart of Bitcoin price. What is the current price according to the chart?",
)
print(textwrap.fill(result, width=110))
"""
>>> The current price of Bitcoin according to the chart is $23,000.
"""


Bridging Domains

The adaptability of LLaVA extends to various specialized fields. LLaVA-Med, for instance, is a variant tuned for biomedical applications. This flexibility opens up possibilities for AI assistants tailored to specific industries, from healthcare to legal analysis.

The practical applications of LLaVA are vast and varied. From providing rich, context-aware descriptions of images to interpreting complex visual data and responding to nuanced queries, LLaVA’s capabilities hint at a future where AI can assist in a wide range of domains, including but not limited to healthcare, education, autonomous vehicles, and customer service.

In conclusion, the integration of large language models with visual encoders, exemplified by LLaVA, marks a pivotal moment in the journey towards creating more intuitive, versatile, and powerful AI assistants. As this technology continues to evolve and integrate more deeply with our daily lives, it paves the way for a future where AI is not just a tool, but a collaborative partner in our quest to understand and interact with the world around us.
