Unlocking Visual Narratives: A Deep Dive into LLaVA’s Image Captioning with AI

Tony Esposito
3 min read · Nov 25, 2023

Introduction

In the rapidly evolving landscape of Generative AI, novel approaches to image processing and language modeling are continuously emerging. One such development is LLaVA (Large Language and Vision Assistant), a model that couples a vision encoder with a large language model. In this article, we'll walk through how to use LLaVA for image captioning, a task that combines visual perception with natural language generation.

Setting the Stage

Our journey begins with setting up the LLaVA environment. We clone the v1.0 branch of the LLaVA repository and install the necessary dependencies, including Gradio, a handy tool for building quick demos of ML models.

%cd /content
!git clone -b v1.0 https://github.com/camenduru/LLaVA
%cd /content/LLaVA
!pip install -q gradio .

Loading the Model

The heart of this operation lies in loading the LLaVA model. We use a 13-billion-parameter checkpoint loaded with 4-bit quantization, which shrinks the weights to roughly a quarter of their fp16 size (on the order of 7 GB instead of roughly 26 GB) with minimal impact on output quality, making the model practical to run on a single GPU.

from transformers import AutoTokenizer, BitsAndBytesConfig
from llava.model import LlavaLlamaForCausalLM
import torch

model_path = "4bit/llava-v1.5-13b-3GB"
...
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
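The elided kwargs carry the 4-bit settings. For context, this is the same bitsandbytes configuration used in the complete script later in this post:

kwargs = {
    "device_map": "auto",          # spread layers across available devices automatically
    "load_in_4bit": True,
    "quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,   # run compute in fp16
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
        bnb_4bit_quant_type='nf4'               # 4-bit NormalFloat weights
    )
}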

Integrating Vision Capabilities

The LLaVA model isn't just about text; it integrates a vision tower, a CLIP-based image encoder, for processing images. This integration allows the model to perceive and interpret visual data, an essential step for image captioning.

...
vision_tower = model.get_vision_tower()
...
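The elided lines match the setup in the complete script below: the vision tower is loaded on demand, moved to the GPU, and exposes the image processor that later turns images into model inputs.

vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
    vision_tower.load_model()        # load the CLIP vision weights if they aren't already
vision_tower.to(device='cuda')       # run the image encoder on the GPU
image_processor = vision_tower.image_processor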

Captioning Images

The crux of our exploration is the caption_image function. It generates a caption for an image that is either fetched from a URL or loaded from a local file, showcasing the model's ability to understand and describe visual content in a contextually relevant manner.

def caption_image(image_file, prompt):
    ...
    return image, output
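The body of caption_image isn't shown in this post, so here is a minimal sketch of what such a function typically looks like for a LLaVA v1.5 checkpoint, reusing the model, tokenizer, and image_processor created above. The conversation template name ("llava_v1"), the sampling settings, and the URL-versus-file handling are assumptions chosen as reasonable defaults, not the only valid choices.

# Minimal sketch of caption_image; reuses model, tokenizer, and image_processor from above.
import requests
import torch
from io import BytesIO
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria

def caption_image(image_file, prompt):
    # Load the image from a URL or a local path
    if image_file.startswith('http://') or image_file.startswith('https://'):
        image = Image.open(BytesIO(requests.get(image_file).content)).convert('RGB')
    else:
        image = Image.open(image_file).convert('RGB')

    # Preprocess the image into the tensor the vision tower expects
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()

    # Build a single-turn conversation with the image placeholder token ("llava_v1" is assumed)
    conv = conv_templates["llava_v1"].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + '\n' + prompt)
    conv.append_message(conv.roles[1], None)
    input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors='pt').unsqueeze(0).cuda()

    # Stop generating when the template's separator appears
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensor, do_sample=True,
                                    temperature=0.2, max_new_tokens=512, use_cache=True,
                                    stopping_criteria=[stopping])

    # In the v1.0 branch, generate returns the prompt tokens too, so decode only the new ones
    output = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    return image, output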

Demonstrating the Model’s Prowess

To illustrate the model's capabilities, we caption a sample image with a prompt asking it to describe the scene and its colors.

image, output = caption_image('https://llava-vl.github.io/static/images/view.jpg', 'Describe the image and color details.')
print(output)

Complete Code Walkthrough

For those eager to dive straight into the code, here's the full script, from setting up the environment to generating image captions with LLaVA (the body of caption_image is left elided, as in the snippet above). This script is a testament to the power of integrating vision and language models in a single framework.

# Setting up the environment
%cd /content
!git clone -b v1.0 https://github.com/camenduru/LLaVA
%cd /content/LLaVA
!pip install -q gradio .

# Importing necessary libraries
from transformers import AutoTokenizer, BitsAndBytesConfig
from llava.model import LlavaLlamaForCausalLM
import torch

# Model configuration and initialization
model_path = "4bit/llava-v1.5-13b-3GB"
kwargs = {
    "device_map": "auto",
    "load_in_4bit": True,
    "quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4'
    )
}
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# Setting up the vision tower
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
    vision_tower.load_model()
vision_tower.to(device='cuda')
image_processor = vision_tower.image_processor

# Additional imports and configurations
import os
import requests
from PIL import Image
from io import BytesIO
from llava.conversation import conv_templates, SeparatorStyle
from llava.utils import disable_torch_init
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from transformers import TextStreamer

# Function to caption images
def caption_image(image_file, prompt):
    ...
    return image, output

# Using the function to caption an image
image, output = caption_image('https://llava-vl.github.io/static/images/view.jpg', 'Describe the image and color details.')
print(output)

Conclusion

LLaVA represents a significant step in Generative AI, where the boundaries between visual and textual understanding are increasingly blurred. This code walkthrough offers a glimpse into how such advanced models can be harnessed for practical and innovative applications in image captioning.


Tony Esposito

Generative AI SME, Conversational AI systems specialist