Vertical Scaling in Video Annotation with Large Language Models: A Journey with GSoC’24 @ Red Hen Labs

Research is All You Need | GSoC’24 | Progress So Far

Manish Kumar Thota
Jun 20, 2024

The realm of video processing remains an active area of research because of the inherent complexity of computing multiple modalities simultaneously, which is what makes it truly multimodal. This blog delves into the technical aspects of our ongoing project under Google Summer of Code (GSoC) 2024 and Red Hen Labs, focusing on annotating videos containing temporal and grounding information using Vision-Language models.

Problem Statement

Our goal is to annotate videos that contain both temporal and grounding information. Instead of building new models from scratch, we aim to scale vertically by leveraging existing Vision-Language models. Inspired by the work of Sam Altman and his team at OpenAI, who build great products around large language models, we recognize how rapidly open-source models are improving. Consequently, we believe that horizontal growth, i.e., developing new LLMs from scratch, may soon reach saturation. Instead, we should focus on building robust products around these models, scaling vertically by fine-tuning them for specific applications such as medical, legal, and recommendation systems.

Full video where Sam mentions scaling and the next era of AI.

Motivation

At Red Hen Labs, through Google Summer of Code, I am contributing to vertical growth by developing an annotation product for the video space using large language models. This approach ensures that we build effective, domain-specific applications rather than generic models.

Importance of Structuring Models

We cannot always use models out of the box; hence, we must structure them well to achieve the desired outputs. Following my mentors' recommendations, my first step is to test the capabilities of Video Large Language Models by annotating the following four key entities, among many others:

  1. Screen Interaction: Determine if the subject in the video is interacting with a screen in the background.
  2. Hands-Free: Check if the subject’s hands are free or if they are holding anything.
  3. Indoors: Identify whether the subject is indoors or outdoors.
  4. Standing: Observe if the subject is sitting or standing.

The Journey Ahead

We are in an era where new open-source models emerge monthly and improve continuously. This progress makes it worthwhile to focus on developing great products around these models, which involves vertical scaling, such as fine-tuning models for specific domains. This approach not only optimizes the use of existing models but also accelerates the development of practical and effective solutions.

Dataset Preview

Here is a glimpse of the news dataset that we will be annotating, showcasing the real-world application of our annotation models.

By focusing on these areas, we aim to push the boundaries of what is possible with Vision-Language models, contributing to the future of multimodal video processing.

Video Frames and Key Entities

All of the video frames we analyzed are sourced from news segments, each lasting approximately 4–5 seconds. To accurately capture the main key entities with these models, I experimented extensively with prompt engineering, trying multiple prompt variations and different models. The most effective prompt, which yielded outstanding results, is shown below.

The Golden Prompt

For each question, analyze the given video carefully and base your answers on the observations made.

Examine the subject’s right and left hands in the video to check if they are holding anything like a microphone, book, paper(White color), object, or any electronic device, try segmentations and decide if the hands are free or not.

Evaluate the subject’s body posture and movement within the video. Are they standing upright with both feet planted firmly on the ground? If so, they are standing. If they seem to be seated, they are seated.

Assess the surroundings behind the subject in the video. Do they seem to interact with any visible screens, such as laptops, TVs, or digital billboards? If yes, then they are interacting with a screen. If not, they are not interacting with a screen.

Consider the broader environmental context shown in the video’s background. Are there signs of an open-air space, like greenery, structures, or people passing by? If so, it’s an outdoor setting. If the setting looks confined with furniture, walls, or home decorations, it’s an indoor environment.

By taking these factors into account when watching the video, please answer the questions accurately.

Selecting the Best Model During the Coding Period (Research is Key)

During the coding period, extensive research was conducted to identify the best model for video processing. Among the latest options, the best model turned out to be Chat-UniVi; kudos to the team behind it. Its processing and description of video content were by far the best we observed.

Below is a comparison table of the various models we researched. This table includes a brief description and a link to each model’s repository for further exploration.

Outstanding Features of Chat-UniVi

Chat-UniVi excels in processing and describing video content by capturing the spatial details necessary for images and the comprehensive temporal relationships required for videos. This model’s ability to handle both aspects makes it a standout choice for video processing tasks.

Sample Input Video

Chat-UniVi Generated Output — Step 1

The woman in the video is standing and holding a microphone. She is standing in front of a bus and a news reporting set. The woman is not interacting with any visible screens, and there are no signs of greenery or people passing by in the background. Therefore, the setting is an Outdoor environment.

Structured Pydantic Output — Step 2

{
    "screen_interaction_yes": 0,
    "hands_free": 0,
    "indoors": 0,
    "standing": 1
}

Video Processing Explanation

Below are the key snippets and explanations for how the video is processed using the Chat-UniVi model, our golden model.

Reading and Sampling Video Frames:

  • The VideoReader from the decord library is used to read the video efficiently.
  • Frames are uniformly sampled based on the specified frame rate and maximum number of frames.
from decord import VideoReader, cpu
from PIL import Image
import torch
import numpy as np

def _get_rawvideo_dec(video_path, image_processor, max_frames=100, video_framerate=1):
    # Decode the video on the CPU
    vreader = VideoReader(video_path, ctx=cpu(0))
    fps = vreader.get_avg_fps()
    num_frames = len(vreader)

    # Uniformly sample frame indices at the requested frame rate, capped at max_frames
    t_stride = int(round(fps / video_framerate))
    all_pos = list(range(0, num_frames, t_stride))
    sample_pos = all_pos[:max_frames]

    # Convert the sampled frames to PIL images and preprocess them into model-ready tensors
    patch_images = [Image.fromarray(f) for f in vreader.get_batch(sample_pos).asnumpy()]
    patch_images = torch.stack(
        [image_processor.preprocess(img, return_tensors='pt')['pixel_values'][0] for img in patch_images]
    )
    return patch_images, len(patch_images)
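
For context, here is a minimal, hypothetical usage sketch of the function above. It assumes a CLIP-style image processor from Hugging Face Transformers (Chat-UniVi builds on a CLIP-style visual encoder) and a placeholder video path; your processor checkpoint and paths will differ.

# Hypothetical usage sketch; the processor checkpoint and video path are placeholders.
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
frames, num_frames = _get_rawvideo_dec("sample_news_clip.mp4", image_processor,
                                       max_frames=100, video_framerate=1)
print(frames.shape)  # roughly (num_frames, 3, H, W), e.g. 5 frames for a 5-second clip at 1 fps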

Preprocessing Frames:

  • Each frame is preprocessed and converted to tensors suitable for model input.
patch_images = torch.stack([image_processor.preprocess(img, return_tensors='pt')['pixel_values'][0] for img in patch_images])


After thorough research and finalizing our golden model with robust grounding and temporal capabilities, we reached the stage of extracting structured outputs from the model. Our ultimate goal was to obtain precise annotations from the dataset.

There are various ways to refine and structure your LLM outputs, and one of the most effective methods is using Pydantic for structured outputs from LLMs. In the following sections, we will walk you through the code and steps necessary to replicate this approach for your use case. This method provides you with complete control over the outputs generated by the LLM, ensuring consistency and reliability in your data processing tasks.

Addressing the Challenge of Structured Outputs with Pydantic Using Open-Source LLMs

In the current landscape of AI, large language models (LLMs) are extensively integrated with existing software systems to enhance their capabilities. These software systems often rely on structured outputs, such as JSON from web API requests, to function effectively. While LLMs generate impressive and contextually rich responses, their inherent variability poses a challenge when consistent and structured outputs are required, especially outside the realm of chatbots.

The Challenge of Variability in LLM Responses

LLMs are known for their ability to generate contextually appropriate and detailed responses. However, this strength can also be a weakness when the task at hand requires consistent and structured outputs. The variability in responses can be significant even when the same prompt is provided multiple times. This inconsistency makes it difficult to process and utilize the outputs in a predictable and reliable manner. For many applications, a stable and structured output is essential for downstream processes and integrations.

The Role of Structured Outputs

Structured outputs are critical for the smooth functioning of many software systems. They ensure that the data can be easily parsed, validated, and integrated into existing workflows. In the absence of structured outputs, developers may face challenges in ensuring data consistency and reliability, leading to potential errors and inefficiencies in the system.
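
To make this concrete, here is a minimal sketch of the idea, using a hypothetical Annotation model with the four fields we care about. Pydantic validates the parsed JSON against the declared types, so malformed or inconsistent LLM output is caught instead of silently propagating downstream.

# Minimal sketch: validate an LLM's JSON output against a hypothetical Annotation model.
import json
from pydantic import BaseModel, ValidationError

class Annotation(BaseModel):
    screen_interaction_yes: int
    hands_free: int
    indoors: int
    standing: int

raw = '{"screen_interaction_yes": 0, "hands_free": 0, "indoors": 0, "standing": 1}'
try:
    record = Annotation(**json.loads(raw))
except (json.JSONDecodeError, ValidationError):
    record = None  # malformed output: retry the prompt or fall back to a default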

Code Explanation

We will be using the Phi-3-mini-128k-instruct model from Microsoft. This model offers a straightforward plug-and-play setup, meaning it can be easily replaced with other text generation open-source models such as LLaMA-3, Mistral-7b, Qwen-2, and others.

The following code is divided into sections for better clarity and understanding:

Part 1: Importing Necessary Libraries

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from pydantic import BaseModel
import json
import warnings
from fastapi import FastAPI
from typing import Dict

# Ignore warnings
warnings.filterwarnings(action='ignore')

# Set random seed
torch.random.manual_seed(0)
  • Import necessary libraries including torch, transformers, pydantic, json, warnings, fastapi, and typing.
  • Ignore warnings to keep the output clean.
  • Set a random seed for reproducibility.

Part 2: ModelLoader Class

class ModelLoader:
    # Cached singleton instances shared across calls
    _model = None
    _tokenizer = None
    _pipe = None

    @classmethod
    def load_model(cls, model_path):
        # Load the model and tokenizer only once, then reuse them
        if cls._model is None or cls._tokenizer is None:
            cls._model = AutoModelForCausalLM.from_pretrained(
                model_path,
                device_map="cuda",
                torch_dtype="auto",
                trust_remote_code=True,
            )
            cls._tokenizer = AutoTokenizer.from_pretrained(model_path)
            cls._pipe = pipeline(
                "text-generation",
                model=cls._model,
                tokenizer=cls._tokenizer,
            )
        return cls._pipe
  • ModelLoader is a singleton class that loads the LLM model and tokenizer if they are not already loaded.
  • The load_model method initializes the model, tokenizer, and pipeline for text generation.
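
As a quick, hypothetical check of the singleton behavior (assuming the Hugging Face model id mentioned above), repeated calls return the same cached pipeline instead of reloading the weights:

# Hypothetical check: both calls return the same cached pipeline object.
pipe_a = ModelLoader.load_model("microsoft/Phi-3-mini-128k-instruct")
pipe_b = ModelLoader.load_model("microsoft/Phi-3-mini-128k-instruct")
assert pipe_a is pipe_b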

Part 3: Generation Arguments

generation_args = {
    "max_new_tokens": 50,
    "return_full_text": False,
    "temperature": 0.1,
    "do_sample": True
}
  • Define the generation arguments for the LLM, including the maximum number of new tokens, whether to return the full text, the temperature for sampling, and whether to enable sampling.
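
As a quick sanity check (hypothetical; it assumes the pipeline has already been loaded as in Part 6), you can pass a chat-style message list together with these arguments. With the temperature set to 0.1, sampled outputs stay close to deterministic, which is what we want for structured annotation.

# Hypothetical smoke test of the generation arguments with the loaded pipeline.
messages = [{"role": "user", "content": "Reply with the single word OK."}]
print(pipe(messages, **generation_args)[0]["generated_text"])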

Part 4: LLMHelper Class

class LLMHelper:
    def __init__(self, pipeline):
        self.chatbot = pipeline

    def generate_logic(self, llm_output: str):
        prompt = f"""
        Provide the response in json string for the below keys and context based on the description: '{llm_output}'.

        Screen.interaction_yes: This field indicates whether there was an interaction of the person with a screen during the activity. A value of 1 means there was screen interaction (Yes), and a value of 0 means there was no screen interaction (No).
        Hands.free: This field indicates whether the person's hands were free during the activity. A value of 1 means the person was not holding anything (Yes), indicating free hands. A value of 0 means the person was holding something (No), indicating the hands were not free.
        Indoors: This field indicates whether the activity took place indoors. A value of 1 means the activity occurred inside a building or enclosed space (Yes), and a value of 0 means the activity took place outside (No).
        Standing: This field indicates whether the person was standing during the activity. A value of 1 means the person was standing (Yes), and a value of 0 means the person was not standing (No).
        """

        messages = [
            {"role": "system", "content": "Please answer questions just based on this information: " + llm_output},
            {"role": "user", "content": prompt},
        ]

        response = self.chatbot(messages, **generation_args)
        generated_text = response[0]['generated_text']
        # Extract JSON from the generated text
        start_index = generated_text.find('{')
        end_index = generated_text.rfind('}') + 1
        json_str = generated_text[start_index:end_index]
        return json_str
  • LLMHelper class is initialized with the text generation pipeline.
  • The generate_logic method creates a prompt to generate a structured JSON response based on the provided LLM output and the predefined keys (screen interaction, hands-free, indoors, standing).
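
For illustration, a hypothetical call (assuming llm_helper has been created as in Part 6) could reuse the Chat-UniVi description shown earlier:

# Hypothetical call, reusing the Chat-UniVi description from Step 1.
description = ("The woman in the video is standing and holding a microphone. "
               "She is standing in front of a bus and a news reporting set.")
json_str = llm_helper.generate_logic(description)
print(json_str)  # expected: a JSON string with the four keys defined in the prompt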

Part 5: VideoAnalysis Model

class VideoAnalysis(BaseModel):
    screen_interaction_yes: int
    hands_free: int
    indoors: int
    standing: int

    @classmethod
    def from_llm_output(cls, llm_output: str, generated_logic: str) -> 'VideoAnalysis':
        # Parse the generated logic (assuming it's a JSON string)
        logic_dict = json.loads(generated_logic)

        return cls(
            screen_interaction_yes=logic_dict.get("Screen.interaction_yes", 0),
            hands_free=logic_dict.get("Hands.free", 0),
            indoors=logic_dict.get("Indoors", 0),
            standing=logic_dict.get("Standing", 0)
        )
  • VideoAnalysis is a Pydantic model that defines the structure of the output with the specified keys.
  • The from_llm_output method parses the generated JSON logic and initializes the VideoAnalysis model with the corresponding values.
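
A small, hypothetical example of the parsing step, with a hand-written logic string standing in for real LLM output:

# Hypothetical example: parse a hand-written logic string into the Pydantic model.
generated = '{"Screen.interaction_yes": 0, "Hands.free": 0, "Indoors": 0, "Standing": 1}'
analysis = VideoAnalysis.from_llm_output("description text", generated)
print(analysis.dict())  # {'screen_interaction_yes': 0, 'hands_free': 0, 'indoors': 0, 'standing': 1}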

Part 6: Loading the Model and Initializing FastAPI

# Define the model path
model_path = "MODEL_PATH"

# Load the model and pipeline
pipe = ModelLoader.load_model(model_path)
llm_helper = LLMHelper(pipe)

# Initialize FastAPI
SLLM_Output_app = FastAPI()

class LLMInput(BaseModel):
    llm_output: str

@SLLM_Output_app.post("/process_llm_output/")
def process_llm_output(input: LLMInput) -> Dict:
    # Generate the logic from the LLM output
    generated_logic = llm_helper.generate_logic(input.llm_output)

    # Create the structured output
    structured_output = VideoAnalysis.from_llm_output(input.llm_output, generated_logic)

    # Return the structured output as a dictionary
    return structured_output.dict()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(SLLM_Output_app, host="0.0.0.0", port=8000)
  • Define the model path and load the model using ModelLoader.
  • Initialize the LLMHelper with the loaded pipeline.
  • Initialize the FastAPI application.
  • Define the input model LLMInput for the FastAPI endpoint.
  • Create the /process_llm_output/ POST endpoint that processes the LLM output to generate structured JSON.
  • Generate the logic from the LLM output using LLMHelper.
  • Create the structured output using the VideoAnalysis model and return it as a dictionary.
  • Use uvicorn to run the FastAPI application on the specified host and port.
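
Once the service is running, a hypothetical client call might look like this (assuming the default host and port from the snippet above):

# Hypothetical client call to the running FastAPI service.
import requests

payload = {"llm_output": "The woman in the video is standing outdoors and holding a microphone."}
resp = requests.post("http://localhost:8000/process_llm_output/", json=payload)
print(resp.json())  # e.g. {"screen_interaction_yes": 0, "hands_free": 0, "indoors": 0, "standing": 1}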

The provided code example demonstrates how to use Pydantic to extract specific binary values from an LLM response, ensuring that these key elements are consistently formatted and validated. This approach allows for reliable and predictable processing of LLM-generated data in various applications. By leveraging Pydantic, developers can ensure that the outputs from LLMs adhere to predefined structures, facilitating easier integration and enhancing the overall efficiency and reliability of their systems.

Next Steps 🚀

Moving forward, we will integrate these microservices into a Gradio app and establish a defined workflow. This workflow will take a folder with videos as input, process each video, and store the annotations in a CSV file.

The app will first accept a folder containing multiple video files for processing. Each video will be analyzed using the Chat-UniVi model, and the annotations will be extracted using the custom Pydantic class. These annotations will then be saved in a CSV file, providing a structured format for easy access and further analysis.
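
As a rough sketch of that planned workflow (hypothetical helper names; annotate_video stands in for the Chat-UniVi inference plus the Pydantic post-processing described above):

# Rough sketch of the planned batch workflow; annotate_video is a hypothetical helper
# wrapping Chat-UniVi inference and the Pydantic post-processing described above.
import csv
from pathlib import Path

def annotate_folder(video_dir: str, output_csv: str):
    rows = []
    for video_path in sorted(Path(video_dir).glob("*.mp4")):
        annotation = annotate_video(str(video_path))  # hypothetical: returns a VideoAnalysis
        rows.append({"video": video_path.name, **annotation.dict()})

    with open(output_csv, "w", newline="") as f:
        fieldnames = ["video", "screen_interaction_yes", "hands_free", "indoors", "standing"]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)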

I am excited to work on this part as it involves a lot of engineering and, obviously, a lot of code! 💻🔧

Stay tuned for more updates and code snippets. Follow along for the latest progress!
