An Experiment to Unlock Ollama’s Potential in Video Question Answering

Manish Kumar Thota
Jun 30, 2024


GSoC’24 @ Red Hen Labs

An Experiment to Enable Video Understanding in Ollama

What is Ollama

Ollama is a pioneering and popular tool for running large language models locally, boasting 76.1k stars and 5.7k forks on GitHub. As an open-source alternative to closed-source models, Ollama emphasizes privacy, cost savings, and user-friendliness. It enhances data security and control by running models on personal infrastructure, and it offers easy integration through a REST API and libraries for Python and JavaScript.
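For instance, once Ollama is running locally, a model can be pulled and queried over its REST API in a couple of commands (a minimal sketch; llama3 is just an illustrative model name and the prompt is arbitrary):

# Pull a model and query Ollama's local REST API (default port 11434)
ollama pull llama3
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain in one sentence what a vision projector does.",
  "stream": false
}'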

Key benefits of Ollama include enhanced performance through faster data processing and optimized resource utilization, as well as cost efficiency by eliminating the need for expensive cloud services. Its focus on data privacy, through secure local handling and compliance with data protection regulations, sets it apart.

Moreover, Ollama provides model customization for tailored solutions and flexible integration with popular AI tools. The user-friendly interface, combined with comprehensive support and an active community, makes Ollama accessible and efficient for users, marking a new era in AI-driven solutions.

What is llama.cpp

Llama.cpp is a versatile and powerful C/C++ library designed to facilitate the inference of Meta’s LLaMA model and other large language models (LLMs). With over 8.7k forks and 61k stars on GitHub, its primary goal is to enable high-performance inference with minimal setup across various hardware platforms, including local machines and cloud environments. Llama.cpp is optimized for Apple silicon via ARM NEON, Accelerate, and Metal frameworks, and it supports AVX, AVX2, and AVX512 for x86 architectures. This library also includes support for multiple levels of integer quantization, which helps in achieving faster inference times and reduced memory usage. Furthermore, it offers custom CUDA kernels for NVIDIA GPUs and supports AMD GPUs through HIP, Vulkan, and SYCL backends, making it a robust solution for diverse hardware configurations.
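To make the quantization point concrete, the sketch below shows how an F16 GGUF model could be reduced to 4-bit weights with the llama-quantize tool (file names are placeholders, and the binary name assumes a recent llama.cpp build):

# Quantize an F16 GGUF model to Q4_K_M to shrink memory usage (illustrative paths)
./llama-quantize ./models/ggml-model-f16.gguf ./models/ggml-model-Q4_K_M.gguf Q4_K_M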

One of the standout features of llama.cpp is its commitment to being a pure C/C++ implementation without any external dependencies. This design choice ensures that it can be easily integrated into a wide range of projects and platforms, including macOS, Linux, Windows (via CMake), Docker, and FreeBSD. The library supports a variety of models, including different versions of LLaMA, Mistral, Mixtral MoE, DBRX, Falcon, and many others, making it a comprehensive tool for LLM inference. Additionally, it has provisions for multimodal models like LLaVA and ShareGPT4V, which further extends its applicability in various AI-driven applications.

The project has a well-defined roadmap and has seen numerous updates and improvements since its inception. Recent changes include restructuring the source code and CMake build scripts, updating the embeddings API for compactness, and adding support for multi-GPU pipeline parallelism. These enhancements are part of llama.cpp’s ongoing efforts to provide state-of-the-art performance and usability. The library also includes tools and scripts for model conversion, quantization testing, and support for various new hardware features, ensuring it remains at the forefront of LLM inference technology.

Llama.cpp also offers a lightweight HTTP server that is OpenAI API compatible, allowing users to serve local models and connect them easily to existing clients. This feature makes it convenient for developers to integrate LLM capabilities into their applications without needing extensive modifications. Additionally, the library includes bindings for multiple programming languages, such as Python, Go, Node.js, Rust, C#, and many others. These bindings provide flexible integration options, catering to developers with different language preferences and project requirements.
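As a rough sketch of that workflow (model path and port are placeholders), the server can be started and then queried with an OpenAI-style request:

# Start the OpenAI-compatible llama.cpp server with a local GGUF model
./llama-server -m ./models/ggml-model-Q4_K_M.gguf -c 4096 --port 8080

# In another terminal, hit the chat completions endpoint
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Hello from llama.cpp"}]
}'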

The comprehensive ecosystem around llama.cpp includes a variety of UI projects, tools, and utilities that enhance its functionality and ease of use. Open-source projects like iohub/collama, nat/openplayground, and LocalAI provide user interfaces and additional features built on top of llama.cpp. These tools help developers quickly get started with LLM inference, experiment with different models, and deploy their solutions effectively. Overall, llama.cpp stands out as a robust, flexible, and highly performant library for LLM inference, supporting a wide range of models and hardware configurations while offering extensive community support and continuous improvements.

Does Ollama support processing videos?

The answer is currently no, but it is possible. The solution isn't perfect yet; I've followed the steps below, and while the results aren't flawless, I believe the open-source community can help improve them. Note: there are scenarios where the model does not even detect the presence of a video, which requires further debugging and development.

The model is hosted on Ollama; please give it a try, as the experiment is still under development.
Here's a video of Ollama running locally without an Internet connection, using the LLaVA-NeXT-Video model.

Steps I took to Quantise the Video Language Model to GGUF

llama.cpp is the savior here. The commands below come from llama.cpp: Ollama accepts models in GGUF format, so we first need to convert ours. The process is not straightforward, given the lack of video support in llama.cpp, which currently only handles images. Choose a model that is compatible with both vision and language; in our case, we chose lmms-lab/LLaVA-NeXT-Video-7B-DPO for its strong handling of both videos and images.
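If you want to follow along, the model weights first need to be downloaded from Hugging Face. One way to do this (a sketch assuming the huggingface_hub CLI is installed, with a local directory matching the paths in the steps below) is:

# Download the LLaVA-NeXT-Video-7B-DPO weights (run from the llama.cpp root)
pip install -U "huggingface_hub[cli]"
huggingface-cli download lmms-lab/LLaVA-NeXT-Video-7B-DPO --local-dir model_downloads/LLaVA-NeXT-Video-7B-DPO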

Step 1: Llava-surgery

The command below extracts a llava.projector and a llava.clip file and saves them in your model directory:

python examples/llava/llava-surgery-v2.py -C -m /home/manish/llama.cpp/model_downloads/LLaVA-NeXT-Video-7B-DPO

Step 2: Create a new directory and copy the extracted files into it, along with the ViT config

mkdir vit
cp /home/manish/llama.cpp/model_downloads/LLaVA-NeXT-Video-7B-DPO/llava.clip vit/pytorch_model.bin
cp /home/manish/llama.cpp/model_downloads/LLaVA-NeXT-Video-7B-DPO/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json

Step 3: Create the vision projector in GGUF format. You have the option of converting it to F32, F16, or Q8_0.

python ./examples/llava/convert-image-encoder-to-gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision

Step 4: Verify everything is working by running the following command.

./llama-llava-cli -m LLaVA-NeXT-Video-7B-DPO-7B-F16 --mmproj /mmproj-model-f16.gguf --image /llama-leader.jpeg -c 4096

Input an image for testing the model.

Output from the 7B parameter projector

It is clear that they are not camels, so the quantized 7B model's description misses the mark. Perhaps the 13B parameter model could perform better in a quantized format.
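Whatever the output quality, the resulting GGUF files can then be imported into Ollama through a Modelfile. The sketch below is how I would wire this up in principle; the file and model names are placeholders, and attaching the vision projector is precisely the experimental part, so treat it as an outline rather than a verified recipe:

# Write a minimal Modelfile pointing at the converted GGUF weights (placeholder file name)
cat > Modelfile <<'EOF'
FROM ./LLaVA-NeXT-Video-7B-DPO-F16.gguf
EOF

# Register and run the model locally in Ollama
ollama create llava-next-video-experiment -f Modelfile
ollama run llava-next-video-experiment
# NOTE: how the vision projector (mmproj-model-f16.gguf) gets attached is the open
# question in this experiment and may differ across Ollama versions.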

I’ve also tried two other models with Llama and Mistral variants. Don’t worry, all the files are uploaded to Hugging Face. I need the community’s help to fill the gaps, correct the errors, and improve the model.

As part of Google Summer of Code, I will be working on video understanding. However, I am putting the work with the Ollama models aside for now, as they have significant room for improvement. Instead, I will be dockerizing my solution for the Chat-UniVi model and will soon share a demo. This will allow everyone to try out the video annotation solution.

If you would like to collaborate, please reach out to me on LinkedIn 📩🤝.
