Talk To Your Image — A Step-by-Step Guide to LLaVA-1.5

Gao Dalie (高達烈)
Automation Architect
Oct 16, 2023


What is LLaVA?

LLaVA (Large Language and Vision Assistant) is a model that combines a vision encoder with an LLM and can be trained end-to-end.

A vision encoder processes visual data like images and transforms it into a latent representation.

The LLM, in turn, processes both the vision encoder's output and the text input to generate a response.

LLaVA trains these two components end-to-end to enable multimodal vision-language understanding.
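Concretely, the glue between the two components is small. Here is a minimal sketch (not the author's code) of the projector that maps vision-encoder features into the LLM's token-embedding space. The dimensions are illustrative assumptions, roughly matching CLIP ViT-L/14 features (width 1024) and a 7B LLaMA-style LLM (hidden size 4096); LLaVA-1.5 uses a two-layer MLP here, while the original LLaVA used a single linear layer.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space.

    LLaVA-1.5 uses a two-layer MLP for this projector; the original
    LLaVA used a single linear layer. Dimensions here are illustrative.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # Returns image "tokens" of shape (batch, num_patches, llm_dim),
        # which are concatenated with the text token embeddings before
        # being fed to the LLM.
        return self.proj(patch_features)

# Example: 576 patch features, as produced by CLIP ViT-L/14 at 336px
# resolution (24 x 24 patches).
projector = VisionToLLMProjector()
image_tokens = projector(torch.randn(1, 576, 1024))
print(image_tokens.shape)  # torch.Size([1, 576, 4096])
```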

As one of the earliest studies in visual instruction tuning, LLaVA demonstrated strong visual reasoning performance.
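For the "talk to your image" part itself, here is a minimal end-to-end inference sketch, assuming the community llava-hf/llava-1.5-7b-hf checkpoint on Hugging Face and a recent transformers release (the image URL is just a placeholder):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # needs accelerate
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this URL is a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 expects this chat format: the <image> placeholder marks
# where the projected image tokens are spliced into the prompt.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```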

LLaVA challenges

However, LLaVA underperformed on academic benchmarks that demand short-form responses, such as answering with the number of the correct option from a given set.

This weakness is believed to stem from the fact that LLaVA, unlike other contemporaneous models, is not pre-trained on large-scale image-text data.
(Supplement: LLaVA instead uses image-text conversation data automatically generated by GPT-4.)
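To see the failure mode concretely, here is a hypothetical benchmark-style item (illustrative only, not drawn from any real dataset). Exact-match graders expect the bare option number, while a conversation-tuned model tends to answer in prose:

```python
# Hypothetical short-answer benchmark item: the grader expects "2".
question = (
    "How many cats are in the image?\n"
    "Options: (1) one  (2) two  (3) three  (4) four\n"
    "Answer with the option's number only."
)
expected_answer = "2"

# A conversation-tuned model like the original LLaVA tends to reply in
# prose, which an exact-match grader scores as wrong even though the
# content is correct.
model_reply = "There are two cats lying on the couch."
print(model_reply.strip() == expected_answer)  # False -> counted as an error
```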

Research purpose and overview

In this research, we conducted the following investigations and verifications with the main purpose of improving the…

Learn about AI Agents, LLMs, RAG & Generative AI. See everything I have to offer at the link below: https://linktr.ee/GaoDalie_AI