Talk To Your Image — A Step-by-Step Guide to LLaVA-1.5
What is LLaVA?
LLaVA (Large Language-and-Vision Assistant) is a model built by combining a vision encoder with an LLM and training the two end-to-end.
The vision encoder processes visual data such as images and transforms it into a latent representation.
The LLM, in turn, takes both the vision encoder's output and the text input and generates a response.
By training these two components end-to-end, LLaVA connects vision and language within a single multimodal model.
As an early study in visual instruction tuning, LLaVA demonstrated strong visual reasoning performance.
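To make the two-component design above concrete, here is a minimal PyTorch sketch of the idea, not LLaVA's actual implementation: a vision encoder produces patch features, a learned projection maps them into the LLM's embedding space, and the LLM attends over the concatenated image and text tokens. The class names, the `DummyVision` stand-in, and the dimensions are assumptions chosen for illustration; the real model pairs a CLIP ViT-L/14 encoder with a Vicuna LLM.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Sketch of the two-component design: vision encoder -> projection -> LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP ViT backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # maps patch features into the LLM embedding space
        self.llm = llm

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)      # (B, n_patches, vision_dim)
        image_tokens = self.projector(patch_feats)           # (B, n_patches, llm_dim)
        seq = torch.cat([image_tokens, text_embeds], dim=1)  # image tokens precede the text
        return self.llm(seq)                                 # gradients flow through all parts jointly

# Smoke test with stand-in modules; the real model uses CLIP ViT-L/14 and Vicuna.
class DummyVision(nn.Module):
    def forward(self, pixel_values):                          # (B, 3, 336, 336)
        return torch.randn(pixel_values.shape[0], 576, 1024)  # 576 fake patch features

model = LlavaStyleModel(DummyVision(), llm=nn.Identity())
out = model(torch.randn(2, 3, 336, 336), torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 592, 4096])
```

The key design point is the projection layer: because the LLM only ever sees vectors in its own embedding space, a single learned mapping is enough to let image features participate in the text sequence like ordinary tokens.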
Challenges with LLaVA
However, LLaVA underperformed on academic benchmarks that demand short-form answers, such as replying with only the letter or number of the correct option from a given set of choices.
This weakness is believed to stem from the fact that, unlike comparable models, LLaVA is not pre-trained on large-scale image-text data.
(Note: LLaVA instead relies on image-text conversation data automatically generated by GPT-4.)
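As an illustration, a multiple-choice benchmark question of the kind LLaVA struggled with might look like the following. The wording of the answer-format instruction is an assumed example, since exact prompts vary by benchmark.

```python
# Hypothetical short-form benchmark prompt (illustrative wording only).
question = (
    "What is the weather in the image?\n"
    "A. Sunny\n"
    "B. Rainy\n"
    "C. Snowy\n"
    "Answer with the option's letter from the given choices directly."
)
# Expected short-form answer: "A"
# The original LLaVA tended to reply with a full sentence instead
# (e.g. "The weather in the image appears to be sunny ..."),
# which exact-match benchmark scorers count as wrong.
```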
Research purpose and overview
In this research, we conducted the following investigations and verifications with the main purpose of improving the…