Talk To Your Image — A Step-by-Step Guide to LLaVA-1.5
What is LLaVA?
LLaVA (Large Language-and-Vision Assistant) is a model built by combining a vision encoder with an LLM and training the two end-to-end.
The vision encoder processes visual data such as images and transforms it into a latent representation.
The LLM, in turn, takes both the vision encoder's output and the text input and generates a response.
By training these two components end-to-end, LLaVA connects vision and language within a single multimodal model.
As an early study in visual instruction tuning, LLaVA demonstrated strong visual reasoning performance.
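To make the two-component design above concrete, here is a minimal PyTorch sketch of the idea, not LLaVA's actual implementation: a vision encoder produces patch features, a learned projection maps them into the LLM's embedding space, and the LLM attends over the concatenated image and text tokens. The class names, the `DummyVision` stand-in, and the dimensions are assumptions chosen for illustration; the real model pairs a CLIP ViT-L/14 encoder with a Vicuna LLM.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Sketch of the two-component design: vision encoder -> projection -> LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP ViT backbone
        self.projector = nn.Linear(vision_dim, llm_dim)   # maps patch features into the LLM embedding space
        self.llm = llm

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)      # (B, n_patches, vision_dim)
        image_tokens = self.projector(patch_feats)           # (B, n_patches, llm_dim)
        seq = torch.cat([image_tokens, text_embeds], dim=1)  # image tokens precede the text
        return self.llm(seq)                                 # gradients flow through all parts jointly

# Smoke test with stand-in modules; the real model uses CLIP ViT-L/14 and Vicuna.
class DummyVision(nn.Module):
    def forward(self, pixel_values):                          # (B, 3, 336, 336)
        return torch.randn(pixel_values.shape[0], 576, 1024)  # 576 fake patch features

model = LlavaStyleModel(DummyVision(), llm=nn.Identity())
out = model(torch.randn(2, 3, 336, 336), torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 592, 4096])
```

The key design point is the projection layer: because the LLM only ever sees vectors in its own embedding space, a single learned mapping is enough to let image features participate in the text sequence like ordinary tokens.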
Challenges with LLaVA
However, LLaVA underperformed on academic benchmarks that demand short-form answers, such as replying with only the letter or number of the correct option from a given set of choices.
This weakness is believed to stem from the fact that, unlike comparable models, LLaVA is not pre-trained on large-scale image-text data.
(Note: LLaVA instead relies on image-text conversation data automatically generated by GPT-4.)
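As an illustration, a multiple-choice benchmark question of the kind LLaVA struggled with might look like the following. The wording of the answer-format instruction is an assumed example, since exact prompts vary by benchmark.

```python
# Hypothetical short-form benchmark prompt (illustrative wording only).
question = (
    "What is the weather in the image?\n"
    "A. Sunny\n"
    "B. Rainy\n"
    "C. Snowy\n"
    "Answer with the option's letter from the given choices directly."
)
# Expected short-form answer: "A"
# The original LLaVA tended to reply with a full sentence instead
# (e.g. "The weather in the image appears to be sunny ..."),
# which exact-match benchmark scorers count as wrong.
```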
Research purpose and overview
In this research, we conducted the following investigations and verifications with the main purpose of improving the…