GPT-4V(ision): A compendium of all use cases


Prominent ML Figures

Seminal Paper: The Dawn of LMMs (Large Multimodal Models)

Sign up for GPT-4V: ChatGPT can now see, hear, and speak

The field of artificial intelligence has advanced rapidly, with large language models (LLMs) like GPT-4 making headlines for their versatility and capabilities across many domains and tasks. Building on the success of LLMs, the next frontier in AI research is large multimodal models (LMMs). These models aim to integrate multiple sensory modalities, such as vision and language, to achieve even stronger general intelligence.

One of the primary motivations for LMMs is the dominance of vision in human perception. As a result, many LMM research efforts begin by extending the vision capabilities of these models. This extension can take one of two forms: fine-tuning a vision encoder to align with a pre-trained LLM, or using a vision-language model to convert visual inputs into text descriptions that an LLM can comprehend. Below, we delve into our preliminary explorations with GPT-4V, a state-of-the-art LMM with vision, built on the foundation of a leading LLM and trained with a vast amount of multimodal data.
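To make the second approach concrete, here is a minimal sketch of a caption-then-prompt pipeline, in which a vision-language model turns the image into text and a text-only LLM answers from that text. This is not how GPT-4V itself is built or called. The names `Captioner`, `TextLLM`, `build_prompt`, `answer_about_image`, `fake_captioner`, and `fake_llm` are all hypothetical, and the two stand-in models only exist so that the sketch runs without external dependencies.

```python
from typing import Callable

# Hypothetical type aliases: any captioning model and any text-only LLM
# can be plugged in as plain callables.
Captioner = Callable[[bytes], str]  # image bytes -> text description
TextLLM = Callable[[str], str]      # prompt -> completion


def build_prompt(caption: str, question: str) -> str:
    """Fold the image description into a text-only prompt."""
    return (
        "Image description: " + caption + "\n"
        "Question: " + question + "\n"
        "Answer:"
    )


def answer_about_image(
    image: bytes, question: str, captioner: Captioner, llm: TextLLM
) -> str:
    """Caption the image, then let the text-only LLM answer from the caption."""
    caption = captioner(image)
    return llm(build_prompt(caption, question))


# Stand-in models (assumptions for illustration only, not real APIs).
def fake_captioner(image: bytes) -> str:
    return "a dog catching a frisbee in a park"


def fake_llm(prompt: str) -> str:
    return "(model reply to a prompt of %d characters)" % len(prompt)


if __name__ == "__main__":
    print(answer_about_image(b"", "What is the dog doing?", fake_captioner, fake_llm))
```

In this design the LLM never sees pixels, so anything the caption omits is lost. That limitation is one reason the alternative of aligning a vision encoder directly with the LLM is attractive.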


Following Text Instructions