GPT4-Vision and its Alternatives

Nagesh Mashette
9 min read · Nov 27, 2023


Introduction

In September 2023, OpenAI introduced the functionality to query images using GPT-4. Just one month later, during the OpenAI DevDay, these features were incorporated into an API, granting developers the opportunity to construct applications that leverage GPT-4’s capabilities with image inputs.

GPT4-Vision

Although GPT-4 with Vision has garnered considerable interest, it’s essential to note that this service is just one among numerous Large Multimodal Models (LMMs). Large Multimodal Models are a class of language models designed to process and understand various types of information, known as “modalities,” including but not limited to images and audio.

What is GPT-4 with Vision?

Introduced in September 2023, GPT-4 with Vision offers the capability to pose inquiries regarding the contents of images, a concept known as visual question answering (VQA). VQA is a well-explored domain within the field of computer vision. Additionally, GPT-4 with Vision facilitates other visual tasks, including Optical Character Recognition (OCR), where the model interprets characters within an image.

With GPT-4 with Vision, users can ask about the presence or absence of elements in an image, the relationships between objects, spatial configurations (such as the relative positioning of one object to another), the color attributes of objects, and more.

Accessible through the OpenAI web interface for ChatGPT Plus subscribers and the OpenAI GPT-4 Vision API, GPT-4 with Vision extends its utility beyond the basic text domain.

It is important to note the following:

  • GPT-4 Turbo with vision may behave slightly differently than GPT-4 Turbo, due to a system message that OpenAI automatically inserts into the conversation.
  • GPT-4 Turbo with vision is the same as the GPT-4 Turbo preview model and performs equally well on text tasks, but with vision capabilities added.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])

The model is best at answering general questions about what is present in the images. While it does understand the relationship between objects in images, it is not yet optimized to answer detailed questions about the location of certain objects in an image. For example, you can ask it what color a car is or what some ideas for dinner might be based on what is in your fridge, but if you show it an image of a room and ask it where the chair is, it may not answer the question correctly.

The Chat Completions API can take in and process multiple image inputs, either base64 encoded or as image URLs. The model will process each image and use the information from all of them to answer the question.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0])
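
If your images live on disk rather than at a public URL, they can be sent base64 encoded instead. The snippet below is a minimal sketch of that variant; the local file name boardwalk.jpg is a placeholder for your own image.

import base64
from openai import OpenAI

def encode_image(path: str) -> str:
    # Read a local image file and return its base64-encoded contents.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

client = OpenAI()
base64_image = encode_image("boardwalk.jpg")  # placeholder path

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # Base64 images are passed as a data URL with the MIME type.
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)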

Low or High fidelity image understanding

The detail parameter, which has three options (low, high, or auto), controls how the model processes the image and generates its textual understanding. By default, the model uses the auto setting, which looks at the image input size and decides whether to use the low or high setting.

  • low will disable the “high res” model. The model will receive a low-res 512px x 512px version of the image, and represent the image with a budget of 65 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
  • high will enable “high res” mode, which first allows the model to see the low res image and then creates detailed crops of input images as 512px squares based on the input image size. Each of the detailed crops uses twice the token budget (65 tokens) for a total of 129 tokens.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

Limitations

While GPT-4 with Vision is powerful and can be used in many situations, it is important to understand the model's limitations. The OpenAI documentation lists several known limitations:

  • Medical images: The model is not suitable for interpreting specialized medical images like CT scans and shouldn’t be used for medical advice.
  • Non-English: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.
  • Small text: Enlarge text within the image to improve readability, but avoid cropping important details.
  • Rotation: The model may misinterpret rotated / upside-down text or images.
  • Visual elements: The model may struggle to understand graphs or text where colors or styles like solid, dashed, or dotted lines vary.
  • Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
  • Accuracy: The model may generate incorrect descriptions or captions in certain scenarios.
  • Image shape: The model struggles with panoramic and fisheye images.
  • Metadata and resizing: The model doesn’t process original file names or metadata, and images are resized before analysis, affecting their original dimensions.
  • Counting: The model may give approximate counts for objects in images.
  • CAPTCHAs: For safety reasons, OpenAI blocks the submission of CAPTCHAs.

Some important points before working with GPT-4 with Vision:

  1. The API currently supports PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif).
  2. Image uploads are restricted to 20 MB per image (a quick local check for these constraints is sketched after this list).
  3. If an image is ambiguous or unclear, the model will do its best to interpret it. However, the results may be less accurate. A good rule of thumb is that if an average human cannot see the info in an image at the resolutions used in low/high res mode, then the model cannot either.
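
Before uploading, you may want to verify these constraints locally. The helper below is a small illustrative sketch based on the points above; the function name and file path are hypothetical and not part of the OpenAI SDK.

import os

SUPPORTED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
MAX_SIZE_BYTES = 20 * 1024 * 1024  # 20 MB per image

def check_image(path: str) -> None:
    # Raise an error if the image violates the documented format or size limits.
    # (Note: this does not verify that a GIF is non-animated.)
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported image format: {ext}")
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        raise ValueError(f"Image exceeds the 20 MB limit: {path}")

check_image("boardwalk.jpg")  # placeholder path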

Alternatives to GPT-4 with Vision

The computer vision industry is undergoing rapid advancements, with multimodal models increasingly shaping its landscape. Simultaneously, fine-tuned models are proving their worth across various applications, a discussion of which follows below.

OpenAI is part of a broader community of research teams delving into Large Multimodal Model (LMM) research. Over the past year, a multitude of models, including GPT-4 with Vision, LLaVA, BakLLaVA, CogVLM, Qwen-VL, and others, have emerged, all aiming to integrate text and image data to construct comprehensive LMMs.

Each model possesses its unique strengths and weaknesses. LMMs exhibit extensive capabilities, spanning from Optical Character Recognition (OCR) to Visual Question Answering (VQA), making direct comparisons challenging; this underscores the importance of benchmarking. The models discussed below therefore do not seek to "replace" GPT-4 with Vision but rather provide alternatives.

Let’s delve into some of these alternatives to GPT-4 with Vision, highlighting their respective advantages and drawbacks.

Qwen-VL

Paper — https://arxiv.org/abs/2308.12966?ref=blog.roboflow.com

Code — https://github.com/QwenLM/Qwen-VL?ref=blog.roboflow.com

Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs. The model can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation. Thus, this model may be worth exploring if you have a use case where you expect Chinese and English to be used in prompts or answers.

Below, we use Qwen-VL to ask which movie a photo was taken from:

Qwen-VL successfully identifies the movie from the provided image as Home Alone.
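
For local experimentation, a similar query can be run against the Qwen-VL-Chat checkpoint on Hugging Face. The sketch below follows the pattern from that model card (the checkpoint ships its own custom code, hence trust_remote_code=True); the image path and question are placeholders, and the exact interface may change between releases.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a multimodal query from an image reference plus a text question.
query = tokenizer.from_list_format([
    {"image": "movie_still.jpg"},  # placeholder path or URL
    {"text": "Which movie is this photo from?"},
])

response, history = model.chat(tokenizer, query=query, history=None)
print(response)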

CogVLM

Paper — https://arxiv.org/abs/2311.03079?ref=blog.roboflow.com

Code — https://github.com/THUDM/CogVLM?ref=blog.roboflow.com

CogVLM can understand and answer various types of questions and has a visual grounding version. Grounding is the ability of the model to connect and relate its responses to real-world knowledge and facts, in this case the objects in the image.

CogVLM can accurately describe images in detail with very few hallucinations. The image below shows the instruction "Find the dog". CogVLM has drawn a bounding box around the dog and provided its coordinates. This shows that CogVLM can be used for zero-shot object detection, as it returns the coordinates of grounded objects. (Note: In the demo space we used, CogVLM plotted bounding boxes for predictions.)

CogVLM, like GPT-4 with Vision, can also answer questions about infographics and documents with statistics or structured information. Given a photo of a menu, we asked CogVLM to return the price of a pastrami pizza. The model returned the correct response.

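If you want to try CogVLM yourself, the chat checkpoint published on Hugging Face (THUDM/cogvlm-chat-hf) exposes a custom conversation builder via trust_remote_code. The sketch below is adapted from that model card's pattern and should be treated as an assumption about the current interface rather than a guaranteed API; the image path and question are placeholders.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# CogVLM pairs its vision-language model with a Vicuna/LLaMA tokenizer.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("menu.jpg").convert("RGB")  # placeholder image
inputs = model.build_conversation_input_ids(
    tokenizer,
    query="What is the price of a pastrami pizza?",
    history=[],
    images=[image],
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
    answer = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
print(answer)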

LLaVA

Paper — https://arxiv.org/abs/2304.08485?ref=blog.roboflow.com

Code — https://github.com/haotian-liu/LLaVA?ref=blog.roboflow.com

The Large Language and Vision Assistant (LLaVA) is a notable Large Multimodal Model (LMM) developed by Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. As of this writing, the latest iteration of LLaVA is version 1.5.

Considered by many as a leading alternative to GPT-4V, LLaVA 1.5 has gained prominence, particularly in the field of Visual Question Answering (VQA). Users can engage with LLaVA by posing questions about images and obtaining relevant answers, showcasing its applicability in multimodal tasks.
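
LLaVA 1.5 can also be run locally through the Hugging Face transformers integration. The sketch below assumes the community llava-hf/llava-1.5-7b-hf checkpoint and its USER/ASSISTANT prompt format; treat it as a starting point under those assumptions rather than the canonical recipe.

import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # community conversion of LLaVA 1.5
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Reuse the boardwalk photo from the GPT-4 with Vision examples above.
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA 1.5 expects the <image> token inside a USER/ASSISTANT style prompt.
prompt = "USER: <image>\nWhat's in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))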

The Future of Multimodality

GPT-4 with Vision marked a significant milestone in bringing multimodal language models to a global audience. The integration of GPT-4 with Vision into the GPT-4 web interface empowered users worldwide to upload images and seek answers to questions related to them. However, it’s crucial to recognize that GPT-4 with Vision represents just one facet of the diverse landscape of multimodal models available.

The realm of multimodality is a dynamic field of ongoing research, witnessing the regular introduction of new models. Remarkably, all the models mentioned earlier were unveiled in the year 2023. Notable multimodal models like GPT-4 with Vision, LLaVA, and Qwen-VL showcase their proficiency in tackling an array of vision-related challenges, spanning from Optical Character Recognition (OCR) to Visual Question Answering (VQA).

This guide has provided insights into two primary alternatives to GPT-4 with Vision:

  1. Fine-tuned Models: Optimal for tasks requiring precise identification of object locations within images.
  2. Other LMMs: Models such as LLaVA, Qwen-VL, or CogVLM offer versatile solutions, particularly well-suited for Visual Question Answering (VQA) and similar tasks.

Given the rapid pace of development in the realm of multimodal models, it’s anticipated that numerous innovations will emerge in the coming years, further expanding the landscape of possibilities. Stay tuned for the exciting advancements that will shape the future of multimodal language models.

Conclusion

In summary, the emergence of Large Multimodal Models (LMMs) like GPT-4 with Vision, LLaVA, BakLLaVA, Qwen-VL, and CogVLM marks a transformative era in artificial intelligence. These models seamlessly combine language understanding with visual perception, addressing a range of tasks from Optical Character Recognition (OCR) to Visual Question Answering (VQA).

The dynamic nature of this field, evident in the continuous development of new models, underlines their growing significance. Beyond novelty, LMMs are becoming indispensable tools for developers and businesses, promising innovative solutions in natural language understanding and computer vision.

Looking forward, the trajectory of LMMs points to even more sophisticated applications and capabilities. As these models integrate into various domains, they have the potential to redefine how we approach complex tasks, ushering in a future where AI effortlessly comprehends and interacts with diverse forms of information.

A heartfelt thank you for reading this blog till the end! If you enjoyed the exploration of Large Multimodal Models, give it a virtual applause by clapping 👏 and don’t forget to follow my page for more exciting updates. Your support and engagement mean the world to me. Until next time, thank you for being part of this journey! 🙏✨ #ClapAndFollow

Blog — https://openai.com/research/gpt-4v-system-card
