A Comparative Analysis of the Image-reading Abilities of LLMs
As the capabilities of large language models have grown, people can now ask an LLM nearly anything, almost as if it were a real person. With this in mind, we wanted to put one of these many capabilities to the test.
We tested the Visual Question Answering (VQA) abilities of ChatGPT and Claude. In this blog, we will look at the responses these models generated, compare their strengths across questions of varying complexity, and discuss what the results mean for VQA as a whole.
Experiment Setup
For this experiment, we selected five images from the famous drama series “Game of Thrones” and asked both models the same two questions about each image. The questions varied in complexity, from identifying the event to inferring the narrative:
1. What event is likely happening in this image?
2. What story or narrative could be inferred from this image?
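For readers who want to reproduce this setup programmatically, the sketch below shows one way to send an image and a question to each model using the official OpenAI and Anthropic Python SDKs. The model identifiers, file names, and helper functions (encode_image, ask_chatgpt, ask_claude) are illustrative assumptions on our part, not the exact configuration behind this experiment.

```python
import base64

import anthropic            # pip install anthropic
from openai import OpenAI   # pip install openai

openai_client = OpenAI()               # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_chatgpt(image_b64: str, question: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def ask_claude(image_b64: str, question: str) -> str:
    msg = claude_client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; any vision-capable model works
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return msg.content[0].text

QUESTIONS = [
    "What event is likely happening in this image?",
    "What story or narrative could be inferred from this image?",
]

for path in ["got_scene_1.jpg"]:  # hypothetical file name; repeat for all five images
    image_b64 = encode_image(path)
    for question in QUESTIONS:
        print("ChatGPT:", ask_chatgpt(image_b64, question))
        print("Claude:", ask_claude(image_b64, question))
```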
The two models often gave different answers, which we evaluated using a binary labeling system: 1 indicated a satisfactory response, and 0 indicated an unsatisfactory one.
The final verdict went to whichever model accumulated the most satisfactory responses.
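To make the tallying concrete, here is a minimal sketch of how the verdict can be computed from the binary labels. The label values shown are illustrative placeholders, not the actual entries from our spreadsheet.

```python
# Illustrative placeholder labels (5 images x 2 questions = 10 labels each);
# these are NOT the actual values from our spreadsheet.
scores = {
    "ChatGPT": [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    "Claude":  [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
}

for model, labels in scores.items():
    print(f"{model}: {sum(labels)}/{len(labels)} satisfactory responses")

# The final verdict goes to the model with the higher tally.
winner = max(scores, key=lambda model: sum(scores[model]))
print(f"Most satisfactory overall: {winner}")
```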
For a detailed view of the data and analysis, click on the embed below to access the full spreadsheet.
Results and Analysis
The results revealed clear differences in how the two models responded to the same visual data.
1. Accuracy:
- ChatGPT: It performed particularly well on the question that required detailed description. When asked for a narrative, it usually provided a detailed, imaginative response that was often quite faithful to the storyline of the scene the image came from.
- Claude: It performed better on the straightforward question and gave straightforward answers. When asked about the elements of an image, its responses were to the point, suggesting that Claude is better suited to tasks that require clear observation.
2. Equal Performances:
- In some cases, though, the two models performed equally well and their answers were quite similar. For images with clear, well-defined scenes, both ChatGPT and Claude were accurate and relevant, indicating that both models share a solid baseline understanding of visual content.
3. Performance Discrepancies:
- Interestingly, ChatGPT was more likely to give imaginative answers even when the image showed a clear, simple scene where a plain explanation would have sufficed; in some of these responses, it was arguably hallucinating.
- Claude, on the other hand, sometimes missed subtle aspects of the images that ChatGPT excelled at explaining. This suggests that while Claude handles simple, direct questions well, it falls short on more abstract ones.
Insights
1. Strengths and Limitations:
- ChatGPT is a good choice for tasks that need interpretation because of its ability to generate unique responses. With its more direct approach, Claude does well on tasks that require simple explanations but struggles when more complex comprehension is needed.
2. Complementary Skills:
- Claude and ChatGPT evaluate visual data in different ways, which suggests the two models could work well together on VQA tasks. For example, combining Claude’s precision with ChatGPT’s creativity could produce more thorough answers (a rough sketch of this idea appears after this list).
3. Room for Improvement:
- Both models showed limitations in image understanding, particularly when images were complicated or required grasping ambiguous concepts. We expect these models to become even better at handling VQA tasks as AI develops, perhaps through deeper reasoning abilities and better image-analysis techniques.
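As a rough illustration of the complementary-skills idea in point 2, one possible combination is to collect an answer from each model and ask one of them to merge Claude’s direct observations with ChatGPT’s narrative detail. This is a hypothetical sketch reusing the ask_chatgpt and ask_claude helpers from the setup section; it is not something we tested in this experiment.

```python
def combined_answer(image_b64: str, question: str) -> str:
    """Hypothetical ensemble (untested): merge Claude's direct observations
    with ChatGPT's narrative detail using a second ChatGPT call."""
    direct = ask_claude(image_b64, question)      # concise, observational
    narrative = ask_chatgpt(image_b64, question)  # detailed, imaginative
    merge_prompt = (
        "Combine the two answers below into one response that keeps the "
        "factual observations of Answer A and the narrative detail of "
        f"Answer B.\n\nAnswer A: {direct}\n\nAnswer B: {narrative}"
    )
    resp = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model works
        messages=[{"role": "user", "content": merge_prompt}],
    )
    return resp.choices[0].message.content
```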
Conclusion
This experiment offers a snapshot of the current state of LLMs in visual question answering. While both ChatGPT and Claude demonstrated impressive capabilities, each has distinct areas where it excels and areas where its visual analysis falls short: ChatGPT shines at giving detailed answers, while Claude excels at delivering concise, straightforward ones.
Overall, this experiment shows how differently these AI models approach visual data, and we look forward to seeing them continue to evolve into something even more powerful.