I Tested Frontline M-LLMs on Their Chart Interpretation Skills
Can multimodal LLMs infer basic charts accurately?
Multimodal LLMs (MLLMs) promise to interpret anything in an image. That holds for most tasks, such as image captioning and object detection.
But can they reasonably and accurately understand data presented on a chart?
If you really want to build an app that tells you what to do when you point your camera at a car dashboard, the LLM's chart-interpretation skills need to be exceptional.
Of course, multimodal LLMs can narrate what's on a chart, but actually consuming the data and answering complex user questions is far more challenging.
I wanted to find out how difficult it is.
I set up eight challenges for the LLMs to solve. Each challenge pairs a rudimentary chart with a question for the LLM to answer. I know the correct answer because I created the data, but the LLM has to work it out using only the visualization it's given.
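To make the setup concrete, here's a minimal sketch of what one such challenge could look like: generate data I control, render a plain chart, and ask a vision model a question whose answer I already know. The sample data, question, and the GPT-4o call are illustrative assumptions, not the exact harness used in these tests.

```python
import base64
from io import BytesIO

import matplotlib.pyplot as plt
from openai import OpenAI

# Ground-truth data I control, so the correct answer is known (May, 170 units).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 110, 170, 160]

# Render a rudimentary chart — the only thing the model gets to see.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, sales)
ax.set_title("Monthly sales")
ax.set_ylabel("Units sold")

# Encode the chart as a base64 PNG so it can be passed to a vision API.
buf = BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
image_b64 = base64.b64encode(buf.getvalue()).decode()

# Ask the model a question about the chart and compare its answer to the truth.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Which month had the highest sales, and roughly how many units?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same chart image can be sent to each provider's vision endpoint, so every model answers the identical question from the identical pixels.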
As of writing this, and to the best of my knowledge, there are five prominent multimodal LLM providers in the market: OpenAI (GPT-4o), Meta (Llama 3.2, 11B and 90B models), Mistral with its brand-new Pixtral 12B, Anthropic (Claude 3.5 Sonnet), and Google (Gemini 1.5).