Multimodal LLMs (MLLMs) promise that they can interpret anything in an image. That holds for many tasks, such as image captioning and object detection.
But can they reasonably and accurately understand data presented on a chart?
If you really want to build an app that tells you what to do when you point your camera at a car dashboard, the LLM's chart-interpretation skills need to be exceptional.
Of course, multimodal LLMs can narrate what's on a chart, but consuming the data and answering complex user questions is hard.
I wanted to find out how difficult it is.
I set up eight challenges for LLMs to solve. Each challenge has a rudimentary chart and a question for the LLM to answer. We know the correct answer because we created the data, but the LLM has to figure it out using only the visualization given to it.
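To make the setup concrete, here is a minimal sketch of how such a challenge could be built: generate data you control, render it with matplotlib, and send the resulting image to a multimodal model (GPT-4o is used here as an example). The data values, prompt, and file name are hypothetical, not the actual challenges used in the experiments.

```python
# Minimal sketch (hypothetical data and prompt): render a chart from known
# data, then ask a multimodal model a question about the image.
import base64

import matplotlib.pyplot as plt
from openai import OpenAI

# Data we control, so the correct answer is known in advance.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 95, 140, 160, 130]  # hypothetical values

# Render a rudimentary chart and save it as an image.
plt.bar(months, sales)
plt.title("Monthly sales")
plt.savefig("challenge.png")

# Encode the image and ask the model a question about it.
with open("challenge.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Which month had the highest sales, and by how much "
                     "did it exceed the second highest?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
# Compare the model's answer against the known ground truth (Apr, +20).
print(response.choices[0].message.content)
```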
As of writing this, and to my understanding, there are five prominent multimodal LLM providers on the market: OpenAI (GPT-4o), Meta Llama 3.2 (11B & 90B models), Mistral with its brand-new Pixtral 12B, Claude 3.5 Sonnet, and Google's Gemini 1.5.