Is Image Detection Finally a Done Deal?

Yes, it is, or at least seems very close to done, with very large vision-language models

Alex Punnen
Better ML
4 min read · May 29, 2024

--

In 2018, after about a year and a half of working with the then state-of-the-art convolutional neural networks for object detection and image recognition, especially on drone videos, and frustrated with the difficulty of getting consistent results, I wrote an article arguing that object detection was not a done deal yet. You can read it here: https://medium.com/data-science-engineering/is-object-detection-a-done-deal-59a7be913fd2

Here is a picture from that article, on which YOLO, the state-of-the-art model at the time, had been run. It thought this was the keyboard of a laptop, presumably because of the white dots on the dark background.
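For context, this is roughly what such a detection pass looks like today. The sketch below is a minimal, assumed setup using the ultralytics package with a small pretrained checkpoint and a placeholder image path; it is not the exact model or pipeline from the 2018 experiment.

```python
# Minimal sketch: run a pretrained YOLO detector on one image and print
# the predicted labels. The weights file and image path are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # small pretrained COCO checkpoint (assumed)
results = model("drone_frame.jpg")  # run inference on a single frame

for result in results:
    for box in result.boxes:
        cls_name = result.names[int(box.cls)]   # class label for this box
        conf = float(box.conf)                  # detection confidence
        print(f"{cls_name}: {conf:.2f}")
```

A detector like this can only pick from the fixed set of classes it was trained on, which is part of why it confidently labelled a highway scene as a keyboard.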

Fast-forward to today: vision-language models such as ChatGPT 4/4o and Gemini can handle this. I had used these models on some generic hand-drawn images and was impressed with how well they could generalise. For example, the image below is my earliest memory of drawing a car as a child; I had a bit of OCD and tended to add extra wheels to my drawings then. I put this to ChatGPT 4o.

ChatGPT and the artist me

This was impressive, as very few systems can generalise to such a degree. So I gave the older picture to ChatGPT.

ChatGPT 4o, able not just to detect but to understand the picture

In case you cannot read the response in the picture:

This image appears to be an aerial view of a highway with multiple lanes and vehicles. It has several bounding boxes labeled “keyboard” and “laptop.” The bounding boxes are likely results from an object detection algorithm or model that has mistakenly identified parts of the scene as keyboards and laptops. This could be due to a misconfiguration of the model or incorrect training data.

The scene itself shows a busy highway with adjacent desert or barren land on one side, possibly indicating a location in a more arid region. The presence of multiple labels suggests an attempt at object detection, but the accuracy is compromised given the unlikely identification of keyboards and laptops on a highway.
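For anyone who wants to reproduce this kind of test programmatically rather than through the web UI, here is a minimal sketch using the OpenAI Python client. The prompt and the image filename are placeholder assumptions, and the web interface I used may behave slightly differently from the raw API.

```python
# Minimal sketch: ask GPT-4o to describe an image via the OpenAI API.
# Assumes the openai package is installed and OPENAI_API_KEY is set;
# "highway_yolo_output.png" is a placeholder for the old YOLO screenshot.
import base64
from openai import OpenAI

client = OpenAI()

with open("highway_yolo_output.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```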

I rest my case. There is no need for CapsuleNet; the future is already here, and image detection, the most difficult part, now seems to be a done deal after all, or very much on its way there.

I am almost getting the vibe of that famous movie line: "You want the Truth? You can't handle the Truth." The Truth is now staring at us from our web page, uncannily able to understand images.

And then I gave it this and, behold, got a glimpse of the Truth.

To see whether it has spatial understanding, I tried a trick image; it needed a bit of prodding but got it. I don't want to start a comparison here, but Gemini Pro could not get it as of today.

Take 1

Then I gave it a proper image and it told the same story at first, but after some prodding it got it:

“Typically, faucets are designed with the handle oriented upwards and the spout directed downwards to allow for easy water flow and accessibility. Therefore, the second image seems to depict a more standard installation.

The first image appears to show an unusual orientation where the spout is directed upwards, which might not be practical for regular use and could result in water splashing upwards.

Based on standard installation practices, the faucet in the second image (the later one) is installed properly.”

Take 2: Bravo, ChatGPT 4o

And a consolation prize for Gemini Pro, for noticing the insect, which took ChatGPT 4o a little prodding to get

Gemini Pro: a keen eye

And finally, thank you Meta/Facebook AI Research for open-sourcing this great resource today, which will help us understand all of this better: "An Introduction to Vision-Language Modeling" https://x.com/AIatMeta/status/1795499770519392499
