How GPT-4V(ision) will revolutionise image annotation

Rishi Swethan
5 min read · Oct 15, 2023


We’ve all heard of smaller LLMs and smaller text classifiers being trained on ChatGPT’s responses. But we are entering a new era where LLMs can accurately interpret images. Until now, ChatGPT’s vision capabilities were just a neat party trick, but recent improvements mean its multimodal capabilities can outperform many purpose-built vision models. Although this is a tremendous improvement, vision is not yet solved. There are fundamental limitations to using ChatGPT directly in production, and, as we will see, we can use ChatGPT itself to work around them.

Images are just a bunch of numbers arranged in a grid. A matrix, if you will. So far, we have only been able to train models to solve narrow problems, such as classification, localisation, and so on. Some models were able to “describe” an image, but they quickly fell apart because of their limited accuracy and their inability to discern the more complex aspects of a scene. This is where ChatGPT comes in. Thanks to its improved accuracy, we can now ask it to “annotate” images in whatever way we need.

How does this help with annotating images?

Let’s say we need to train a small model on a task for which we don’t have enough annotations. We can now collect related images from the internet using a script and use ChatGPT’s API to annotate them. We can then ask a human to review the results, which takes only a fraction of the time it would take to annotate everything from scratch.
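
As a rough illustration, here is a minimal sketch of that annotation step using the OpenAI Python client. The model name, prompt wording and one-word label format are my own assumptions for this example, so check the current API documentation and pricing before relying on it.

```python
# pip install openai
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are labelling images for a small classifier. "
    "Reply with a single word: 'good' if the subject is well lit, 'bad' otherwise."
)

def annotate(image_path: Path) -> str:
    """Send one local image to a vision-capable model and return its one-word label."""
    image_b64 = base64.b64encode(image_path.read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumption: use whichever vision-capable model you have access to
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    # Assumes the scraped images were saved into a local raw_images/ folder.
    for path in Path("raw_images").glob("*.jpg"):
        print(path.name, annotate(path))
```

The one-word reply is deliberate: the more constrained the output, the easier it is to post-process and the less room the model has to ramble.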

Why train a new model instead of using ChatGPT’s API directly?

In numerous instances, it may be beneficial to simply use ChatGPT’s API in your product, but there are 3 big reasons why this may not be an ideal solution.

Cost

It costs around $0.12 to get a response for a 224x224x3 image. Most low-cost annotation services, like the one I run at serna.ai, offer lower prices with human annotators. So why use this roundabout method? Because ChatGPT can annotate images much faster than a human can. We then simply ask an annotator to review and perfect these annotations. There may also be cases where ChatGPT is more reliable than a human, such as in Example 1 below.

If annotating a dataset once is already this expensive, you can imagine why running the model in real time would be prohibitive.
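
To make that concrete, here is some back-of-the-envelope arithmetic. The $0.12-per-image figure is the one quoted above; the dataset size, camera count and frame rate are made-up numbers purely for illustration.

```python
# Back-of-the-envelope cost comparison (illustrative numbers only).
COST_PER_IMAGE = 0.12  # USD per 224x224x3 image, as quoted above

# One-off dataset annotation, later reviewed by a human: manageable.
dataset_size = 10_000
print(f"Annotating {dataset_size:,} images once: ${dataset_size * COST_PER_IMAGE:,.0f}")

# Real-time use at 1 frame per second per camera, 24x7: prohibitive.
cameras = 10
frames_per_day = 24 * 60 * 60
daily_cost = cameras * frames_per_day * COST_PER_IMAGE
print(f"{cameras} cameras at 1 fps, 24x7: ${daily_cost:,.0f} per day")
```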

Latency

GPT-4 operates via an inference API, and its response time can vary, sometimes taking several seconds depending on the input and output length. Thus, GPT-4 might not be ideal for computer vision tasks demanding immediate, edge-level response times, such as on smartphones.

Limitations of a hosted GPT-4

As GPT-4 is provided through a hosted API, businesses must be comfortable sending their data to an external service. GPT-4 also won’t work for computer vision tasks that demand offline or on-premises processing.

Ideal use cases of this model

Although this model can perform very well on familiar kinds of images, we must test it extensively in its chat interface before actually using it in the real world. In scenarios where accuracy is paramount, it is always best to use a human annotator, or a hybrid approach where you annotate using the API and then have a human review those annotations.

Let’s look at some quick examples.

Example 1:

Annotating data for a screening model that would run on a phone and check whether the lighting conditions on the face are good enough, based on custom factors that we can specify, before the image is fed to another model or a human for further analysis.

ChatGPT rates the lighting conditions for a specific use case
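
A screening prompt for this kind of task might look something like the sketch below. The wording and the 1–10 scale are my own assumptions, not the exact prompt used in the screenshot; the point is to spell out the custom factors and to constrain the reply to something machine-readable.

```python
# Hypothetical screening prompt for Example 1.
LIGHTING_PROMPT = """\
You are screening selfies before they are sent to a face-analysis model.
Rate the lighting on the face from 1 (unusable) to 10 (ideal), considering:
- even illumination across the face, with no harsh shadows
- no strong backlight or blown-out highlights
- the face is not underexposed
Reply with only the number, nothing else."""
```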

Example 2:

You are trying to train a CCTV-based model that automatically calls a janitor or a robot based on how dirty the floor is. You cannot rely on an LLM to do this directly, as compute costs and response times make it prohibitive to run on multiple cameras 24x7.

ChatGPT estimates the time needed to clean the floor

Example 3:

You are trying to train a model to quickly analyse a person walking into a building.

ChatGPT identifies the various attributes of the woman in the picture, such as the colour of her dress, watch presence, etc.
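
For attribute-style annotations like this one, it helps to ask for a fixed JSON schema so the replies can be parsed directly into your annotation format. The fields below are illustrative, not the ones used in the screenshot.

```python
import json

# Hypothetical attribute-extraction prompt for Example 3.
ATTRIBUTE_PROMPT = """\
Describe the person entering the building using exactly this JSON object:
{"clothing_colour": "<string>", "wearing_watch": true/false, "carrying_bag": true/false}
Reply with the JSON object only."""

def parse_attributes(model_reply: str) -> dict:
    """Parse the model's JSON reply, flagging anything that strays from the schema."""
    attrs = json.loads(model_reply)
    expected = {"clothing_colour", "wearing_watch", "carrying_bag"}
    if set(attrs) != expected:
        raise ValueError(f"unexpected keys: {sorted(attrs)}")
    return attrs
```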

Although these are impressive, the model is not yet infallible, as seen below:

ChatGPT returning the bounding box coordinates of a telephone pole incorrectly
ChatGPT counting the number of dots on the dice incorrectly

How do I use this approach in my application?

  • Upload a few images to ChatGPT, try a few different prompts and see if it lives up to your expectations
  • If it performs with at least 90–95% accuracy, use the ChatGPT API and prompt it correctly to give you a simple output, as seen in the above examples
  • Write a script to convert these outputs into a useful annotation format (a sketch follows this list)
  • Upload these annotations to an annotation platform and ask a human annotator to fix any mistakes
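
As a rough idea of what the conversion script in the third step could look like, here is a minimal sketch that turns the one-word labels from the earlier example into a simple CSV for human review. The file names, label set and CSV layout are assumptions; adapt them to whatever format your annotation platform imports.

```python
import csv
import json
from pathlib import Path

# Assumed input: a JSON file mapping image names to ChatGPT's one-word labels,
# e.g. {"img_001.jpg": "good", "img_002.jpg": "bad"}.
labels = json.loads(Path("chatgpt_labels.json").read_text())

with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label", "needs_review"])
    for image_name, label in labels.items():
        # Flag anything outside the expected label set for the human reviewer.
        writer.writerow([image_name, label, label not in {"good", "bad"}])
```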

Conclusion

Although this may look excellent, it is not a silver bullet. Use it where it fits, and consider the other automation available to you as well.

If you need help with any of your computer vision-related problems, including annotation, send me an email at rishi@serna.ai. We provide end-to-end solutions for all your computer vision needs, including development, deployment and low-cost, high-efficiency annotation.

