Doing Computer Vision Using a Language Model

The year is 2023, and the world has just witnessed the release of GPT-4, a powerful Language Model (LM) from OpenAI capable of a wide range of tasks. At the moment, we don’t know the full scope of this advance’s impact.

The latest announced GPT-4 feature is multi-modality: you can feed images into the model. This raises a question:

Can Language Models understand images without images?

Until you can feed an image into GPT directly, let’s try this: convert images into text.

In fact, everything can be represented as text. You can write code to create images, audio, diagrams, etc. For example: you can convert images into code (like SVG).
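As a rough illustration of the "image as code" idea, here is a minimal sketch that turns a tiny black-and-white pixel grid into SVG markup. The function name `grid_to_svg`, the `cell` size, and the 3x3 example grid are all made up for this sketch; it is not the conversion used in the original experiments.

```python
def grid_to_svg(grid, cell=10):
    """Render a 2D list of 0/1 pixels as SVG markup, one <rect> per dark pixel."""
    height = len(grid)
    width = len(grid[0]) if grid else 0
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width * cell}" height="{height * cell}">'
    ]
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value:  # only dark pixels become shapes
                parts.append(
                    f'<rect x="{x * cell}" y="{y * cell}" '
                    f'width="{cell}" height="{cell}" fill="black"/>'
                )
    parts.append("</svg>")
    return "\n".join(parts)


# A 3x3 diagonal line, purely illustrative.
diagonal = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
svg_text = grid_to_svg(diagonal)
print(svg_text)
```

The resulting string is plain text, so it can be pasted into a prompt like any other sentence.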

Follow the OpenAI API setup steps if you want to run these experiments.

First, let’s see if GPT knows how to draw a line:

Can it do numbers, like in MNIST?

A grayscale image of a digit is converted into 0 and 1 characters and fed into the model. Given some examples as context, it can classify and generate digits.
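The conversion described above might look roughly like this. It is only a sketch: the threshold of 128 and the tiny 5x5 image are assumptions for illustration (real MNIST images are 28x28), and `binarize_to_text` is a hypothetical helper name.

```python
def binarize_to_text(image, threshold=128):
    """Turn a 2D list of 0-255 grayscale pixels into rows of '0'/'1' characters."""
    rows = []
    for row in image:
        rows.append("".join("1" if pixel >= threshold else "0" for pixel in row))
    return "\n".join(rows)


# A made-up 5x5 sketch of the digit 1.
tiny_one = [
    [0,   0, 255,   0, 0],
    [0, 200, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0, 255, 255, 255, 0],
]
text_image = binarize_to_text(tiny_one)
print(text_image)
```

Each row of the image becomes one line of text, so the shape of the digit stays visible in the prompt.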

Can it do objects, maybe FashionMNIST?

The data is a collection of grayscale images of wearables. Each pixel is converted into a digit from 0 to 9 denoting its intensity. The model can classify fashion items and, to a certain extent, generate an item not present in the context.

Thus, even though Language Models only take text inputs, they are able to “visualize”.
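Here is one way the 0 to 9 encoding and the in-context examples could be put together. Treat it as a sketch under assumptions: the equal-width intensity bins, the `Image:` / `Label:` prompt layout, and both helper names are illustrative choices, not the exact format from the notebook.

```python
def quantize_to_digits(image, max_value=255):
    """Map each 0..max_value pixel to one character '0'..'9' (10 equal bins)."""
    rows = []
    for row in image:
        # Integer bin index; min() guards against pixels above max_value.
        rows.append(
            "".join(str(min(9, pixel * 10 // (max_value + 1))) for pixel in row)
        )
    return "\n".join(rows)


def build_few_shot_prompt(examples, query):
    """examples: list of (text_image, label) pairs; query: text image to classify."""
    blocks = [f"Image:\n{text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Image:\n{query}\nLabel:")  # left open for the model to complete
    return "\n\n".join(blocks)


# Tiny made-up patches standing in for 28x28 FashionMNIST images.
patch = quantize_to_digits([[0, 128, 255], [64, 192, 32]])
prompt = build_few_shot_prompt(
    examples=[("009\n099", "Sneaker"), ("909\n999", "Bag")],
    query=patch,
)
print(prompt)
```

The prompt ends with an open `Label:` line, so the model's completion is read as its classification.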

Can it do embeddings?

Let’s try another level. How about we quantize an embedding, convert it into text, and feed it to the model?

It can classify, but its accuracy isn’t great in this zero-shot scenario.

The model can be fine-tuned on the embedding inputs to get better accuracy on a computer vision task.
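A minimal sketch of what "quantize and convert into text" could mean: min-max scale the vector, bucket each value into a small number of levels, and join the codes with spaces. The number of levels, the scaling scheme, and the name `quantize_embedding` are assumptions for illustration; the original experiment may have used a different scheme.

```python
def quantize_embedding(vector, levels=10):
    """Min-max scale a float vector, then bucket each value into 0..levels-1."""
    low, high = min(vector), max(vector)
    span = (high - low) or 1.0  # avoid division by zero for constant vectors
    codes = [min(levels - 1, int((v - low) / span * levels)) for v in vector]
    return " ".join(str(c) for c in codes)


# A made-up 4-dimensional embedding; real ones have hundreds of dimensions.
embedding = [-0.5, 0.0, 0.25, 0.5]
embedding_text = quantize_embedding(embedding)
print(embedding_text)
```

Coarser levels give shorter prompts but throw away more of the embedding, which is one plausible reason zero-shot accuracy suffers.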
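For fine-tuning, the quantized embeddings and their labels have to be written out as training examples. The sketch below assumes the prompt/completion JSONL layout used by OpenAI's legacy fine-tuning endpoint; the separator string, the sample data, and the helper name `to_finetune_records` are illustrative assumptions, so check the current API documentation before relying on this layout.

```python
import json


def to_finetune_records(samples, separator="\n\n###\n\n"):
    """samples: list of (embedding_text, label) pairs -> JSONL string."""
    lines = []
    for embedding_text, label in samples:
        record = {
            "prompt": embedding_text + separator,  # separator marks end of input
            "completion": " " + label,             # leading space before the label
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Made-up quantized embeddings and labels.
jsonl = to_finetune_records([("0 5 7 9", "shoe"), ("9 1 0 4", "bag")])
print(jsonl)
```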

It seems text is multi-modal to a certain extent. We can use Language Models not only for Natural Language Processing but also across various modalities.

TLDR: notebook

