Microsoft has written a GPT-4V manual: 166 pages of detailed explanations, with prompt demos included | Download attached

A 166-page "instruction manual" for GPT-4V, the king of multimodal models, has been released, and it comes from a Microsoft team.

What kind of paper takes 166 pages to write?

It not only evaluates GPT-4V's performance on ten major tasks in detail, covering everything from basic image recognition to complex logical reasoning;

it also teaches a complete set of prompting skills for large multimodal models.

It walks you through writing prompts from scratch, step by step, with answers so clearly laid out that they are easy to grasp at a glance, effectively removing any barrier to using GPT-4V.

It is worth mentioning that the paper comes from an all-Chinese team: all seven authors are Chinese, led by a female principal research manager who has worked at Microsoft for 17 years.

Before releasing this 166-page report, they also took part in research on OpenAI's latest DALL·E 3, so they know the field deeply.

Compared with OpenAI's own 18-page GPT-4V paper, this 166-page "user guide" was immediately treated as must-read material for GPT-4V users:

Some netizens marveled: this is not a paper, it is practically a 166-page book.

Others were already getting nervous after reading it:

Never mind the details of GPT-4V's answers; the potential capabilities this AI demonstrates genuinely scare me.

So, what exactly does Microsoft's "paper" cover, and what "potential" does it reveal in GPT-4V?

What does Microsoft's 166-page report say?

The paper's method for studying GPT-4V boils down to one word: "try".

Microsoft researchers designed a series of inputs covering multiple domains, fed them to GPT-4V, and observed and recorded its outputs.

They then evaluated GPT-4V's ability to complete various tasks and distilled new prompting techniques for using it, covering four major aspects:

1. Usage of GPT-4V:

Five supported input types: images, sub-images, texts, scene texts, and visual pointers.

Three supported capabilities: instruction following, chain-of-thought, and in-context few-shot learning.

For example, here is the instruction-following ability GPT-4V demonstrates after the question is rephrased using chain-of-thought prompting:

2. GPT-4V's performance on 10 major tasks:

open-world visual understanding, visual description, multimodal knowledge, commonsense reasoning, scene text understanding, document reasoning, coding, temporal reasoning, abstract reasoning, and emotion understanding.

Among them are "image reasoning questions" like this one, which take a bit of IQ to solve:

3. Prompting techniques for large multimodal models like GPT-4V:

A new multimodal prompting technique, "visual referring prompting", is proposed: it indicates the task of interest by editing the input image directly and can be combined with other prompting techniques (a rough sketch follows at the end of this overview).

4. Research & implementation potential of multi-modal large models:

The report predicts two kinds of areas that multimodal learning researchers should focus on: deployment (potential application scenarios) and research directions.

For example, here is one of the potential application scenarios the researchers identified for GPT-4V: defect detection:
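To make these usage modes concrete, here is a minimal sketch of visual referring prompting as described above: mark the region of interest directly on the image (here with Pillow), then send the edited image and a plain-text question to a vision-capable GPT-4 model through OpenAI's chat completions API. The file names, pixel coordinates, question, and model name are illustrative assumptions, not details taken from the report.

```python
# Sketch of visual referring prompting: mark the region of interest on the image
# itself, then refer to "the red circle" in the text prompt.
# Assumes Pillow and the OpenAI Python SDK (v1+) with OPENAI_API_KEY set;
# file names, coordinates, and the model name are placeholders.
import base64
from PIL import Image, ImageDraw
from openai import OpenAI

# 1. Edit the input image: draw a red ellipse around the part we care about.
img = Image.open("product_photo.jpg").convert("RGB")   # placeholder input image
draw = ImageDraw.Draw(img)
draw.ellipse([(420, 310), (540, 400)], outline=(255, 0, 0), width=6)
img.save("product_photo_marked.jpg")

# 2. Send the marked image plus a text question that refers to the marking.
with open("product_photo_marked.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Look only at the area inside the red circle. "
                     "Is there a defect there? Describe it step by step."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

The point of the technique is that the marking lives in the pixels themselves, so the text prompt can simply say "the red circle" instead of spelling out coordinates.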

But beyond the new prompting technique and the application scenarios, what everyone cares about most is GPT-4V's true strength.

The "instruction manual" therefore devotes more than 150 pages to demos, detailing GPT-4V's capabilities across different tasks.

Let's take a look at how far GPT-4V's multimodal capabilities have evolved today.

Proficient with specialist images, and able to pick up new skills on the spot

Image recognition

The most basic recognition tasks are of course a piece of cake, such as identifying celebrities from the technology, sports, and entertainment worlds:

It can not only tell who these people are but also interpret what they are doing. In the picture below, for example, Jensen Huang is introducing Nvidia's new graphics card products.

In addition to people, landmark buildings are also a piece of cake for GPT-4V. It can not only determine the name and location, but also give detailed introductions.

△ Left: Times Square in New York; right: Kinkaku-ji in Kyoto

However, the more famous the people and places, the easier they are to recognize, so harder images are needed to show what GPT-4V can really do.

For example, in medical imaging, for the following lung CT, GPT-4V gave this conclusion:

Consolidation and ground-glass opacities are present in multiple areas of both lungs; there may be infection or inflammation in the lungs. There may also be a mass or nodule in the upper lobe of the right lung.

Even without being told what kind of scan it is or which part of the body it shows, GPT-4V can work that out on its own.

GPT-4V correctly identified the image below as a magnetic resonance imaging (MRI) scan of the brain.

It also noticed a large accumulation of fluid and judged it likely to be a high-grade glioma.

A professional review confirmed that GPT-4V's conclusion was entirely correct.

Beyond such "serious" content, GPT-4V has also mastered memes, the "intangible cultural heritage" of contemporary human society.

△ Machine translation, for reference only

Not only can it decode the jokes in memes; GPT-4V can also read the emotions conveyed by real human facial expressions.

Beyond these natural images, text recognition is another important task in machine vision.

Here GPT-4V can handle not only Latin scripts but also languages such as Chinese, Japanese, and Greek.

It can even read handwritten mathematical formulas:

Image reasoning

The demos above, however specialized or hard to parse, still fall under recognition, and recognition is only the tip of the iceberg of GPT-4V's skills.

In addition to understanding the content in the picture, GPT-4V also has certain reasoning capabilities.

At the simple end, GPT-4V can spot the differences between two images (though it still makes some mistakes).

In the following pair of pictures, GPT-4V picked out the differences in the crown and the bow.

Raise the difficulty, and GPT-4V can also solve the figure puzzles found in IQ tests.

The patterns and logical relationships in the three questions above are relatively simple, but things get harder next:

Of course, the difficulty lies not in the figures themselves. Note the fourth text instruction in the image: the figures in the original question are not arranged the way they appear in the picture.

Image annotation

In addition to answering various questions with text, GPT-4V can also perform a range of operations on images.

For example, given a group photo of four AI giants, we ask GPT-4V to draw a box around each person and label them with their name and a brief introduction.

GPT-4V first answered in text and then produced the annotated image:

Dynamic content analysis

Beyond static content, GPT-4V can also analyze dynamic content, although the model is not fed a video directly.

The five images below were taken from a tutorial video on making sushi. GPT-4V's task is to infer, from its understanding of the content, the order in which these frames appear (a sketch of this frame-based setup appears at the end of this subsection).

The same series of pictures can be understood in different ways, which is why GPT-4V also relies on the text prompt when making its judgment.

For example, in the following set of pictures, whether the person is opening or closing the door leads to exactly opposite orderings.

Of course, from the changes in a person's state across several pictures, GPT-4V can also infer what they are doing.

Or even predict what will happen next:
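Because the model takes images rather than video, one way to reproduce this kind of demo is to sample a handful of frames from a clip and send them together in a single message, asking the model to reason about their order. The sketch below does this with OpenCV; the video path, frame count, prompt wording, and model name are assumptions for illustration, not the report's setup.

```python
# Sketch: sample frames from a video with OpenCV and ask GPT-4V to order them.
# Assumes opencv-python and the OpenAI SDK; paths and model name are placeholders.
import base64
import random
import cv2
from openai import OpenAI

def sample_frames(path: str, n: int = 5) -> list[bytes]:
    """Grab n evenly spaced frames from the video and return them as JPEG bytes."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.imencode(".jpg", frame)[1].tobytes())
    cap.release()
    return frames

frames = sample_frames("sushi_tutorial.mp4")  # placeholder video path
random.shuffle(frames)  # present the frames out of order, as in the demo

content = [{"type": "text",
            "text": "These frames come from a cooking tutorial but are shuffled. "
                    "Describe each one, then give their most likely chronological order."}]
for jpg in frames:
    b64 = base64.b64encode(jpg).decode()
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder vision-capable model name
    messages=[{"role": "user", "content": content}],
    max_tokens=400,
)
print(response.choices[0].message.content)
```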

"Learning on the spot"

GPT-4V not only has strong visual skills; crucially, it can pick up new knowledge and apply it immediately.

For example, when GPT-4V is asked to read a car's dashboard, the first answer it gives is wrong:

The researchers then described the reading method to GPT-4V in text, but its answer was still wrong:

Next they showed GPT-4V an example; the answer came closer, but unfortunately the numbers were simply made up.

A single example is admittedly a small sample, but as the number of examples grows (in fact, only one more is added), the effort finally pays off and GPT-4V gives the correct reading.
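For reference, such an in-context few-shot prompt can be assembled by interleaving worked example images with their correct readings before the query image, roughly as sketched below. The file names, example readings, and model name are placeholders invented for illustration, not the figures or numbers used in the report.

```python
# Sketch of in-context few-shot prompting with images: two worked dashboard
# examples (image + correct reading) precede the query image in one message.
# Assumes the OpenAI SDK; file names, readings, and model name are placeholders.
import base64
from openai import OpenAI

def image_part(path: str) -> dict:
    """Turn a local image into an image_url content part (base64 data URL)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

client = OpenAI()
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Read the speed shown on each dashboard."},
        image_part("dashboard_example_1.jpg"),
        {"type": "text", "text": "Example 1: the needle sits just past the 60 mark, "
                                 "so the speed is about 63 mph."},
        image_part("dashboard_example_2.jpg"),
        {"type": "text", "text": "Example 2: the needle sits midway between 80 and 100, "
                                 "so the speed is about 90 mph."},
        image_part("dashboard_query.jpg"),
        {"type": "text", "text": "Now read this dashboard, reasoning the same way."},
    ],
}]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder vision-capable model name
    messages=messages,
    max_tokens=200,
)
print(response.choices[0].message.content)
```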

These are only a fraction of GPT-4V's demonstrated results; it covers more fields and tasks than can be shown here. If you are interested, read the original report.

So, what kind of team is behind this GPT-4V "user guide"?

A Tsinghua alumna leads the team

The paper has seven authors, all of them Chinese; six are core authors.

The project lead, Lijuan Wang, is a principal research manager in Microsoft Cloud & AI.

She graduated from Huazhong University of Science and Technology and received her PhD from Tsinghua University in China. She joined Microsoft Research Asia in 2006 and Microsoft Research in Redmond in 2016.

Her research focuses on deep learning and machine learning for multimodal perceptual intelligence, specifically including vision-language model pre-training, image captioning, object detection, and other AI technologies.
