GPT-4 Technical Report


Predictable Scaling

This technique is used to predict the performance and capabilities of the GPT-4 model before it finishes training. In my understanding, they train smaller models with much less computation (up to 10,000x less compute than GPT-4), evaluate them on different benchmarks/tasks, fit scaling laws to those results, and extrapolate to predict the final performance of the fully trained GPT-4 model.

The following figure shows that such a method can predict the final loss of the GPT-4 model quite accurately.

But some capabilities are quite hard to predict. In the following figure (the hindsight neglect task from the Inverse Scaling Prize), we can see that GPT-4 reverses the trend: performance had been getting worse as models scaled up, but GPT-4 recovers it.

I personally have never heard of such a technique before, but the authors believe it is quite important. I guess they can use it to predict the effectiveness of GPT-4 at low cost, in advance of training, before committing the full compute budget.
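To make the idea concrete, here is a minimal sketch (not OpenAI's actual code) of how one could fit a power-law scaling curve to the final losses of small training runs and extrapolate it to the full compute budget. The functional form L(C) = a * C^(-b) + c and all the numbers below are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, final loss) pairs from small training runs.
# Compute is expressed as a fraction of the full run; values are made up.
compute = np.array([1e-7, 1e-6, 1e-5, 1e-4, 1e-3])
loss = np.array([4.10, 3.35, 2.80, 2.42, 2.15])

# Assumed power law with an irreducible-loss term: L(C) = a * C^(-b) + c
def scaling_law(compute_frac, a, b, irreducible):
    return a * compute_frac ** (-b) + irreducible

# Fit the curve to the small runs; p0 gives a sensible starting point.
params, _ = curve_fit(scaling_law, compute, loss, p0=[0.1, 0.1, 1.5], maxfev=10000)
a, b, irreducible = params

# Extrapolate to the full compute budget (C = 1.0 in these units).
predicted_final_loss = scaling_law(1.0, a, b, irreducible)
print(f"fitted: a={a:.3f}, b={b:.3f}, irreducible={irreducible:.3f}")
print(f"predicted loss at full compute: {predicted_final_loss:.3f}")
```

The interesting part is that the extrapolation is many orders of magnitude beyond the largest run actually fitted, which is exactly the regime the report claims the method handles well for loss (and less well for some capabilities).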

Capabilities

Simulating Exams Designed for Humans

GPT-4 exhibits human-level performance on the majority of these professional and academic exams; for example, it passes a simulated bar exam with a score around the top 10% of test takers.

Traditional Benchmarks Designed for Language Models

On most of the benchmarks, GPT-4 outperforms both the language-model SOTA and the SOTA systems that use benchmark-specific training or prompting, except for DROP, a benchmark that evaluates reading comprehension and arithmetic/discrete reasoning over paragraphs.

Also, interestingly, the researchers translated the MMLU benchmark into a variety of languages using Azure Translate, and found that for the majority of these languages GPT-4 on the translated benchmark outperforms the English-language performance of GPT-3.5 and other prior models on the original benchmark, including for low-resource languages.
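For illustration, here is a rough sketch of what such a translated-evaluation setup could look like. This is not the report's code: the Hugging Face dataset identifier and the translate() stub are my own assumptions.

```python
from datasets import load_dataset

def translate(text: str, target_lang: str) -> str:
    # Placeholder: the report used Azure Translate for this step.
    # For illustration we just return the text unchanged.
    return text

# Load a few English MMLU test items and translate each question and its choices.
mmlu = load_dataset("cais/mmlu", "all", split="test").select(range(5))
target_lang = "sw"  # e.g., Swahili, one of the low-resource languages in the report

translated = []
for row in mmlu:
    translated.append({
        "question": translate(row["question"], target_lang),
        "choices": [translate(c, target_lang) for c in row["choices"]],
        "answer": row["answer"],  # the gold label index stays the same
    })

# `translated` can then be formatted into multiple-choice prompts and scored as usual.
print(translated[0])
```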

Visual Inputs

GPT-4 can process inputs consisting of both images and text and output text. The researchers found that its capabilities over image-and-text inputs are similar to those over text-only inputs, and that the test-time techniques developed for language models (e.g., few-shot prompting, chain-of-thought) are similarly effective when using both images and text.
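As an illustration, here is a minimal sketch of what a mixed image-and-text, chain-of-thought style prompt could look like through the OpenAI chat completions API. The model name and image URL are placeholders, and GPT-4's image input was not generally available when the report came out.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single user turn that mixes an image with a chain-of-thought style instruction.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable chat model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is unusual about this image? Think step by step, "
                         "then give a one-sentence answer."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)
```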

Limitations

GPT-4 still shares limitations with earlier GPT models: it hallucinates facts, has a limited context window, and does not learn from experience.

That said, GPT-4 significantly reduces hallucinations compared to GPT-3.5 models, scoring 19 percentage points higher on OpenAI's internal, adversarially designed factuality evaluations.

GPT-4 also achieves better results on the TruthfulQA benchmark, which tests the model's ability to separate facts from an adversarially selected set of incorrect statements; most of the gains come after RLHF post-training.

The researchers also put a lot of effort into making GPT-4 refuse malicious or inappropriate prompts while still responding to benign ones.
