Unleashing the Power of GPT-4: A Comprehensive Analysis

Rohit Vincent
Published in Version 1
Apr 4, 2023

GPT-4 was released recently by OpenAI, claiming to be the best language model out there. In this article, we take a closer look at the technical details released by OpenAI and at what makes GPT-4 better than previous models. Are there alternatives to GPT-4 when it comes to specific tasks?

Generated using Midjourney

Find out more about GPT-3.5 here and GPT-4 here.

What is GPT-4?

As defined by OpenAI, GPT-4 is a Transformer model pre-trained to predict the next token in a document. It was trained on both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF), which means human feedback was used to teach the model to distinguish good responses from bad ones. So, what can GPT-4 do?
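To make "predict the next token" concrete, here is a minimal sketch using the open-source GPT-2 model from the Hugging Face transformers library as a stand-in. GPT-4's weights are not public, so this only illustrates the mechanism, not GPT-4 itself:

```python
# Illustrating next-token prediction with GPT-2 (a small, public Transformer).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "GPT-4 is a Transformer model pre-trained to predict the next"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits        # shape: (1, sequence_length, vocab_size)

# The logits at the last position are the model's scores for the *next* token.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))      # the model's single most likely continuation
```

Pre-training teaches this next-token skill at scale; RLHF then nudges the model towards continuations that humans actually rate as helpful and correct.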

Remembers conversations better.

GPT-4's base context length of 8,192 tokens is double that of GPT-3.5, and there is also a 32,768-token variant. What does this mean? You can now process around 50 pages of text, compared to the roughly 7 pages GPT-3.5 could handle.

If you wanted to write a story from a one-page description, GPT-3.5 could generate about 6 pages of the story, whereas GPT-4 could generate up to 49 pages.

This also means that GPT-4 can draw on around 49 pages of a previous conversation when answering a question.
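As a rough illustration, here is how you could check whether a document fits into each model's context window using OpenAI's tiktoken tokenizer. The tokens-per-page figure is an assumption (roughly 650 tokens per page of English prose, consistent with the page counts above), and the file name is a placeholder:

```python
# Estimate how much of each model's context window a document would use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by GPT-3.5/GPT-4

text = open("my_document.txt").read()        # placeholder file
n_tokens = len(enc.encode(text))

TOKENS_PER_PAGE = 650                        # rough assumption for English prose
for model, limit in [("gpt-3.5-turbo", 4_096),
                     ("gpt-4", 8_192),
                     ("gpt-4-32k", 32_768)]:
    print(f"{model}: {limit} tokens (~{limit // TOKENS_PER_PAGE} pages), "
          f"document uses {n_tokens} tokens, fits: {n_tokens <= limit}")
```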

In terms of training, the majority of data used to train GPT-4 covers events up to September 2021, the same data cut-off as GPT-3.5. So if you ask GPT-4 about events that occurred after this date, the answer will most likely be made up.

Better at some Exams than GPT 3.5.

GPT-4 has vastly improved in exams such as Biology, Statistics, and the LSAT, scoring higher than roughly 80% of test takers. However, GPT-4 still does not perform well in coding exams such as Leetcode.

GPT-4 scored 3 out of 45 on Leetcode (hard), which suggests that out-of-the-box code generation is still far from perfect without manual intervention or repeated prompting. That said, GPT-4 has improved substantially over GPT-3.5 in coding exams, as the chart below shows, so if you want to write, convert, or explain code, GPT-4 should still prove the more capable option.

Exams where GPT-4 has Improved from GPT-3.5 (Source: GPT-4 Technical Paper)

The chart below shows areas in which GPT-4 shows no improvement over GPT-3.5. It is interesting that scores for GRE Writing and English Language/Literature and Composition have stayed the same, whereas the previous chart shows a drastic improvement in the GRE Quantitative exam, which is more of a reasoning task. This backs up OpenAI's claim that GPT-4 is best used for more complicated reasoning tasks rather than plain language generation, which GPT-3.5 already does pretty well.

Exams where GPT-4 has no improvement from GPT-3.5 (Source: GPT-4 Technical Paper)

When benchmarked against the best state-of-the-art models, GPT-4 outperforms most of them in areas such as commonsense reasoning, Python coding tests, and grade-school mathematics, as shown in the chart below.

Academic Benchmarks for GPT-4, GPT-3.5, and State-of-the-Art Models (Source: GPT-4 Technical Paper)

It is worth pointing out that GPT-4 may not always be the best option: for example, on the DROP reading comprehension and arithmetic benchmark, the state-of-the-art model QDGAT slightly beats GPT-4, scoring an F1 of 88.4 against GPT-4's 80.9.

GPT-4 Remembers!

Another improvement in GPT-4, which implies stronger reasoning abilities, is that it scores 100% on the Hindsight Neglect task.

Extracted from the GPT-4 Technical Paper. Y-Axis shows accuracy with higher values being better.

What is Hindsight Neglect?

Let’s look at this in terms of a poker game. Hindsight neglect happens when a player looks back at a hand they played and forgets how uncertain the decision was at the time. For example, if a player made a risky bet and ended up winning the hand, they might look back and think it was a great decision because they won; had they lost, they would remember feeling very unsure about the very same bet. GPT-3.5 exhibits hindsight neglect, meaning it forgets the uncertainty when the outcome happens to be positive, whereas GPT-4 judges the decision on its merits and does not endorse a risky bet just because it happened to win.

Below are the answers GPT-4 and GPT-3.5 give on an example from the hindsight neglect dataset, tested through ChatGPT Plus:

The Question: David has the option to play a game with a 94% chance to lose $50 and a 6% chance to win $5. David plays and wins $5. Should he have played?

The correct answer: Even though David won, it was a very risky game, and the answer should be No. GPT-4 gets this right!

GPT 3.5 Answers Yes
GPT-4 Answers No
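The arithmetic behind the "No" is straightforward expected value; a quick sketch:

```python
# Expected value of David's gamble: it is negative, so the bet was bad
# regardless of the lucky outcome.
p_lose, loss = 0.94, -50   # 94% chance to lose $50
p_win, win = 0.06, 5       # 6% chance to win $5

expected_value = p_lose * loss + p_win * win
print(round(expected_value, 2))   # -46.7 -> David loses $46.70 per play on average
```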

GPT-4 is a multilinguist

GPT-4 beats GPT-3.5 on English accuracy by around 15 percentage points. But GPT-4 is not only good at English; it can also process and understand many other languages.

Other languages GPT-4 is good at. Sourced from the GPT-4 Technical Paper

GPT-4 is strong in the languages shown in the word cloud above. Its accuracy in each of these languages is higher than the English-language accuracy of earlier models such as GPT-3.5, PaLM, and Chinchilla.

GPT-4 can do more than read. It can really see.

Academic Vision Benchmarks

We don’t have access to the version of GPT-4 that accepts images at the moment, so this analysis is based entirely on OpenAI’s findings. The results do show that GPT-4 is very good at interpreting infographics and other image-based reasoning tasks.

An example of GPT-4’s visual capabilities is shown below:

Input into GPT-4 (source)

GPT-4 can take the above image as input and provide an answer. It generated the output below for the prompt shown in the image (source):

To find the sum of average daily meat consumption for Georgia and Western Asia, we will follow these steps:

1. Identify the average daily meat consumption for Georgia.
2. Identify the average daily meat consumption for Western Asia.
3. Add the values from steps 1 and 2.

Step 1: From the provided chart, we see that the average daily meat consumption for Georgia is 79.84 grams per person per day.
Step 2: From the chart, we see that the average daily meat consumption for Western Asia is 69.62 grams per person per day.
Step 3: Add the values from steps 1 and 2.
79.84 g (Georgia) + 69.62 g (Western Asia) = 149.46 g

So, the sum of average daily meat consumption for Georgia and Western Asia is 149.46 grams per person per day.
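The chart values and the final sum are easy to sanity-check:

```python
# Sanity-check of GPT-4's step 3 arithmetic (values as read from the chart).
georgia = 79.84        # grams per person per day
western_asia = 69.62   # grams per person per day
print(round(georgia + western_asia, 2))   # 149.46
```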

Limitations of GPT-4

Fewer Hallucinations

Like every language model out there, GPT-4 does hallucinate, but it hallucinates less than previous models. GPT-4 was fine-tuned using data from previous models such as ChatGPT to reduce hallucinations. It scores 19 percentage points higher than GPT-3.5 at avoiding open-domain hallucinations, and 29 percentage points higher at avoiding closed-domain hallucinations.

Closed-domain hallucinations occur when the model is given a specific context (for example, a document to summarise or answer questions about) and invents details that are not in that context. Open-domain hallucinations occur when the model confidently generates text about the world that is factually incorrect or unsupported, without reference to any particular input.

Remember the thumbs up or down for each answer in ChatGPT? This would have been used to train GPT-4 to answer better.

Feedback from Users (Thumbs Up or Down)

Not a Doctor or a Lawyer

Language models can generate wrong information and carry inbuilt biases, and GPT-4, like other language models, continues to reinforce social biases and worldviews. OpenAI suggests careful evaluation of performance across different groups wherever informed decision-making is required. To that end, GPT-4 must not be used for high-risk government decision-making or for offering legal or health advice.

Not Skynet Yet.

Generated using Midjourney

A really interesting point OpenAI raises is that large language models such as GPT-4 can show emergent behaviours like “power-seeking”, where a model tries to acquire resources and capabilities beyond what it was given, because doing so is instrumentally useful for a wide range of goals. Given the risks associated with this, OpenAI, in partnership with the Alignment Research Center (ARC), ran a basic evaluation of early GPT-4 models. The ARC team gave GPT-4 the ability to execute code, do chain-of-thought reasoning, and delegate to copies of itself. It did not perform well in their evaluation, especially when asked to self-replicate, but it is certainly something to keep an eye on in the future!

What should you use GPT-4 for?

This is a question that is very specific to the task at hand. Tasks that require complex reasoning are better suited to GPT-4, but for straightforward content generation GPT-3.5 is more cost-efficient: GPT-4 costs roughly 10 times more than GPT-3.5 (additional details on pricing can be found here). Tasks that need a longer context or image analysis can only be handled by GPT-4, with its visual input capability and 32,768-token limit.
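As a minimal sketch of what this choice looks like in practice, here is how you might route a request to either model through the OpenAI Python library (v0.27-style ChatCompletion API); the routing rule is an illustrative assumption, not an official recommendation:

```python
# Route simple generation to GPT-3.5 and complex reasoning to GPT-4.
import openai

openai.api_key = "YOUR_API_KEY"   # placeholder

def ask(prompt: str, needs_complex_reasoning: bool = False) -> str:
    # GPT-4 costs roughly 10x more, so reserve it for harder reasoning tasks.
    model = "gpt-4" if needs_complex_reasoning else "gpt-3.5-turbo"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Write a short product description for a reusable water bottle."))
print(ask("Plan a week-long schedule that satisfies five competing constraints.",
          needs_complex_reasoning=True))
```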

As per the technical paper, a user-based evaluation of prompts and outputs from both GPT-3.5 and GPT-4 found that 70% of users preferred GPT-4’s responses. This shows that GPT-3.5 is not out of the game, with around 30% of users still preferring its output.

GPT-4 shows great promise, especially with Microsoft integrating it into the Bing search engine. More interesting use cases are appearing, such as GitHub’s new Copilot X, an AI-powered assistant for code, and Microsoft 365 Copilot, which writes, edits, summarizes, and creates for you in applications such as Word and Excel. GPT-4 is also being used for accessibility through companies like Be My Eyes.

Here is a video of how the technology is used.

Generated using Midjourney

We at Version 1 are currently working with GPT-4 and also analysing other language models, including Google Bard. Stay tuned for more updates, and visit the Innovation Labs to find out what Version 1 can do for you.
