GPT-4: All Details Leaked

katerinaptrv
3 min readJul 12, 2023

The details behind the training, architecture, and more of the best LLM model have been revealed.

It was posted today on the internet by the people over at SemiAnalysis, behind a $100 paywall: all the details behind GPT-4, including model architecture, training infrastructure, inference infrastructure, parameter count, training data composition, token count, layer count, multimodality, vision adaptation, etc.

But a courageous good soul called Yam Peleg (@YamPeleg) shared it all on Twitter for everyone, for free. The leaked information was taken down due to copyright issues, but the internet is the internet and nothing ever goes completely away.

Now, I am not going to go over every detail, but rather give a general overview of what I found most interesting:

Model Size Comparison

  • GPT-3 has 175 billion parameters.
  • LaMDA has 137 billion parameters.
  • PaLM (Coder/Minerva) has 540 billion parameters.
  • Ernie has 260 billion parameters.

And GPT-4, well, it has 1.8 trillion parameters across 120 layers.

Mixture of Experts (MoE)

  • GPT-4 utilizes MoE: 16 different experts working together, each with ~110B parameters (a toy routing sketch follows this list).
  • Each expert specializes in a specific task or domain.
  • MoE allows language models to scale up efficiently without a proportional increase in compute cost, because only a couple of experts are active for any given token.
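
To make the idea concrete, here is a minimal top-2 routing sketch in PyTorch. The 16 experts and the "route each token to 2 of them" behavior come from the leak's description; the class name, layer sizes, and everything else in the code are my own illustration, not OpenAI's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal Mixture-of-Experts layer: a router picks the top-2 of 16 experts per token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.router(x)                                # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                   # normalize their mixing weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                          # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

# Each token only activates 2 of the 16 experts, so compute grows far slower than parameter count.
layer = ToyMoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```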

Understanding Shared Parameters for Attention

  • About 55 billion parameters are shared across the experts and used solely for attention, guiding the model to stay focused (see the rough parameter math after this list).
  • These shared parameters let the model focus on the important information while generalizing over other details.
  • Attention in AI models is akin to the ADHD medication that helps individuals remember and focus on important details.
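
As a sanity check on how these figures fit together, here is my own back-of-envelope arithmetic (not a calculation from the leak): 16 experts at roughly 110B parameters each plus the ~55B of shared attention parameters lands close to the 1.8 trillion total, while a forward pass that only routes to 2 experts touches a much smaller "active" slice.

```python
# Back-of-envelope parameter accounting (my own arithmetic, not from the leak).
n_experts = 16
params_per_expert = 110e9   # ~110B per expert
shared_attention = 55e9     # ~55B shared attention parameters
experts_per_forward = 2     # top-2 routing

total_params = n_experts * params_per_expert + shared_attention
active_params = experts_per_forward * params_per_expert + shared_attention

print(f"total  ~{total_params / 1e12:.2f}T parameters")        # ~1.8T, matching the reported total
print(f"active ~{active_params / 1e9:.0f}B per forward pass")  # ~275B actually used per token
```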

Training Tokens, Context Length and Batch Size in Model Training

  • GPT-4 was trained on 13 trillion tokens, a vast amount of text data.
  • Context length refers to how much text the model can consider at once: 8K initially, extended to 32K through fine-tuning.
  • Batch size was gradually ramped up over several days of cluster training (an illustrative warm-up schedule follows this list).
  • By the end, OpenAI was using a batch size of 16 million tokens, with each expert seeing only a subset of them.
  • The relationship between batch size and expert routing is not fully explained.
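
The leak does not spell out the exact ramp schedule, so the snippet below only illustrates the general "batch-size warm-up" idea: start small and grow toward the final batch size over the first days of training. The starting value, ramp length, and linear curve are all my assumptions; only the 16-million-token end point comes from the summary above.

```python
# Illustrative batch-size warm-up (the actual schedule OpenAI used is not in the leak).
FINAL_BATCH_TOKENS = 16_000_000   # reported end-of-training batch size, in tokens
START_BATCH_TOKENS = 1_000_000    # assumed starting point
RAMP_STEPS = 10_000               # assumed number of steps spent ramping

def batch_size_at(step: int) -> int:
    """Linearly interpolate from the starting batch size up to the final one."""
    frac = min(step / RAMP_STEPS, 1.0)
    return int(START_BATCH_TOKENS + frac * (FINAL_BATCH_TOKENS - START_BATCH_TOKENS))

for step in (0, 2_500, 5_000, 10_000, 50_000):
    print(step, batch_size_at(step))
```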

Hardware and Training Costs

  • OpenAI utilized around 25,000 Nvidia A100 GPUs for training.
  • The deal with Microsoft allows OpenAI to use Azure’s resources, such as those A100 GPUs, for training.
  • Running an A100 on the cloud costs approximately $1 per GPU-hour.
  • OpenAI trained GPT-4 for 90 to 100 days on the A100s, which puts the estimated cost of the training run at around $63 million (a rough check of that figure follows this list).
  • With today’s equivalent hardware (Nvidia’s H100 Tensor Core GPU), the same training run is estimated at around $21 million to $22 million.
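
A quick back-of-envelope check on those numbers (my arithmetic, not SemiAnalysis’s): 25,000 A100s running around the clock for roughly 95 days at about $1 per GPU-hour already lands in the tens of millions of dollars, the same order of magnitude as the reported ~$63 million.

```python
# Rough GPU-rental cost estimate (my own arithmetic; the leak's ~$63M figure also
# accounts for details this simple product ignores).
gpus = 25_000
days = 95                 # midpoint of the reported 90-100 day run
price_per_gpu_hour = 1.0  # reported ~$1/hour per A100 on Azure

cost = gpus * days * 24 * price_per_gpu_hour
print(f"~${cost / 1e6:.0f} million")  # ~$57 million, same ballpark as the reported ~$63M
```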

Computational Resource Utilization and Failures

  • For GPT-4, MFU (model FLOPs utilization) was reported at 32% to 36%, meaning a significant portion of the available compute was not used effectively (a sketch of how MFU is computed follows this list).
  • During training, there were numerous failures, and they had to restart from a checkpoint multiple times.
  • These failures contributed to the low efficiency of the A100 GPUs.
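
MFU is simply the FLOPs the model actually performs divided by what the hardware could theoretically deliver. The sketch below uses the common "~6 × parameters × tokens" approximation for training FLOPs (with active parameters for an MoE model); the throughput number is a placeholder of mine, not a figure from the leak.

```python
# Model FLOPs Utilization (MFU): achieved model FLOPs / theoretical peak FLOPs.
# Uses the common ~6 * params * tokens approximation for training FLOPs; for an
# MoE model, "params" should be the *active* parameters per token.
def mfu(active_params: float, tokens_per_sec: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6 * active_params * tokens_per_sec  # model FLOPs per second
    peak = n_gpus * peak_flops_per_gpu             # hardware FLOPs per second
    return achieved / peak

# Placeholder throughput purely for illustration; 312e12 is the A100's BF16 peak (~312 TFLOPS).
print(f"{mfu(275e9, 1.5e6, 25_000, 312e12):.0%}")  # lands in the reported ~32% range
```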

Cost Comparison with GPT-3.5 Turbo

GPT-4 reportedly costs about three times more than GPT-3.5 Turbo per prompt.

The Theory About the “Nerfed” ChatGPT

There is speculation that a faster/cheaper model takes over after the first few words of a response, which could explain users’ complaints that ChatGPT has gotten worse.
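
The mechanism usually floated here is speculative decoding: a small, cheap “draft” model proposes several tokens ahead and the big model only verifies them, which can change output quality if the acceptance rule is loosened. Below is a highly simplified greedy-verification sketch of the idea; real speculative decoding uses a probabilistic acceptance test, and nothing here is confirmed to be what OpenAI actually runs.

```python
# Toy greedy speculative decoding: the draft model guesses k tokens ahead, and the
# large model keeps them only while it agrees. (Simplified; real implementations
# accept/reject probabilistically. Purely illustrative, not OpenAI's system.)
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=32):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        guesses, ctx = [], list(out)
        for _ in range(k):                   # draft model speculates k tokens cheaply
            tok = draft_next(ctx)
            guesses.append(tok)
            ctx.append(tok)
        for tok in guesses:                  # target model verifies them in order
            if target_next(out) == tok:
                out.append(tok)              # accepted: the big model agrees
            else:
                out.append(target_next(out)) # rejected: take the big model's token instead
                break
    return out

# Tiny stand-in "models" that just increment the last token, for demonstration:
print(speculative_decode(lambda c: c[-1] + 1, lambda c: c[-1] + 1, [0], max_new=8))
```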

Here is the link to the original source of the leak:

For a complete review, I recommend this YouTube video, which is where I learned about it:

Well, what did you guys think about this info, and how will it affect the AI race in the future?

AI since the release of ChatGPT in Nov/22
