Ayush Raj
2 min read · Jul 12, 2023

GPT-4 Details Leaked

It’s over: the GPT-4 details have been leaked.

Scale and Architecture-

With approximately 1.8 trillion parameters spread across 120 layers, GPT-4 operates at massive scale, exceeding GPT-3’s parameter count by more than 10 times. To rein in the computational cost of a dense model, OpenAI implements a Mixture of Experts (MoE) architecture, dividing GPT-4 into 16 smaller expert models of around 111 billion parameters each. During inference, only 2 of these experts activate per forward pass, requiring roughly 560 TFLOPs instead of the ~3,700 TFLOPs a dense 1.8-trillion-parameter model would need.
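To make the "only 2 experts run per token" idea concrete, here is a minimal sketch of top-2 MoE routing with a simple softmax gate. The shapes, gating scheme, and expert functions are illustrative assumptions, not OpenAI's actual implementation.

```python
import numpy as np

def top2_moe_layer(x, gate_W, experts):
    """Route each token to its top-2 experts and mix their outputs.

    x:        (tokens, d_model) activations
    gate_W:   (d_model, n_experts) gating weights
    experts:  list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_W                          # (tokens, n_experts) gate scores
    top2 = np.argsort(logits, axis=-1)[:, -2:]   # indices of the 2 best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        picked = logits[t, top2[t]]
        weights = np.exp(picked - picked.max())
        weights /= weights.sum()                 # softmax over the 2 selected experts only
        for w, e_idx in zip(weights, top2[t]):
            out[t] += w * experts[e_idx](x[t])   # only 2 of n_experts ever run per token
    return out

# Toy usage: 16 "experts", each a random linear map; only 2 execute per token.
rng = np.random.default_rng(0)
d, n_experts, tokens = 64, 16, 4
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d, d)) * 0.02)
           for _ in range(n_experts)]
gate_W = rng.normal(size=(d, n_experts)) * 0.02
x = rng.normal(size=(tokens, d))
print(top2_moe_layer(x, gate_W, experts).shape)  # (4, 64)
```

The per-token compute scales with the 2 active experts rather than all 16, which is where the 560 vs. 3,700 TFLOPs gap comes from.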

Training Process-

Training data totaled roughly 13 trillion tokens drawn from diverse corpora like CommonCrawl plus custom in-house datasets. To handle training at this scale, GPT-4 leveraged advanced parallelism techniques: 8-way tensor parallelism split individual layers across GPUs, while 15-way pipeline parallelism partitioned the model into stages that micro-batches flowed through. Running on approximately 25,000 A100 GPUs, training lasted 90-100 days and likely cost up to $63 million based on cloud computing rates.
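A quick back-of-envelope check on that figure, assuming a cloud rate of roughly $1 per A100 GPU-hour (an assumed price, not part of the leak), lands in the same ballpark:

```python
# Back-of-envelope training cost, assuming ~$1 per A100 GPU-hour (assumed rate).
gpus = 25_000
days = 95                       # midpoint of the reported 90-100 day run
rate_per_gpu_hour = 1.0         # USD; hypothetical cloud pricing
hours = days * 24
cost = gpus * hours * rate_per_gpu_hour
print(f"${cost / 1e6:.0f} million")   # -> $57 million, same order as the ~$63M estimate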

Inference Costs and Optimization-

Despite optimizations like MoE, GPT-4’s per-query cost remains about three times that of the smaller 175-billion-parameter Davinci model. This stems largely from GPT-4’s heavier infrastructure demands, including larger server clusters running at lower utilization. Speculative decoding shows promise for reducing costs: a smaller draft model predicts several tokens ahead, and the full GPT-4 verifies them in a single batched pass, accelerating throughput.
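Here is a simplified greedy sketch of that accept/reject loop. The `draft` and `target` callables are hypothetical stand-ins for the small and large models; this illustrates the general speculative decoding idea, not OpenAI's production system.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=32):
    """Greedy speculative decoding sketch.

    draft(tokens, n) -> n proposed next tokens (cheap model)
    target(tokens)   -> greedy next-token choice at every prefix position,
                        computed in one batched forward pass (expensive model)
    Both are hypothetical callables standing in for real models.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        proposed = draft(tokens, k)                  # k cheap guesses
        verified = target(tokens + proposed)         # one big-model pass checks them all
        accepted = []
        for i, tok in enumerate(proposed):
            if verified[len(tokens) + i - 1] == tok:  # big model agrees with the guess
                accepted.append(tok)
            else:
                accepted.append(verified[len(tokens) + i - 1])  # fall back to its token
                break
        else:
            accepted.append(verified[-1])            # all k accepted: one bonus token free
        tokens.extend(accepted)
    return tokens

# Toy stand-ins: draft and target follow the same "+1" pattern so every guess is accepted.
draft = lambda toks, n: [(toks[-1] + i + 1) % 50 for i in range(n)]
target = lambda toks: [(t + 1) % 50 for t in toks]
print(speculative_decode(target, draft, prompt=[1, 2, 3], k=4, max_new=8))
```

The large model runs once per batch of k drafted tokens instead of once per token, which is where the throughput gain comes from.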

Multimodal Capabilities-

In addition to text, GPT-4 incorporates a separate vision encoder to enable multimodal applications, with an architecture that builds on models like Flamingo. After pre-training the text foundation, GPT-4 underwent further fine-tuning on roughly 2 trillion additional tokens for vision. This enables capabilities like reading web pages and transcribing text from images and video.
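A Flamingo-style design typically injects visual features into the language model via cross-attention. The sketch below shows that mechanism in its simplest single-head form; the shapes, projections, and single head are illustrative assumptions rather than GPT-4's actual wiring.

```python
import numpy as np

def cross_attention(text_h, image_feats, Wq, Wk, Wv):
    """Single-head cross-attention: text tokens attend to image features.

    text_h:      (T, d) hidden states from the language model
    image_feats: (N, d) patch embeddings from a separate vision encoder
    """
    Q = text_h @ Wq                        # queries come from the text stream
    K = image_feats @ Wk                   # keys/values come from the vision encoder
    V = image_feats @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over image patches
    return attn @ V                        # visual information mixed into each text token

d, T, N = 64, 8, 16
rng = np.random.default_rng(1)
out = cross_attention(rng.normal(size=(T, d)), rng.normal(size=(N, d)),
                      *(rng.normal(size=(d, d)) * 0.02 for _ in range(3)))
print(out.shape)  # (8, 64)
```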

Mixture of Experts Tradeoffs-

While it brings training costs down, MoE introduces tradeoffs at inference time. With only a subset of experts active for any given token, the unused experts still occupy memory while sitting idle, so hardware utilization suffers. Research also suggests that adding more experts can lower training loss, but individual experts tend to over-specialize. OpenAI conservatively opted for just 16 experts, potentially sacrificing some accuracy in exchange for better generalization.

Questions Around Training Data-

While GPT-4 trained partly on known datasets like CommonCrawl, the full composition of its training corpus remains unclear. Unverified rumors suggest parts originated from restricted sources like research paper repositories and code hosting sites. GPT-4’s broad academic knowledge has fueled speculation that textbook content, synthesized into instructional prompts, made up a significant share. More data diversity likely remains key to further gains.