Nvidia GTC: Early Takeaways about the Future of Chips and AI

Abheek Gulati
High Tech Accessible
Mar 21, 2024

Dear readers, having watched Jensen’s GTC keynote, I just wanted to share my takeaways and thoughts. Please correct me or add to the below wherever applicable:

  1. Moore’s Law is truly dead: In the two years since Hopper H100 was announced, TSMC has been unable to sufficiently mature its new 3nm node. Apple uses 3nm for its small mobile SoCs, but the node clearly isn’t yet yielding well enough for large chips such as the B100 announced today: B100 is built on TSMC’s 4NP process, a refinement of the 4N process behind H100, rather than on 3nm. This is unusual. H100 packs 80B transistors into a ~800mm2 die, and B100 appears slightly larger, still around 800mm2 per die but with 104B transistors (see the back-of-the-envelope density check after this list). 4NP has clearly matured enough to support slightly larger dies and somewhat higher density, but the traditional scaling we used to get from node shrinks is completely missing.
  2. Nvidia finally goes all-in on chiplets: For the longest time, Nvidia refused to put multiple chips into a single package, preferring to fab monster 700–800mm2 dies with poor yields, while AMD and Intel long ago adopted multi-chip module (MCM) chiplet designs to scale performance per package. B100 is now two ~800mm2 dies totaling ~210B transistors, an exciting change! Which brings me to my next point…
  3. Advanced packaging will shape the future of chip advancements: As traditional transistor scaling slows, chip designers will increasingly rely on advanced packaging, such as multi-chip module (MCM) chiplet designs and die stacking, to continue scaling performance. In addition to chiplets, Nvidia’s new B100 uses TSMC’s advanced 2.5D CoWoS (Chip-on-Wafer-on-Substrate) packaging to build these monstrous super-chips in a single package. AMD already ships 3D-stacked cache, and Intel has its own technology, “Foveros,” to compete with TSMC’s 3D packaging solutions. Expect much more of this in the future.
  4. Scalability via software: Not only is B100 a two-die package, Jensen says B100 is a “platform,” wherein each board features two B100 packages for a total of four compute dies. This will make comparisons with H100 skewed, to say the least! Regardless, the new GB200 NVL72 system is insane, featuring 30TB of fast memory and many chips that will all appear as one GPU to CUDA! This demonstrates how crucial advanced software is to scalability: presenting a massive rack-scale system of many multi-die packages as a single GPU with a unified 30TB memory pool is very impressive indeed. Expect this system to cost double-digit millions, though.
  5. Classic Nvidia misdirection with misleading charts: Dramatic charts show B100 to be orders of magnitude faster than H100, all while comparing H100’s FP16 performance against B100’s newly supported FP4 performance! These are completely different metrics, and such charts are misleading.
  6. No mention of FP16, FP32 and FP64 performance: I saw no mention of half-, single- or double-precision floating-point performance, which remain the dominant FP formats. Such avoidance has traditionally been a bad sign, as it implies the gen-over-gen gains are unimpressive. All comparisons were instead quoted in B100’s FP8 and FP4 performance, which is a vastly different metric.
  7. New FP formats welcome: Research shows that training in FP8 dramatically improves performance by cutting training time, with virtually no loss of accuracy despite the step down from the typical 16-bit formats. Native support for FP8 (already present on Hopper) and now even FP4 is great, but these formats are not yet widely used to train models, so the benefits may be a way off, which makes me even more annoyed at those charts!* (See the short FP8 sketch after this list.)
  8. Massive memory: Needless to say, this is very welcome.
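
To make point 1 concrete, here is a quick back-of-the-envelope density check in Python, using the approximate figures quoted above. The die areas are estimates from keynote coverage, not official numbers, so treat the result as illustrative only:

```python
# Rough transistor-density comparison using the approximate figures above.
# Die areas are estimates; the exact numbers are illustrative, not official.
h100_transistors, h100_area_mm2 = 80e9, 800    # Hopper H100, TSMC 4N, ~800 mm^2
b100_transistors, b100_area_mm2 = 104e9, 800   # one Blackwell die, TSMC 4NP, ~800 mm^2

h100_density = h100_transistors / h100_area_mm2
b100_density = b100_transistors / b100_area_mm2

print(f"H100:           ~{h100_density / 1e6:.0f}M transistors/mm^2")
print(f"B100 (per die): ~{b100_density / 1e6:.0f}M transistors/mm^2")
print(f"Density gain:   ~{(b100_density / h100_density - 1):.0%}")
```

A roughly 30% density bump from a refined 4nm-class process is real progress, but it is far smaller than the jump a successful full node shrink traditionally delivers.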

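And to give a feel for point 7, here is a minimal Python sketch of how coarse FP8 really is. It rounds values to a toy E4M3-style format (1 sign, 4 exponent, 3 mantissa bits); it ignores subnormals, NaN handling and other details of the real OCP FP8 spec, so it is purely illustrative:

```python
import math

def quantize_e4m3(v: float) -> float:
    """Round v to the nearest value in a simplified FP8 E4M3-style format
    (1 sign, 4 exponent, 3 mantissa bits, bias 7). Subnormals and special
    values are ignored -- this is an illustration, not a spec-accurate codec."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = abs(v)
    e = math.floor(math.log2(mag))
    e = max(min(e, 8), -6)                 # clamp to the normal exponent range
    step = 2.0 ** (e - 3)                  # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(mag / step) * step, 448.0)  # 448 is the E4M3 max

for x in [0.1234, 1.7, 3.14159, 100.0]:
    q = quantize_e4m3(x)
    print(f"{x:>10.5f} -> {q:>10.5f}  (relative error {abs(q - x) / abs(x):.2%})")
```

Per-value rounding error of a few percent turns out to be tolerable for much of training (as the FP8 paper below reports), but it is also exactly why FP8 or FP4 throughput figures should not be compared directly against FP16 numbers.
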
TL;DR — Nvidia is going to make even more money and AI is going to be even more scary in the very near future.

*Here’s a paper about FP8: https://arxiv.org/abs/2310.18313

Reddit thread: https://www.reddit.com/r/LocalLLaMA/comments/17jjopb/fp8lm_training_fp8_large_language_models/

Abstract:

In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats without compromising model accuracy and requiring no changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It gradually incorporates 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 42% reduction in real memory usage but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 17%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at aka.ms/MS.AMP.
