NVIDIA Presents a Brand-New Generation of GPUs! A Deep Dive

GPUnet
5 min read · Mar 22, 2024


This week at GTC 2024, Nvidia revealed its newest GPU lineup, named Blackwell. It brings major changes in hardware architecture, design, and pricing.

The initial release from the Blackwell line, the GB200, is anticipated to be available later this year. Following the huge success of Nvidia's "Ampere" and "Hopper" series, the Blackwell generation, built around two reticle-sized GPU dies per chip, is set to take the stage.

In 2023, Nvidia held an impressive 82% share of the GPU market. The company's revenue skyrocketed by 265% compared to the previous year, driven primarily by strong sales of AI chips designed for servers, notably the "Hopper" series, such as the H100. The latest chips could drive revenue to even more extraordinary levels.

B100 Configuration

The B100 is a discrete accelerator designed to significantly boost computing performance. It uses an 8Gbps HBM3E memory clock, ensuring fast data processing. With a memory bandwidth of 8TB per second, it can move large amounts of data efficiently. The device comes with 192GB of VRAM, split into two 96GB partitions, making it capable of handling large models and datasets.
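As a sanity check, the quoted 8TB/s follows directly from the memory clock if we assume a 1024-bit interface per HBM3E stack and eight stacks in total; that bus width is our assumption, not something stated in the article or broken out by Nvidia here:

```python
# Back-of-envelope check: HBM3E pin speed x bus width -> bandwidth.
# Assumption (not from the article): 8 HBM3E stacks x 1024-bit each,
# i.e. an 8192-bit total memory bus.
PIN_SPEED_GBPS = 8          # 8 Gb/s per pin (the quoted memory clock)
BUS_WIDTH_BITS = 8 * 1024   # assumed 8 stacks x 1024 bits

bandwidth_bytes = PIN_SPEED_GBPS * 1e9 * BUS_WIDTH_BITS / 8  # bits -> bytes
print(f"~{bandwidth_bytes / 1e12:.1f} TB/s")  # ~8.2 TB/s, matching the quoted 8 TB/s
```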

For connectivity, the B100 includes NVLink 5, supporting data transfers of up to 1800GB per second, and PCIe 6.0, which adds another 256GB per second. This ensures fast communication with other components. The accelerator is built from 208 billion transistors (2x104B across its two dies).
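To put those link speeds in perspective, here is a rough, illustrative calculation of how long it would take to move one of the 96GB memory partitions over each link at the quoted peak rates; real transfers carry protocol overhead and would be slower:

```python
# Rough illustration: moving a 96GB HBM partition over each link
# at the peak bandwidths quoted above.
NVLINK5_GBS = 1800   # GB/s, NVLink 5
PCIE6_GBS = 256      # GB/s, PCIe 6.0
payload_gb = 96      # one of the B100's two 96GB memory partitions

for name, bw in [("NVLink 5", NVLINK5_GBS), ("PCIe 6.0", PCIE6_GBS)]:
    print(f"{name}: {payload_gb / bw * 1000:.0f} ms")  # ~53 ms vs ~375 ms
```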

Operating at a total power draw of 700W, the B100 is relatively energy-efficient for its class, thanks in part to the TSMC 4NP manufacturing process. It uses the SXM-Next interface and is based on the Blackwell architecture, making it a powerful tool for computing operations that demand high efficiency and performance. The B100 is compatible with air cooling, offering flexibility in deployment without the need for advanced cooling systems.

B200 Configuration

The B200 and B100 models are remarkably similar in their specifications, both engineered as discrete accelerators for advanced computing tasks. Each is equipped with an 8Gbps HBM3E memory clock, ensuring rapid data processing, and boasts a memory bandwidth of 8TB per second, paired with 192GB of VRAM divided into two 96GB partitions. This configuration is well suited to complex tasks that require extensive data movement.

In terms of Tensor operations, essential for mathematical processing and deep learning tasks, the B200 delivers 9 PFLOPS of FP4 dense Tensor compute, 4.5 PFLOPS of INT8/FP8 dense Tensor compute, 2.2 PFLOPS of FP16 dense Tensor compute, 1.1 PFLOPS of TF32 dense Tensor compute, and 40 TFLOPS of FP64 dense Tensor compute. This is where the B200 holds a clear advantage over the B100: its larger power budget lets it sustain higher Tensor throughput. That headroom makes the B200 the preferable choice for those seeking maximum efficiency and power in advanced computing tasks, especially in AI and deep learning fields.
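A pattern worth noting: the quoted dense Tensor figures roughly halve each time the operand width doubles, since the same Tensor cores are split across wider datatypes (FP64 runs on separate units and does not follow the pattern). A quick sketch of that relationship, using only the numbers above:

```python
# Dense Tensor throughput quoted for the B200, in PFLOPS.
quoted = {"FP4": 9.0, "INT8/FP8": 4.5, "FP16": 2.2, "TF32": 1.1}
fp4_pflops = 9.0

# Each step down the list doubles the operand width, halving throughput.
for halvings, (fmt, pflops) in enumerate(quoted.items()):
    derived = fp4_pflops / 2 ** halvings
    print(f"{fmt}: quoted {pflops} PFLOPS, derived {derived:.2f} PFLOPS")
# FP16 and TF32 come out to 2.25 and 1.12, matching the rounded 2.2 and 1.1.
```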

Both models also feature robust connectivity options, including NVLink 5 at 1800GB/sec and PCIe 6.0 at 256GB/sec, ensuring seamless data transfer capabilities. With a GPU transistor count of 208 billion (2x104B), they demonstrate the same high level of technological sophistication in their design. Where they differ is power: the B200 draws a total of 1000W against the B100's 700W, with both built on the energy-efficient TSMC 4NP manufacturing process.

Employing the SXM-Next interface and built on the Blackwell architecture, the B200 and B100 are designed to meet the demands of cutting-edge computing tasks. Comparing the two, their similarities extend across most technical aspects; the main trade-off is that the B200's higher power budget buys it greater Tensor throughput, while the B100 accepts lower peak performance in exchange for its 700W envelope and simpler cooling.

GB200 Configuration

The GB200 features the Grace Blackwell Superchip, running its HBM3E memory at a fast 8Gbps clock for efficient data processing. It offers a huge combined memory bandwidth of 2x8TB per second, capable of handling large data volumes. With 384GB of VRAM configured as 2x2x96GB, it is built for demanding high-end computing tasks.

Connectivity is robust, including 2x NVLink 5 interfaces at 1800GB per second each and 2x PCIe 6.0 connections adding 256GB per second each. The core of the GB200 consists of two Blackwell GPUs totaling 416 billion transistors (2x208B), paired with a Grace CPU, indicating high complexity and power. Fabricated on the same TSMC 4NP manufacturing process, the package has a total power draw of 2700W.
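The GB200's headline numbers fall straight out of doubling the per-GPU figures quoted in the B100/B200 sections above, as this small sketch illustrates:

```python
# Deriving the GB200 aggregates from a single Blackwell GPU's specs,
# as listed in the earlier sections.
PER_GPU = {"vram_gb": 192, "bandwidth_tbs": 8, "transistors_b": 208}
N_GPUS = 2  # the GB200 pairs two Blackwell GPUs with one Grace CPU

totals = {k: v * N_GPUS for k, v in PER_GPU.items()}
print(totals)  # {'vram_gb': 384, 'bandwidth_tbs': 16, 'transistors_b': 416}
```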

Given its high TDP of 2700W, liquid cooling is a must for the GB200, ensuring it operates efficiently and safely under heavy loads.

The GB200 uses a Superchip interface and the Grace + Blackwell architecture, making it a powerful option for advanced computing needs, offering class-leading performance and efficiency.

Jensen Huang, CEO of Nvidia, described a scenario highlighting the efficiency and power of different GPU generations in training a GPT-4-scale model with 1.8 trillion parameters. Training such a model on A100 GPUs would require a massive setup of 25,000 units and take approximately 3 to 5 months to complete. In contrast, utilizing H100 GPUs for the same task significantly reduces the hardware requirement to 8,000 GPUs, with the training time narrowing down to around 3 months. The B100 further optimizes this process, needing only 2,000 GPUs to achieve the same goal within a similar timeframe of about 3 months. This comparison demonstrates the potency of the new GPU technology, showcasing the B100's superior efficiency and capability in handling extensive AI model training within a condensed period.
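A quick back-of-envelope tally of total GPU-months makes the scale of the improvement concrete; for this rough estimate, the A100's "3 to 5 months" is taken as approximately 4:

```python
# GPU-months implied by the quoted cluster sizes and training times.
# A100's "3 to 5 months" is approximated as ~4 months here.
scenarios = {"A100": (25_000, 4), "H100": (8_000, 3), "B100": (2_000, 3)}

for gpu, (count, months) in scenarios.items():
    print(f"{gpu}: {count * months:,} GPU-months")
# A100: 100,000 / H100: 24,000 / B100: 6,000
# -> the B100 cluster uses roughly 17x less GPU-time than the A100 cluster.
```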

Our Official Channels:

Website | Twitter | Telegram | Discord
