Nvidia vs Intel: Analyzing Intel’s AI Accelerator Gaudi 3

GPUnet
4 min read · Jun 13, 2024


Intel has finally made a serious move in AI, one that could turn the company into a strong rival to Nvidia. In April, at Intel’s Vision 2024 event in Phoenix, Arizona, the company revealed the initial architectural details of its third-generation AI accelerator, Gaudi 3.

Earlier versions of Gaudi drew some criticism, which Intel says has been addressed in the newest version. With Gaudi 3, the focus is on performance with LLMs, where the company claims a significant improvement. However, there’s a lot of talk about Nvidia’s upcoming GPU, the Blackwell B200, looming in the background. This newest version of Gaudi is set to launch in the third quarter of 2024, and Intel is already sending samples to customers.

Gaudi Chip Generations: A Side-by-Side Comparison

Gaudi 3 builds on its predecessor Gaudi 2’s foundation, even taking it a step further in some aspects. Instead of Gaudi 2’s single chip setup, Gaudi 3 consists of two identical silicon dies connected by a high-speed link, literally doubling down on its predecessor’s architecture. Each die features a 48-megabyte cache memory in its central area. Around this core, there are four engines dedicated to matrix multiplication and 32 programmable units known as tensor processor cores. These components are interconnected with memory, and the chip is topped off with media processing and network infrastructure.

Intel has advanced from TSMC’s 7nm process used in Gaudi 2 to the newer 5nm process for Gaudi 3. This shift has allowed for some hardware enhancements, with Gaudi 3 now boasting 4 Matrix Math Engines and 32 tensor processor cores, up from 2 Matrix Math Engines and 24 tensor processor cores in Gaudi 2. Despite these changes, the tensor processor cores in Gaudi 3 are presumed to remain similar to those in Gaudi 2, still being 256-byte-wide VLIW SIMD units.

According to Intel, this configuration results in double the AI compute power compared to Gaudi 2, leveraging the efficiency of 8-bit floating point infrastructure, which is crucial for training transformer models. Moreover, computations using the BFloat16 number format see a fourfold increase in performance.
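The two multipliers Intel quotes imply something worth spelling out: if FP8 throughput doubled but BFloat16 throughput quadrupled, then BF16 gained an extra factor of two relative to FP8 compared with Gaudi 2. A minimal sketch of that arithmetic, with Gaudi 2’s FP8 rate normalized to 1.0 (an illustrative assumption, not an official TFLOPS figure):

```python
# Normalize Gaudi 2's FP8 throughput to 1.0 (illustrative only).
g2_fp8 = 1.0
g3_fp8 = 2.0 * g2_fp8   # Intel's claim: roughly 2x FP8 compute vs Gaudi 2
g3_bf16_gain = 4.0      # Intel's claim: roughly 4x BF16 performance vs Gaudi 2

# How much more did BF16 improve than FP8 did?
extra_bf16_gain = g3_bf16_gain / (g3_fp8 / g2_fp8)
print(extra_bf16_gain)  # 2.0
```

In other words, the gap between BF16 and FP8 throughput narrows on Gaudi 3, assuming both multipliers are measured against the same Gaudi 2 baseline.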

Gaudi 3 LLM Performance Compared with Nvidia’s Hopper Series

Throughout the lifespan of the Gaudi accelerators, Intel has chosen to emphasize the performance of the chips rather than solely focusing on specifications, and this strategy remains consistent with Gaudi 3. With most attendees at Vision being business clients, Intel hopes to impress with performance figures based on benchmarks that showcase what Gaudi 3 can achieve.

It’s worth noting that the Gaudi team has taken a direct approach against NVIDIA by comparing against NVIDIA’s own benchmarks and results. Essentially, Intel’s performance figures for Gaudi 3 are set against NVIDIA’s own reported numbers, so NVIDIA’s hardware isn’t being run in scenarios biased against it. However, it’s important to understand that these figures are projections and not actual measurements from assembled systems (it’s unlikely Intel has 8192 Gaudi 3s available for testing).

In comparison to the H100, Intel suggests that Gaudi 3 could surpass it by up to 1.7 times when training Llama 2 13B in a 16-accelerator cluster at FP8 precision. Even though the H100 is nearly two years old, outperforming it significantly in training would be a notable achievement for Intel if proven true.

Additionally, Intel anticipates Gaudi 3 delivering 1.3 to 1.5 times the inference performance of H200/H100, notably with up to 2.3 times the power efficiency.
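Taken together, those two claims imply a power figure. If a chip delivers 1.5x the throughput at 2.3x the performance per watt, it must be drawing about 65% of the H100’s power. A rough sketch of that arithmetic, combining Intel’s best-case numbers (an assumption on my part; the throughput and efficiency figures may come from different benchmarks, so treat this as illustrative only):

```python
# Best-case claims from Intel's projections (assumed to apply to the
# same workload, which the source does not guarantee).
perf_vs_h100 = 1.5            # up to 1.5x inference throughput
perf_per_watt_vs_h100 = 2.3   # up to 2.3x power efficiency

# power = performance / (performance per watt), so relative power is:
relative_power = perf_vs_h100 / perf_per_watt_vs_h100
print(f"Implied power vs H100: {relative_power:.2f}x")  # ~0.65x
```

If the claims hold on the same workload, that would be a meaningful operating-cost argument for dense inference deployments.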

However, the details matter. Intel occasionally falls short of the H100 in certain inference workloads, particularly those without 2K-token outputs, so Gaudi 3 doesn’t achieve complete dominance. Moreover, there are other benchmark results that Intel doesn’t highlight.

To Intel’s credit, they are one of the few major hardware manufacturers providing MLPerf results lately. Regardless of Gaudi 3’s actual performance (and Gaudi 2’s current performance), Intel has been transparent in publishing results for industry standard tests.

And when we think about Moore’s Law, the main question is what technology the next version of Gaudi, named Falcon Shores, will use. Until now, the product line has used TSMC technology while Intel develops its foundry business. However, next year Intel will start offering its 18A process to foundry customers and will already be using 20A internally. These two nodes bring the next generation of transistor technology, nanosheet transistors, along with backside power delivery. TSMC isn’t planning to offer this combination until 2026.

Our Official Channels:

Website | Twitter | Telegram | Discord

