Intel’s Gaudi 3 AI Chip Promises 40% Faster Training, Outperforming Nvidia’s H100 – Says IEEE


Intel has revealed the architectural details of its third-generation AI accelerator, Gaudi 3, at its Vision 2024 event. The new chip doubles the AI compute of its predecessor, Gaudi 2, and offers a fourfold boost for computations using the BFloat16 number format. Intel claims Gaudi 3 can train the GPT-3 175B large language model 40% faster than Nvidia’s H100 chip. The company also projects significant power-efficiency improvements. However, Nvidia’s next GPU, the Blackwell B200, is expected to launch later this year. Eitan Medina, chief operating officer of Intel’s Habana Labs, highlighted the importance of power efficiency in data centers.

Intel’s Gaudi 3: A New Contender in AI Accelerator Chips

Contrary to what many might believe, the world of AI accelerator chips is not solely dominated by Nvidia. Intel, a major player in the tech industry, recently unveiled the architectural details of its third-generation AI accelerator, Gaudi 3, at the Vision 2024 event in Phoenix, Arizona. The new chip is Intel’s latest attempt to compete with Nvidia’s top chip, the H100, and its upcoming GPU, the Blackwell B200.

The Evolution of Gaudi Architecture

Gaudi 3 builds upon the architecture of its predecessor, Gaudi 2, in several significant ways. Unlike Gaudi 2, which was a single chip, Gaudi 3 consists of two identical silicon dies connected by a high-bandwidth link. Each die features a central region of 48 megabytes of cache memory, surrounded by four engines for matrix multiplication and 32 programmable units known as tensor processor cores. These cores are then encased by memory connections, with media processing and network infrastructure at one end.
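Because each Gaudi 3 package pairs two identical dies, the chip-level totals follow directly from the per-die figures above. A minimal sketch, assuming the per-die counts stated in this article:

```python
# Per-die resources of Gaudi 3, as described in this article.
DIES_PER_CHIP = 2
CACHE_MB_PER_DIE = 48   # central cache memory per die
MME_PER_DIE = 4         # matrix-multiplication engines per die
TPC_PER_DIE = 32        # tensor processor cores per die

# Aggregate the per-die figures up to the full two-die package.
chip_totals = {
    "cache_MB": DIES_PER_CHIP * CACHE_MB_PER_DIE,
    "matrix_engines": DIES_PER_CHIP * MME_PER_DIE,
    "tensor_cores": DIES_PER_CHIP * TPC_PER_DIE,
}
print(chip_totals)
# {'cache_MB': 96, 'matrix_engines': 8, 'tensor_cores': 64}
```

The high-bandwidth die-to-die link is what lets software treat these pooled resources as a single accelerator.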

According to Intel, this design allows Gaudi 3 to deliver double the AI compute of Gaudi 2 using 8-bit floating-point infrastructure, which is crucial for training transformer models. Additionally, it provides a fourfold increase in computations using the BFloat16 number format.

Gaudi 3’s Performance in Large Language Models (LLMs)

Intel’s projections indicate that Gaudi 3 can train the GPT-3 175B large language model 40 percent faster than the H100. The chip also shows promising results for the 7-billion and 8-billion parameter versions of Llama2. In terms of inferencing, Gaudi 3 delivered 95 to 170 percent of the H100’s performance for two versions of Llama. For the Falcon 180B model, Gaudi 3 achieved up to a fourfold advantage.

When it comes to power efficiency, Intel’s new chip outperforms the H100 by a significant margin, delivering up to 220 percent of the H100’s performance per watt on Llama and 230 percent on Falcon. Eitan Medina, chief operating officer of Intel’s Habana Labs, attributes this efficiency to the Gaudi architecture’s large-matrix math engines, which require almost an order of magnitude less memory bandwidth to operate.
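The percentages quoted above are easier to compare as multiplicative factors over the H100 baseline. A small illustrative conversion, using the figures reported in this article as Intel’s claims (the labels are this sketch’s own shorthand, not Intel benchmark names):

```python
# Intel's claimed Gaudi 3 results, expressed as a percentage of H100
# performance (figures quoted in this article; ranges keep low/high bounds).
claims_pct_of_h100 = {
    "Llama inference (low)": 95,
    "Llama inference (high)": 170,
    "Llama perf-per-watt (up to)": 220,
    "Falcon perf-per-watt (up to)": 230,
}

# Convert "percent of H100" into a speedup factor relative to H100 (1.0x).
factors = {name: pct / 100 for name, pct in claims_pct_of_h100.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x H100")
```

Read this way, the inferencing range spans from slightly behind the H100 (0.95x) to well ahead of it (1.70x), while the efficiency claims put Gaudi 3 at more than twice the H100’s performance per watt.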

Gaudi 3 vs. Blackwell: A Speculative Comparison

While it’s difficult to compare accelerators before they’re available, there are a few points of comparison, particularly in memory and memory bandwidth. Both Gaudi 3 and Blackwell make use of high-bandwidth memory (HBM), a stack of DRAM memory dies atop a control chip. Gaudi 3 has more HBM than H100, but less than H200, B200, or AMD’s MI300. However, its memory bandwidth is superior to H100’s.

Gaudi 3 is manufactured using TSMC’s N5 (sometimes called 5-nanometer) process technology, the same process as H100 and H200. The upcoming Blackwell, however, is made on a process called N4P, which TSMC describes as being in the same 5-nm family as N5 but delivering an 11 percent performance boost, 22 percent better efficiency, and 6 percent higher density.

The Future of Gaudi: Falcon Shores

Looking ahead, the next generation of Gaudi, currently code-named Falcon Shores, is a topic of interest. While the Gaudi line has so far relied on TSMC process technology, Intel plans to begin offering its 18A technology to foundry customers next year and will already be using 20A internally. These two nodes bring the next generation of transistor technology, nanosheets, with backside power delivery, a combination TSMC is not planning until 2026.
