SuperOffload Unlocks 2.5x Faster LLM Training on Superchips with Hopper GPUs and Grace CPUs

The increasing demand for artificial intelligence models drives the need for more powerful computing hardware, and researchers are now exploring the potential of ‘Superchips’, processors that integrate GPUs and CPUs in a single package. Xinyu Lian from the University of Illinois Urbana-Champaign, Masahiro Tanaka from Anyscale, Olatunji Ruwase from Snowflake, and Minjia Zhang from the University of Illinois Urbana-Champaign investigate how best to utilise these new processors for training large language models. Their work addresses a critical gap in understanding how Superchips differ from traditional systems, and presents SuperOffload, a novel system designed specifically for this architecture. By combining techniques such as adaptive weight offloading and a highly optimised Adam optimiser, SuperOffload achieves up to a 2.5x throughput improvement and enables the training of significantly larger models, a substantial step towards more efficient and powerful AI development.

GPU-CPU Offloading for Large Models

Researchers have developed SuperOffload, a new technique for training extremely large language models (LLMs) that exceed the memory capacity of individual GPUs. The system intelligently manages data placement across GPU memory, CPU memory, and, where available, NVMe storage, maximising efficiency and enabling the training of models that would not otherwise fit on a single device. SuperOffload was evaluated alongside established offloading methods, demonstrating its potential as a superior approach for large-scale training.
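To make the core idea concrete, below is a minimal sketch of layer-wise weight offloading in plain PyTorch. This is not SuperOffload’s implementation, only an illustration of the general pattern it builds on: weights live in host memory, and each layer is staged onto the GPU just before it runs and released afterwards. All names here are illustrative.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# The full model lives in host (CPU) memory; only the active layer
# occupies GPU memory at any moment.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])

def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    for layer in layers:
        layer.to(device)   # stage this layer's weights onto the GPU
        x = layer(x)
        layer.to("cpu")    # free GPU memory for the next layer
    return x

x = torch.randn(2, 4096, device=device)
print(offloaded_forward(x).shape)  # torch.Size([2, 4096])
```

On a Grace Hopper Superchip, these host-to-device transfers travel over the 900 GB/s NVLink-C2C link rather than a conventional PCIe bus, which is what makes such aggressive offloading economical.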

Adaptive Weight Offloading for Grace Hopper Superchips

Scientists have developed SuperOffload, a system designed to fully utilise the capabilities of NVIDIA’s Grace Hopper Superchips for large language model (LLM) training. Recognising that existing offloading techniques fail to exploit the Superchip’s 900 GB/s NVLink-C2C interconnect, the researchers built a system tailored to this high-bandwidth link. The core of SuperOffload lies in adaptive weight offloading, which dynamically determines the optimal location for model weights based on computational needs, combined with a fine-grained data repartitioning strategy. To further enhance performance, the team incorporated speculative execution and a specialised Adam optimiser for the Grace CPU.
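The division of labour this implies can be sketched as follows: gradients are produced on the GPU, copied across the interconnect, and the Adam update runs on the CPU, where the optimiser states reside. The paper’s Grace-specific optimiser is a highly tuned kernel; the plain PyTorch version below only illustrates the data flow, and all names are illustrative rather than SuperOffload’s API.

```python
import torch

def cpu_adam_step(param, grad_gpu, m, v, step, lr=1e-4,
                  betas=(0.9, 0.95), eps=1e-8):
    """One Adam update executed on the CPU for a single parameter tensor."""
    g = grad_gpu.to("cpu", non_blocking=True)            # device-to-host copy
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])         # first moment
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])  # second moment
    m_hat = m / (1 - betas[0] ** step)                   # bias correction
    v_hat = v / (1 - betas[1] ** step)
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

p = torch.zeros(1024)                 # master weights held in CPU memory
m, v = torch.zeros_like(p), torch.zeros_like(p)
g = torch.randn(1024)                 # stands in for a gradient from the GPU
cpu_adam_step(p, g, m, v, step=1)
```

Keeping the optimiser states and master weights in the Grace CPU’s large memory frees the Hopper GPU to hold only the tensors needed for forward and backward computation.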

SuperOffload unlocks Grace Hopper’s LLM training potential

The introduction of NVIDIA’s Grace Hopper Superchip, integrating a Hopper GPU and a Grace CPU over a 900 GB/s interconnect, represents a significant advancement in AI hardware. Researchers have developed SuperOffload to fully harness this potential for LLM training. Achieving up to a 2.5x throughput improvement over state-of-the-art offloading systems, SuperOffload demonstrates a substantial advance in training efficiency. Experiments reveal that SuperOffload enables the training of a 25 billion parameter model on a single Superchip, exceeding the capacity of GPU-only solutions by a factor of seven.

Extending the system with ZeRO-style data parallelism allows a 50 billion parameter model to be trained on only four Superchips, a 2.5x improvement over existing parallel training methods. The team also developed SuperOffload-Ulysses, which supports long-sequence training, achieving 55% model FLOPs utilisation (MFU) while training a 13 billion parameter model with sequences of up to one million tokens on eight GH200 Superchips.
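For readers who want to see what a ZeRO-style offload setup looks like in practice, the configuration below uses standard DeepSpeed ZeRO-3 options for partitioning training states across ranks and offloading them to the CPU. This is the baseline recipe that SuperOffload extends; it is not SuperOffload’s own configuration, which is not given in this summary.

```python
# Standard DeepSpeed ZeRO-3 configuration with CPU offload (illustrative).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimiser states
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Typical usage (requires a model and the deepspeed package):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```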

Superchip Training Speeds Up With SuperOffload

Researchers have presented SuperOffload, a system designed to optimise LLM training on Superchips, a new generation of AI hardware integrating GPUs and CPUs. Through detailed analysis, the team identified performance limitations when applying existing offloading techniques to this architecture, and subsequently developed SuperOffload to make more efficient use of the combined resources of Hopper GPUs, Grace CPUs, and the NVLink-C2C interconnect. This advance enables the training of 25 billion parameter models on a single Superchip, and facilitates training a 13 billion parameter model with sequence lengths of up to one million tokens using only eight Superchips. The team also reports achieving 55% model FLOPs utilisation (MFU), a key metric of how efficiently the hardware is used.
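MFU compares the FLOPs a training run actually sustains against the hardware’s theoretical peak. A quick back-of-the-envelope check, using the common approximation of 6 FLOPs per parameter per token for transformer training, shows how a 55% figure arises; the peak and throughput numbers below are assumptions chosen for illustration, not measurements from the paper.

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs utilisation via the 6*N FLOPs-per-token approximation."""
    achieved = 6.0 * params * tokens_per_sec  # forward + backward FLOPs/s
    return achieved / peak_flops

peak = 989e12                                  # assumed peak bf16 FLOP/s per GPU
print(f"MFU = {mfu(13e9, 7_000, peak):.1%}")   # ≈ 55% for a 13B model
```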

👉 More information
🗞 SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
🧠 ArXiv: https://arxiv.org/abs/2509.21271

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Black Hole Shadows Grow with Dark Matter and Cosmic Strings, Simulations Reveal
February 6, 2026

New Material Superconducts at 60.8 Kelvin, Potentially Revolutionising Energy Transmission
February 6, 2026

New Magnetic Textures Could Shrink Future Data Storage Devices Dramatically
February 6, 2026