SuperOffload Unlocks 2.5x Faster LLM Training on Superchips with Hopper GPUs and Grace CPUs

The increasing demand for artificial intelligence models drives the need for more powerful computing hardware, and researchers are now exploring the potential of ‘Superchips’, processors that integrate GPUs and CPUs in a single package. Xinyu Lian from the University of Illinois Urbana-Champaign, Masahiro Tanaka from Anyscale, Olatunji Ruwase from Snowflake, and Minjia Zhang from the University of Illinois Urbana-Champaign investigate how best to utilise these new processors for training large language models. Their work addresses a critical gap in understanding how Superchips differ from traditional systems, and presents SuperOffload, a novel system designed specifically for this architecture. By combining techniques such as adaptive weight offloading and a highly optimised Adam optimiser, SuperOffload achieves up to a 2.5x throughput improvement and enables the training of significantly larger models, a substantial step towards more efficient and powerful AI development.

GPU-CPU Offloading for Large Models

Researchers have developed SuperOffload, a new technique for training extremely large language models (LLMs) that exceed the memory capacity of individual GPUs. The system intelligently manages data placement across GPU memory, CPU memory, and, where available, NVMe storage, maximising efficiency and enabling the training of models that would not otherwise fit on a single device. SuperOffload was evaluated alongside established offloading methods, demonstrating its potential as a superior approach for large-scale training.
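To make the core idea concrete, below is a minimal sketch of layer-wise weight offloading in plain PyTorch. This is not SuperOffload’s implementation, only an illustration of the general pattern it builds on: weights live in host memory, and each layer is staged onto the GPU just before it runs and released afterwards. All names here are illustrative.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# The full model lives in host (CPU) memory; only the active layer
# occupies GPU memory at any moment.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])

def offloaded_forward(x: torch.Tensor) -> torch.Tensor:
    for layer in layers:
        layer.to(device)   # stage this layer's weights onto the GPU
        x = layer(x)
        layer.to("cpu")    # free GPU memory for the next layer
    return x

x = torch.randn(2, 4096, device=device)
print(offloaded_forward(x).shape)  # torch.Size([2, 4096])
```

On a Grace Hopper Superchip, these host-to-device transfers travel over the 900 GB/s NVLink-C2C link rather than a conventional PCIe bus, which is what makes such aggressive offloading economical.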

Adaptive Weight Offloading for Grace Hopper Superchips

Scientists have developed SuperOffload, a system designed to fully utilise the capabilities of NVIDIA’s Grace Hopper Superchips for large language model (LLM) training. Recognising that existing offloading techniques fail to exploit the Superchip’s 900 GB/s NVLink-C2C interconnect, the researchers built a system tailored to this high-bandwidth link. The core of SuperOffload lies in adaptive weight offloading, which dynamically determines the optimal location for model weights based on computational needs, combined with a fine-grained data repartitioning strategy. To further enhance performance, the team incorporated speculative execution and a specialised Adam optimiser for the Grace CPU.
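The division of labour this implies can be sketched as follows: gradients are produced on the GPU, copied across the interconnect, and the Adam update runs on the CPU, where the optimiser states reside. The paper’s Grace-specific optimiser is a highly tuned kernel; the plain PyTorch version below only illustrates the data flow, and all names are illustrative rather than SuperOffload’s API.

```python
import torch

def cpu_adam_step(param, grad_gpu, m, v, step, lr=1e-4,
                  betas=(0.9, 0.95), eps=1e-8):
    """One Adam update executed on the CPU for a single parameter tensor."""
    g = grad_gpu.to("cpu", non_blocking=True)            # device-to-host copy
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])         # first moment
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])  # second moment
    m_hat = m / (1 - betas[0] ** step)                   # bias correction
    v_hat = v / (1 - betas[1] ** step)
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

p = torch.zeros(1024)                 # master weights held in CPU memory
m, v = torch.zeros_like(p), torch.zeros_like(p)
g = torch.randn(1024)                 # stands in for a gradient from the GPU
cpu_adam_step(p, g, m, v, step=1)
```

Keeping the optimiser states and master weights in the Grace CPU’s large memory frees the Hopper GPU to hold only the tensors needed for forward and backward computation.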

SuperOffload unlocks Grace Hopper’s LLM training potential

The introduction of NVIDIA’s Grace Hopper Superchip, integrating a Hopper GPU and a Grace CPU over a 900 GB/s interconnect, represents a significant advancement in AI hardware. Researchers have developed SuperOffload to fully harness this potential for LLM training. Achieving up to a 2.5x throughput improvement over state-of-the-art offloading systems, SuperOffload demonstrates a substantial advance in training efficiency. Experiments reveal that SuperOffload enables the training of a 25 billion parameter model on a single Superchip, exceeding the capacity of GPU-only solutions by a factor of seven.

Extending the system with ZeRO-style data parallelism allows a 50 billion parameter model to be trained on only four Superchips, a 2.5x improvement over existing parallel training methods. The team also developed SuperOffload-Ulysses, which supports long-sequence training, achieving 55% model FLOPs utilisation (MFU) while training a 13 billion parameter model with sequences of up to one million tokens on eight GH200 Superchips.
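For readers who want to see what a ZeRO-style offload setup looks like in practice, the configuration below uses standard DeepSpeed ZeRO-3 options for partitioning training states across ranks and offloading them to the CPU. This is the baseline recipe that SuperOffload extends; it is not SuperOffload’s own configuration, which is not given in this summary.

```python
# Standard DeepSpeed ZeRO-3 configuration with CPU offload (illustrative).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimiser states
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Typical usage (requires a model and the deepspeed package):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```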

Superchip Training Speeds Up With SuperOffload

Researchers have presented SuperOffload, a system designed to optimise LLM training on Superchips, a new generation of AI hardware integrating GPUs and CPUs. Through detailed analysis, the team identified performance limitations when applying existing offloading techniques to this architecture, and subsequently developed SuperOffload to make more efficient use of the combined resources of Hopper GPUs, Grace CPUs, and the NVLink-C2C interconnect. This advance enables the training of 25 billion parameter models on a single Superchip, and facilitates training a 13 billion parameter model with sequence lengths of up to one million tokens using only eight Superchips. The team also reports achieving 55% model FLOPs utilisation (MFU), a key metric of how efficiently the hardware is used.
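MFU compares the FLOPs a training run actually sustains against the hardware’s theoretical peak. A quick back-of-the-envelope check, using the common approximation of 6 FLOPs per parameter per token for transformer training, shows how a 55% figure arises; the peak and throughput numbers below are assumptions chosen for illustration, not measurements from the paper.

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOPs utilisation via the 6*N FLOPs-per-token approximation."""
    achieved = 6.0 * params * tokens_per_sec  # forward + backward FLOPs/s
    return achieved / peak_flops

peak = 989e12                                  # assumed peak bf16 FLOP/s per GPU
print(f"MFU = {mfu(13e9, 7_000, peak):.1%}")   # ≈ 55% for a 13B model
```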

👉 More information
🗞 SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
🧠 ArXiv: https://arxiv.org/abs/2509.21271

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Black Hole Shadows Grow with Dark Matter and Cosmic Strings, Simulations Reveal
February 6, 2026

New Material Superconducts at 60.8 Kelvin, Potentially Revolutionising Energy Transmission
February 6, 2026

New Magnetic Textures Could Shrink Future Data Storage Devices Dramatically
February 6, 2026