The increasing demand for long-context large language models presents significant challenges due to the computational complexity of processing extensive data, particularly within the self-attention mechanism. Yida Wang from Capital Normal University, together with Ke Hong, Xiuhong Li, Yuanchao Xu, and Wenxun Wang from Tsinghua University and Guohao Dai from Infinigence-AI and Shanghai Jiao Tong University, addresses this issue with a novel approach to sequence parallelism. Their research introduces TASP, a topology-aware method that improves communication efficiency by aligning data transfer with the architecture of modern accelerators. By decomposing both the accelerator topology and the underlying communication primitives, TASP unlocks significantly greater bandwidth utilization, achieving speedups of up to 3.58x over existing methods such as Ring Attention on NVIDIA H100 and AMD MI300X systems and paving the way for more powerful and efficient long-context language models.
Long Sequence Transformer Training via Partitioning
Researchers introduce TASP (Topology-Aware Sequence Parallelism), a novel approach to parallelizing the training of long-sequence Transformers. The core challenge addressed is the memory bottleneck encountered when training Transformers with very long sequences, which limits batch size and slows down training. TASP partitions the input sequence into smaller segments and distributes these across multiple GPUs, crucially optimizing this partitioning based on the network topology and interconnect bandwidth to minimize communication overhead during attention computation. Unlike methods that require contiguous segments, TASP allows non-contiguous partitioning, providing flexibility for workload balancing and communication minimization. This topology awareness differentiates TASP from existing methods, and the team demonstrates that it achieves significant speedups and memory savings, increasing throughput and improving scalability.
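To make the flexible-partitioning idea concrete, here is a minimal sketch (our illustration, not the authors' code) of how non-contiguous chunks can balance causal-attention work across devices: each device takes one chunk from the front and one from the back of the sequence, in the style of zigzag partitioning. The function names and the toy work model are assumptions for illustration; TASP additionally chooses the assignment with the interconnect topology in mind.

```python
# Illustrative sketch: non-contiguous (zigzag-style) sequence partitioning.
# NOTE: not the authors' implementation; it only shows why non-contiguous
# chunks can balance causal-attention work across devices.

def zigzag_partition(seq_len: int, num_devices: int):
    """Split token indices into 2*num_devices chunks and give device d the
    d-th chunk from the front plus the d-th chunk from the back, so every
    device sees a similar mix of "early" and "late" tokens."""
    assert seq_len % (2 * num_devices) == 0
    chunk = seq_len // (2 * num_devices)
    chunks = [list(range(i * chunk, (i + 1) * chunk)) for i in range(2 * num_devices)]
    return [chunks[d] + chunks[2 * num_devices - 1 - d] for d in range(num_devices)]

def causal_work(token_ids):
    # Under causal masking, token t attends to t + 1 positions,
    # so per-device work is roughly the sum of (t + 1).
    return sum(t + 1 for t in token_ids)

if __name__ == "__main__":
    parts = zigzag_partition(seq_len=16, num_devices=4)
    print([causal_work(p) for p in parts])  # near-equal work per device
    contiguous = [list(range(d * 4, (d + 1) * 4)) for d in range(4)]
    print([causal_work(p) for p in contiguous])  # heavily skewed toward the last device
```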
Topology-Aware Sequence Parallelism For Language Models
Researchers developed TASP, a topology-aware sequence parallelism method, to address communication bottlenecks in long-context large language models. The study recognized that existing sequence parallelism techniques, like Ring Attention, suffer from inefficient communication due to a mismatch between the communication method and the underlying topology of modern accelerators. To overcome this limitation, the team drew inspiration from the Hamiltonian decomposition of complete directed graphs, identifying that modern accelerator topologies can be decomposed into multiple orthogonal ring datapaths. This decomposition allows concurrent data transfer without interference, maximizing communication bandwidth. The team also observed that the Ring AllGather primitive itself can be decomposed into concurrent ring-style data transfers.
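As a rough illustration of the topology-decomposition idea, the sketch below carves the complete directed graph over n fully connected accelerators into n-1 edge-disjoint "shift" datapaths, where shift r routes rank i to rank (i + r) mod n; no two datapaths share a directed link, so they can carry traffic concurrently. This simple circulant construction is our stand-in, not the paper's: the Hamiltonian decomposition the authors build on additionally guarantees that each datapath is a single full-length ring.

```python
# Illustrative sketch: decompose an all-to-all topology (complete directed
# graph on n accelerators) into n-1 edge-disjoint "shift" datapaths.
# Not the paper's construction; its Hamiltonian decomposition further
# ensures every datapath is one full-length ring.

def shift_datapaths(n: int):
    """For each shift r in 1..n-1, return the directed links (i -> (i + r) % n)."""
    return {r: [(i, (i + r) % n) for i in range(n)] for r in range(1, n)}

def check_edge_disjoint(n: int):
    """Every directed link of the complete graph appears in exactly one datapath."""
    seen = set()
    for edges in shift_datapaths(n).values():
        for e in edges:
            assert e not in seen, f"link {e} reused"
            seen.add(e)
    assert len(seen) == n * (n - 1)  # all directed links covered exactly once
    return True

if __name__ == "__main__":
    n = 8  # e.g. one fully connected node with 8 GPUs
    for r, edges in shift_datapaths(n).items():
        print(f"datapath r={r}: {edges}")
    print("edge-disjoint:", check_edge_disjoint(n))
```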
These observations formed the basis of TASP, which fully utilizes the communication capacity of accelerators through both topology and primitive decomposition. Experiments on NVIDIA H100 and AMD MI300X systems show that TASP achieves up to a 3.58x speedup over Ring Attention and its variant, Zigzag-Ring Attention, a substantial improvement in communication efficiency and scalability.
TASP Accelerates Long Sequence Processing Significantly
Scientists have developed a new method, TASP, to significantly improve the performance of large language models processing very long sequences of text. Current models struggle with long contexts due to the quadratic complexity of the self-attention mechanism. The team addressed this challenge by optimizing how data is distributed across multiple accelerators. Experiments reveal that existing sequence parallelism methods, like Ring Attention, underutilize the communication capacity of modern accelerator systems, limiting their efficiency. TASP overcomes this limitation through topology decomposition, inspired by the Hamiltonian decomposition of graphs, combined with primitive decomposition; together, these break the communication network into multiple independent pathways, enabling concurrent data transfer without interference.
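The following simplified simulation (our sketch, not the authors' kernels) illustrates the primitive-decomposition idea on Ring AllGather: each rank's block is split into sub-chunks that circulate over different edge-disjoint rings at the same time, so both directions of every link carry useful data in every step. For clarity it uses only a forward and a backward ring; TASP generalizes this to the full set of datapaths exposed by the topology decomposition.

```python
# Simplified simulation: decompose Ring AllGather into concurrent ring-style
# transfers over two edge-disjoint rings (forward i -> i+1, backward i -> i-1).

def decomposed_allgather(blocks, n):
    """blocks[i] holds rank i's local block, pre-split into two halves.
    Returns, for every rank, the halves it has gathered from every rank."""
    # gathered[rank][half] maps source rank -> data held so far
    gathered = [[{i: blocks[i][h]} for h in range(2)] for i in range(n)]
    fwd_buf = [blocks[i][0] for i in range(n)]  # half 0 circulates forward
    bwd_buf = [blocks[i][1] for i in range(n)]  # half 1 circulates backward
    fwd_src = list(range(n))
    bwd_src = list(range(n))
    for _ in range(n - 1):
        # both rings move in the same step, so both link directions stay busy
        fwd_buf = [fwd_buf[(i - 1) % n] for i in range(n)]
        fwd_src = [fwd_src[(i - 1) % n] for i in range(n)]
        bwd_buf = [bwd_buf[(i + 1) % n] for i in range(n)]
        bwd_src = [bwd_src[(i + 1) % n] for i in range(n)]
        for i in range(n):
            gathered[i][0][fwd_src[i]] = fwd_buf[i]
            gathered[i][1][bwd_src[i]] = bwd_buf[i]
    return gathered

if __name__ == "__main__":
    n = 4
    blocks = [(f"KV{i}a", f"KV{i}b") for i in range(n)]  # each block split into two sub-chunks
    result = decomposed_allgather(blocks, n)
    # every rank ends up with both halves of every rank's block
    assert all(len(result[r][h]) == n for r in range(n) for h in range(2))
    print(result[0][0])  # rank 0 gathered half 0 of every rank via the forward ring
```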
Measurements confirm that TASP fully utilizes the communication capacity of modern accelerators, achieving a speedup of up to 3.58x compared to Ring Attention and its variant, Zigzag-Ring Attention. Detailed analysis on AMD MI300X systems demonstrates that TASP achieves a compute-to-communication ratio (CCR) of 0.39 with a sequence length of 10K, increasing to 1.17 with a sequence length of 100K, indicating a more balanced workload and enhanced performance.
Specifically, with a batch size of 48, TASP achieves speedups of 2.4x, 1.8x, 1.5x, and 1.3x for sequence lengths of 10K, 20K, 40K, and 50K, respectively.
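To help interpret the CCR figures above, here is a back-of-the-envelope model (our simplification, with assumed hardware numbers for achieved throughput, link bandwidth, and model width; it is not meant to reproduce the measured 0.39 and 1.17). It shows why the ratio tends to grow with sequence length: per-device attention FLOPs scale with L * (L / P), while the key/value bytes each device must receive scale only with L.

```python
# Back-of-the-envelope model of the compute-to-communication ratio (CCR).
# All figures below (achieved FLOP/s, link bandwidth, model width) are
# illustrative assumptions, not the paper's measurement setup.

def ccr_estimate(seq_len, num_devices, d_model=1024, dtype_bytes=2,
                 flops_per_s=400e12, link_bytes_per_s=150e9):
    """Crude CCR estimate: attention compute time / KV communication time."""
    local = seq_len / num_devices
    # QK^T and PV matmuls over the full context: ~4 * local * seq_len * d_model FLOPs
    compute_s = 4 * local * seq_len * d_model / flops_per_s
    # each device must receive K and V for all remote tokens
    comm_s = 2 * (seq_len - local) * d_model * dtype_bytes / link_bytes_per_s
    return compute_s / comm_s

for L in (10_000, 100_000):
    print(f"L={L}: CCR ~ {ccr_estimate(L, num_devices=8):.2f}")
```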
Topology and Primitive Decomposition for Faster Training
Researchers have developed a new method, termed TASP, to improve the efficiency of training large language models that process extensive amounts of text. Current approaches struggle with the computational demands of handling long sequences because of the quadratic cost of the self-attention mechanism. TASP addresses this by optimizing how data is communicated between processors, leveraging the underlying architecture of modern AI accelerators to maximize communication bandwidth. The team achieved this through a two-step process involving topology decomposition and primitive decomposition, effectively tailoring data transfer to the hardware’s capabilities.
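For context on where such a communication schedule plugs in, the sketch below shows the generic blockwise-attention pattern used by ring-style sequence parallelism: each device keeps its local queries and folds in remote key/value blocks as they arrive, combining partial softmax results online. This is standard background rather than TASP-specific code, and all names here are our own.

```python
import numpy as np

# Generic blockwise attention used by ring-style sequence parallelism:
# a device folds in key/value blocks as they arrive over the ring datapaths,
# combining partial softmax results online. Background illustration only.

def combine_block(q, k_blk, v_blk, out, lse):
    """Fold one incoming KV block into the running attention output.
    out: running weighted sum; lse: running log-sum-exp per query row."""
    scores = q @ k_blk.T / np.sqrt(q.shape[-1])  # (nq, nk)
    m = scores.max(-1, keepdims=True)
    blk_lse = (m + np.log(np.exp(scores - m).sum(-1, keepdims=True))).squeeze(-1)
    blk_out = np.exp(scores - blk_lse[:, None]) @ v_blk  # block-normalized output
    new_lse = np.logaddexp(lse, blk_lse)
    w_old = np.exp(lse - new_lse)[:, None]
    w_new = np.exp(blk_lse - new_lse)[:, None]
    return w_old * out + w_new * blk_out, new_lse

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nq, d, n_blocks, nk = 4, 16, 3, 5
    q = rng.standard_normal((nq, d))
    ks = [rng.standard_normal((nk, d)) for _ in range(n_blocks)]
    vs = [rng.standard_normal((nk, d)) for _ in range(n_blocks)]
    out, lse = np.zeros((nq, d)), np.full(nq, -np.inf)
    for k_blk, v_blk in zip(ks, vs):  # blocks arrive over the ring(s)
        out, lse = combine_block(q, k_blk, v_blk, out, lse)
    # reference: attention over all keys/values at once
    s = q @ np.concatenate(ks).T / np.sqrt(d)
    p = np.exp(s - s.max(-1, keepdims=True))
    ref = (p / p.sum(-1, keepdims=True)) @ np.concatenate(vs)
    assert np.allclose(out, ref)
    print("blockwise result matches full attention")
```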
Experimental results demonstrate that TASP significantly improves communication efficiency compared to existing methods like Ring Attention, achieving speedups of up to 3.58x. While the benefits of TASP diminish when overall system latency becomes dominant, the research highlights the importance of aligning communication strategies with hardware architecture for optimal performance. The authors acknowledge that combining TASP with computation-oriented optimizations, such as sparse attention, could yield further improvements.
👉 More information
🗞 TASP: Topology-aware Sequence Parallelism
🧠 ArXiv: https://arxiv.org/abs/2509.26541
