The increasing complexity of modern artificial intelligence demands ever more powerful hardware, but scaling up graphics processing units (GPUs) introduces a hidden challenge: non-uniform memory access (NUMA). Mansi Choudhary from Duke University, together with Karthik Sangaiah, Sonali Singh, and colleagues at Advanced Micro Devices Inc., investigates how variations in memory access speed across different parts of a GPU significantly hinder the performance of attention-based AI models. Their work reveals that traditional scheduling methods, which assume uniform memory access, fail to fully utilize the capabilities of next-generation GPUs. By introducing a spatially aware scheduling strategy, dubbed Swizzled Head-first Mapping, the team aligns computational tasks with the GPU's internal architecture, achieving up to 50% performance gains on AMD's MI300X architecture while sustaining remarkably high cache hit rates, a crucial step towards truly scalable and efficient AI computing.
Swizzled Mapping Optimizes Chiplet GPU Attention
This research investigates how to optimize attention mechanisms in large language models (LLMs) for modern chiplet-based GPUs with disaggregated memory, such as AMD's MI300X. The core idea is to exploit spatial locality within the attention computation to mitigate non-uniform memory access (NUMA) effects. Scientists found that traditional scheduling strategies, which are oblivious to NUMA, create performance bottlenecks through data movement between chiplets. To address this, the team proposes Swizzled Head-first Mapping, a novel data mapping strategy that reorders data access patterns to align computation with the GPU's NUMA domains.
This maximizes data reuse within each chiplet and minimizes communication between chiplets. Experiments demonstrate that Swizzled Head-first Mapping achieves performance improvements of up to 50% over conventional scheduling techniques, particularly in attention scenarios with a high number of heads, such as those found in the DeepSeek-V3 model. The optimization also consistently maintains high L2 cache hit rates of 80% to 97%, indicating effective data reuse. The approach was validated on AMD's MI300X GPU across a variety of attention workloads, including FlashAttention-2 and the prefill phase of DeepSeek-V3. This work highlights the importance of hardware-aware algorithm design in the era of increasingly complex AI accelerators.
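In kernel terms, the reorder amounts to a permutation of workgroup indices. Below is a minimal Python sketch of that index arithmetic, assuming the hardware dispatches workgroups to the eight XCDs round-robin by workgroup id (consistent with public descriptions of MI300X scheduling); the function name and the even-divisibility simplification are illustrative, not the authors' implementation.

```python
NUM_XCDS = 8  # MI300X: eight XCDs, each with its own L2 cache

def swizzled_head_first(pid, num_heads, num_q_blocks):
    """Remap a linear workgroup id so every query block of a given
    attention head runs on the same XCD.

    Assumes the hardware dispatches workgroup `pid` to XCD
    `pid % NUM_XCDS` (round-robin), and that the total workgroup
    count divides evenly by NUM_XCDS; remainder handling is omitted.
    """
    per_xcd = (num_heads * num_q_blocks) // NUM_XCDS
    xcd = pid % NUM_XCDS    # chiplet this workgroup lands on
    slot = pid // NUM_XCDS  # its dispatch slot within that chiplet
    # Walk the (head, q_block) space head-first within each XCD, so a
    # chiplet finishes all query blocks of one head (reusing that
    # head's K/V from its local L2) before starting the next head.
    linear = xcd * per_xcd + slot
    return linear // num_q_blocks, linear % num_q_blocks  # (head, q_block)

# Example: 16 heads x 4 query blocks on 8 XCDs -> XCD 0 runs head 0
# end to end, then head 1, while other XCDs own other head slices.
for pid in (0, 8, 16, 24, 32):
    print(pid, swizzled_head_first(pid, 16, 4))
```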
Aligning Attention Heads with NUMA Domains
The research addresses a critical bottleneck in modern AI systems: non-uniform memory access (NUMA) effects arising from increasingly disaggregated GPU architectures. As GPUs scale through chiplet designs, memory latency and bandwidth vary significantly across compute regions, undermining traditional scheduling strategies that assume uniform access. The team engineered a method that aligns attention heads with specific GPU NUMA domains, exploiting intra-chiplet cache reuse and minimizing cross-chiplet communication.
This involved a detailed analysis of the MI300X architecture, revealing the spatial distribution of compute units and memory controllers. Scientists then mapped attention heads to compute units residing within the same chiplet, maximizing the likelihood of cache hits and reducing memory access latency. The approach fundamentally reorders the processing of attention heads, prioritizing those located within the same NUMA domain before moving to others. Experiments employed the AMD MI300X architecture to rigorously evaluate the performance of Swizzled Head-first Mapping. The team implemented and tested their scheduling strategy across a range of multi-head attention (MHA) workloads, comparing its performance against state-of-the-art attention algorithms that use conventional scheduling techniques.
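To see why the reorder matters, compare how many distinct heads each chiplet must serve under a conventional launch versus the head-first swizzle. The following self-contained Python sketch (illustrative names; round-robin dispatch assumed, as above) counts heads per XCD under both mappings:

```python
from collections import defaultdict

NUM_XCDS = 8  # MI300X chiplet count; round-robin dispatch assumed

def heads_touched_per_xcd(remap, num_heads, num_q_blocks):
    """Count how many distinct attention heads each XCD must serve,
    given a pid -> head remapping and `pid % NUM_XCDS` dispatch."""
    heads = defaultdict(set)
    for pid in range(num_heads * num_q_blocks):
        heads[pid % NUM_XCDS].add(remap(pid, num_heads, num_q_blocks))
    return {xcd: len(h) for xcd, h in sorted(heads.items())}

# Conventional launch: consecutive pids walk one head's query blocks,
# so round-robin dispatch scatters that head across every chiplet.
naive = lambda pid, H, B: pid // B

# Head-first swizzle: each XCD owns a contiguous slice of heads.
def swizzled(pid, H, B):
    per_xcd = (H * B) // NUM_XCDS
    return ((pid % NUM_XCDS) * per_xcd + pid // NUM_XCDS) // B

# With 32 heads and 8 query blocks each: naive -> every XCD touches
# all 32 heads; swizzled -> each XCD touches only 4, so each head's
# K/V can stay resident in a single chiplet's L2.
print(heads_touched_per_xcd(naive, 32, 8))     # {0: 32, 1: 32, ...}
print(heads_touched_per_xcd(swizzled, 32, 8))  # {0: 4, 1: 4, ...}
```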
Results demonstrate a significant performance improvement, achieving up to 50% higher throughput with Swizzled Head-first Mapping. Furthermore, the method sustains consistently high L2 cache hit rates of 80% to 97%, indicating efficient utilization of on-chip memory. This work establishes that NUMA-aware scheduling is now fundamental to achieving peak efficiency on next-generation disaggregated GPUs, paving the way for scalable AI training and inference.
Swizzled Mapping Optimizes GPU Attention Performance
The research team has demonstrated a breakthrough in optimizing attention mechanisms for modern, disaggregated GPUs, specifically addressing the challenges posed by non-uniform memory access (NUMA). As GPU designs evolve towards multi-chiplet architectures, memory latency and bandwidth vary significantly across compute regions, hindering performance. Experiments conducted on the AMD MI300X architecture reveal that this new method achieves up to 50% higher performance compared to state-of-the-art attention algorithms employing conventional scheduling techniques.
Crucially, the team consistently measured high L2 cache hit rates ranging from 80% to 97% using their approach, demonstrating substantial improvements in memory access efficiency. The MI300X, featuring eight Accelerator Complex Dies (XCDs), each with dedicated compute units, L2 cache, and memory controllers connected to independent HBM stacks, served as an ideal testbed for validating spatially-aware optimization techniques. This work builds upon prior successes in optimizing other GPU kernels for the MI300X, where architecture-aware mapping strategies previously increased L2 cache hit rates from 43% to 92%. The results confirm that NUMA-aware scheduling is now fundamental to achieving peak efficiency on next-generation disaggregated GPUs, offering a clear path forward for scalable AI training and inference workloads.
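Such cache measurements can be reproduced on AMD hardware because the L2 is the TCC block, whose hit and miss hardware counters profilers like rocprof can collect per kernel dispatch. Here is a minimal sketch of the derived hit rate, assuming rocprof-style counter names such as TCC_HIT_sum and TCC_MISS_sum (a tooling assumption on our part, not a detail from the paper):

```python
def l2_hit_rate(tcc_hit: int, tcc_miss: int) -> float:
    """L2 hit rate from the TCC (L2 cache) hardware counters.

    On AMD GPUs the L2 is the TCC block; counters such as TCC_HIT_sum
    and TCC_MISS_sum (names assumed from rocprof's metric lists) give
    per-dispatch hit and miss totals.
    """
    total = tcc_hit + tcc_miss
    return tcc_hit / total if total else 0.0

# A dispatch with 970k hits and 30k misses sits at the top of the
# paper's reported 80-97% range.
print(f"{l2_hit_rate(970_000, 30_000):.0%}")  # 97%
```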
Swizzled Mapping Boosts Chiplet GPU Performance
This work demonstrates that strategic exploitation of spatial locality mitigates non-uniform memory access (NUMA) effects in modern chiplet-based GPUs. Results on AMD’s MI300X architecture show up to 50% performance improvement over state-of-the-art attention algorithms using conventional scheduling techniques, while consistently maintaining high L2 cache hit rates of 80-97%. These findings underscore the necessity of hardware-aware algorithm design as chiplet-based architectures become increasingly prevalent in AI accelerators.
👉 More information
🗞 Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects
🧠 ArXiv: https://arxiv.org/abs/2511.02132
