Scientists are continually striving to accelerate Large Language Models, and optimising attention kernels is a critical part of that effort. Researchers Yifan Zhu, Yekai Pan, and Chen Ding, all from the University of Rochester, USA, have identified a key bottleneck in the memory behaviour of CuTile-based FlashAttention on the NVIDIA GB10 architecture: L2 cache misses. Their new technique, termed Sawtooth Wavefront Reordering, directly addresses this issue, reducing these misses by 50% or more and boosting throughput by up to 60% on GB10, a significant advance that promises to unlock even greater performance in future LLMs.
The team’s investigation, conducted on the NVIDIA GB10 (Grace Blackwell) processor, pinpointed L2 cache misses as a primary performance bottleneck. Leveraging this critical insight, they developed “Sawtooth Wavefront Reordering”, a programming technique designed to minimise these L2 misses and accelerate processing.
This work begins with a rigorous analysis of CuTile-based FlashAttention’s memory access patterns using hardware counters and performance models. The researchers employed a raw CUDA implementation with carefully designed Cooperative Thread Array (CTA) scheduling to isolate the impact of memory access, eliminating confounding factors from compiler optimisations. Their analysis revealed that the L1 cache offers minimal benefit for streaming attention patterns, while L2 cache behaviour follows a predictable, deterministic model. Crucially, they identified a strong correlation between L2 hit rates and the number of active Streaming Multiprocessors (SMs), suggesting inherent data reuse among synchronous wavefronts, the key finding driving their innovation.
Based on these observations, the team proposed the “Sawtooth” alternating scan pattern, a method that enhances L2 cache data reuse by strategically reordering memory accesses. By increasing the likelihood that required data is already present in the L2 cache, it reduces trips to slower global memory. The researchers then successfully ported this optimisation to the CuTile environment, demonstrating that insights gained from low-level analysis translate directly into improvements within the higher-level programming model.

The NVIDIA GB10, fabricated on TSMC’s 3nm process node, features 48 streaming multiprocessors and 20 ARM v9.2 CPU cores, alongside 128GB of unified LPDDR5X memory with approximately 301GB/s of raw bandwidth and an aggregate bandwidth of around 600GB/s. Using the Nsight Compute CLI, the team measured performance metrics such as total sectors requested and cache hit rates, confirming the effectiveness of their technique. The reordering itself works by alternating the direction in which data is loaded into the L2 cache, using local iteration parity to switch between forward and backward loading sequences: if the local iteration index modulo 2 equals 0, the loading sequence starts at 0 and steps by +1; otherwise, it starts at NKV − 1 and steps by −1, tracing out a sawtooth pattern.
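The parity rule described above can be sketched in a few lines of Python (a minimal illustration of the index selection, not the authors’ CUDA code; the function name is ours):

```python
def sawtooth_order(local_iter, nkv):
    """Return the KV-tile visit order for one pass of the sawtooth scan.

    Even local iterations scan forward (0, 1, ..., NKV-1); odd ones scan
    backward (NKV-1, ..., 1, 0), so consecutive passes meet at the ends.
    """
    if local_iter % 2 == 0:
        return list(range(0, nkv, 1))       # start=0, step=+1
    return list(range(nkv - 1, -1, -1))     # start=NKV-1, step=-1
```

Because each pass begins where the previous one ended, back-to-back passes revisit the most recently touched tiles first, which is exactly what keeps them resident in L2.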
Experiments employed a loop structure in which KV tiles Kj and Vj were loaded and processed in the sawtooth order, iterating from ‘start’ to ‘end’ with the chosen ‘step’ value. This approach directly shortens the reuse distance for large streaming buffers such as the KV cache, ensuring that recently accessed data remains available in the L2 cache. The study validated Sawtooth Wavefront Reordering in both CUDA and CuTile environments, comparing its performance against a baseline cyclic approach across batch sizes of 1, 2, 4, and 8. Results demonstrated a substantial reduction in L2 non-compulsory misses, approximately 50% across all tested configurations, translating into significant throughput gains: from approximately 1.3 TFLOPS to 2.4 TFLOPS.
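The reuse-distance effect can be reproduced with a toy LRU model of the L2 cache (a sketch under assumed sizes; the tile count, pass count, and capacity here are illustrative, not the GB10’s real geometry):

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses for an access trace under an LRU cache of `capacity` tiles."""
    cache, misses = OrderedDict(), 0
    for tile in trace:
        if tile in cache:
            cache.move_to_end(tile)          # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[tile] = True
    return misses

NKV, PASSES, CAP = 16, 4, 8                  # illustrative sizes only

# Cyclic baseline: every pass scans 0..NKV-1, thrashing the LRU cache.
cyclic = [j for _ in range(PASSES) for j in range(NKV)]

# Sawtooth: alternate scan direction, so each pass starts on hot tiles.
sawtooth = [j for p in range(PASSES)
              for j in (range(NKV) if p % 2 == 0 else range(NKV - 1, -1, -1))]

compulsory = NKV                             # first touch of each tile
print(lru_misses(cyclic, CAP) - compulsory)    # 48 non-compulsory misses
print(lru_misses(sawtooth, CAP) - compulsory)  # 24 — a 50% reduction
```

In this toy setting the cyclic scan misses on every access, while the sawtooth scan hits on the tiles left behind by the previous pass, halving the non-compulsory misses, consistent with the ~50% reduction the paper reports.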
Further validation involved porting the optimisation to CuTile, NVIDIA’s tile-centric programming environment, implementing both a fully static and a tile-based variant. The tile-based implementation locally advanced the sequence loop by a step of 2, alternating the order to achieve the sawtooth pattern, utilising a tile size of 64 × 64 with a batch size of 8, sequence length of 128 × 1024, and head dimension of 64. Profiling revealed a consistent reduction in total L2 miss count from approximately 370 million to 120 million sectors, a 67% decrease, resulting in performance improvements from ∼61 TFLOPS to ∼69 TFLOPS (a 13% increase). In the causal variant, performance increased from ∼41 TFLOPS to ∼66 TFLOPS (a 60% increase), confirming the effectiveness of the sawtooth pattern even with causal masking.
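As a quick sanity check on the reported figures (all numbers taken from the text above):

```python
# Reported L2 miss counts (millions of sectors) and throughputs (TFLOPS).
miss_drop = (370 - 120) / 370 * 100          # reduction in L2 misses
noncausal = (69 / 61 - 1) * 100              # non-causal speedup
causal = (66 / 41 - 1) * 100                 # causal speedup
print(int(miss_drop), int(noncausal), int(causal))  # 67 13 60
```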
Sawtooth Wavefront Reordering cuts L2 cache misses significantly
Experiments revealed a 50% or greater reduction in L2 misses, demonstrating a significant improvement in data access efficiency. The team meticulously measured L2 cache behaviour using a split-Q dataflow, where Query tiles reside in shared memory while Key and Value tiles are streamed from global memory. Detailed analysis, conducted with sequence lengths of 32K and 128K, quantified the impact of L1 caching on L2 access patterns. Further measurements examined the non-persistent CTA scheduling case, revealing nearly identical L1/L2 behaviour compared to the persistent CTA approach. The L1 hit count remained minimal, confirming that the L1 cache primarily functions as a pass-through buffer for these streaming access patterns.
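The split-Q dataflow itself can be illustrated with a small NumPy sketch: each Q tile stays resident (standing in for shared memory) while K and V tiles are streamed past it with an online-softmax update. This is a minimal model of the dataflow, not the authors’ CUDA kernel; the tile size and function name are ours:

```python
import numpy as np

def flash_attention_splitq(Q, K, V, tile=4):
    """Split-Q attention sketch: Q tiles resident, K/V tiles streamed."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    for qs in range(0, n, tile):
        q = Q[qs:qs + tile]                  # resident query tile
        m = np.full(q.shape[0], -np.inf)     # running row maximum
        l = np.zeros(q.shape[0])             # running softmax denominator
        acc = np.zeros((q.shape[0], d))      # unnormalised output accumulator
        for ks in range(0, n, tile):         # streamed K/V tiles
            s = q @ K[ks:ks + tile].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)        # rescale previous partial sums
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[ks:ks + tile]
            m = m_new
        out[qs:qs + tile] = acc / l[:, None]
    return out
```

Only the small Q tile and running statistics persist across the inner loop; the K/V tiles are touched once per pass, which is precisely the streaming pattern whose L2 behaviour the sawtooth reordering improves.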
This simplification allowed the researchers to focus directly on optimising L2 access patterns. The data show that the technique reduces the total number of sectors requested and increases the fraction of those requests served from cache. Future work could explore applying the technique to other memory-bound kernels and architectures, potentially broadening its impact on high-performance computing.
👉 More information
🗞 Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
🧠 ArXiv: https://arxiv.org/abs/2601.16032
