Tawa Compiler Automates Warp Specialization for Modern GPUs with Asynchronous References

Modern graphics processing units (GPUs) contain specialised hardware designed for efficient, asynchronous data processing, yet current programming methods struggle to fully utilise this capability. Hongzheng Chen, Bin Fan, and Alexander Collins, alongside Bastian Hagedorn, Evghenii Gaburov, and Masahiro Masuda, address this challenge with Tawa, an automated compiler that generates high-performance code specifically tailored for these advanced GPUs. The team’s innovation lies in a new programming abstraction called asynchronous references, which simplifies the complex communication between different parts of the GPU, allowing the compiler to automatically optimise data flow. This approach delivers significant performance gains, achieving speedups of up to 1.1× over highly optimised cuBLAS kernels and matching the performance of hand-tuned CUTLASS FlashAttention-3 implementations, all while dramatically reducing the programming effort required to unlock peak GPU performance.

Triton Compiler Optimizes GPU Tensor Performance

This research details a new compiler infrastructure, built upon the Triton language, that optimizes tensor computations on GPUs. The goal is to bridge the gap between high-level tensor programming and efficient utilization of GPU hardware, addressing the challenge of achieving peak performance on modern devices. Triton lets programmers express tensor computations with explicit control over parallelism, the memory hierarchy, and data layout, while the compiler automates optimization by introducing temporal abstractions that manage data dependencies and parallelism across time steps. The compiler focuses on optimizing data movement between levels of the memory hierarchy, such as global memory, shared memory, and registers, to minimize communication overhead, and it incorporates hardware-specific optimizations to adapt to different GPU architectures.
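To ground the tile-based programming style described above, the sketch below shows a generic Triton matrix-multiply kernel of the kind such compilers consume. It is a tutorial-style illustration, not code from the paper; the block sizes, tensor names, and launcher are assumptions chosen for the example. Each program instance owns one output tile, streams blocks of A and B through registers, accumulates with tl.dot, and writes its tile back, making the movement through the memory hierarchy explicit.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,           # pointers to A (MxK), B (KxN), C (MxN)
    M, N, K,
    stride_am, stride_ak,          # strides let the kernel handle any layout
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program instance owns one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Pointers to the first K-block of A and B for this output tile.
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Explicit tile loads: masks guard the ragged edges of M, N, and K.
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] < K - k), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] < K - k) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)            # tensor-core matmul on the current tiles
        a_ptrs += BLOCK_K * stride_ak  # advance both tiles along K
        b_ptrs += BLOCK_K * stride_bk

    # Write the finished tile back to global memory.
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=c_mask)


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Host-side launcher: one program per output tile.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32   # illustrative tile sizes
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c
```

Calling matmul(a, b) on CUDA tensors launches one program per output tile; a warp-specializing compiler in the spirit of Tawa would, per the paper's description, additionally split each such program into producer and consumer warp roles behind the scenes.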

The compiler automatically parallelizes tensor operations to maximize GPU utilization and is built on MLIR, a flexible compiler infrastructure, for modularity, extensibility, and integration with other compiler tools. Evaluations on a variety of tensor operations and deep learning models demonstrate significant performance improvements over state-of-the-art compilers such as TVM and CUTLASS, particularly for dynamic models. The research highlights Triton as a promising language for expressing tensor computations at the right level of abstraction and control, and underscores the importance of data-movement optimization for achieving high performance on GPUs.

Automated GPU Code Generation via Asynchronous References

Scientists developed Tawa, an automated compiler designed to unlock the full potential of modern GPUs by generating highly optimized, warp-specialized code from high-level, tile-based programs. Recognizing that conventional programming models often fail to exploit the asynchronous dataflow execution capabilities of current hardware, the team engineered a novel approach centered on an “asynchronous references” (aref) abstraction, which expresses warp-level communication without exposing low-level hardware details. The study pioneered a method for automatically managing the overlap of data movement and computation, which is particularly crucial for maximizing performance on Hopper-architecture GPUs. Experiments employed an NVIDIA H100 SXM5 GPU with 80 GB of HBM3 memory and CUDA 12.7, with rigorous statistical analysis to ensure reliable results. Results demonstrate that Tawa achieves up to 79% hardware utilization, delivering speedups of up to 1.1× over highly optimized cuBLAS GEMM kernels and matching the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel. For attention workloads, Tawa attained a 1.2× speedup over Triton, while for FP8 GEMM at smaller K values it achieved up to a 3.99× performance increase over TileLang, highlighting the effectiveness of the aref-based partitioning and automatic warp specialization.
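The paper's concrete aref API is not reproduced in this summary, so the sketch below is only a CPU-side analogy of the idea: an asynchronous reference behaves like a fixed-depth, typed channel through which producer (load) warps hand tiles to consumer (compute) warps, letting the two roles overlap in time. The class name Aref, the put/get methods, the pipeline depth, and the use of Python threads are all illustrative assumptions, not the actual Tawa interface.

```python
# CPU-side analogy of the aref (asynchronous reference) idea: a bounded,
# typed channel that decouples the warps that load tiles from the warps
# that compute on them. The real abstraction targets GPU warps; Python
# threads are used here purely to illustrate the producer/consumer split.
import queue
import threading
import numpy as np


class Aref:
    """Hypothetical stand-in for an asynchronous reference: a fixed-depth
    buffer that producers fill and consumers drain, so loading and
    computing can overlap."""
    def __init__(self, depth: int):
        self._slots = queue.Queue(maxsize=depth)  # depth ~ pipeline stages

    def put(self, tile):   # producer side: blocks only when every slot is full
        self._slots.put(tile)

    def get(self):         # consumer side: blocks only when no slot is ready
        return self._slots.get()


def producer(aref: Aref, a: np.ndarray, tile_rows: int):
    # "Load" role: stream row tiles of A into the aref.
    for r in range(0, a.shape[0], tile_rows):
        aref.put(a[r:r + tile_rows])
    aref.put(None)  # sentinel: no more tiles


def consumer(aref: Aref, b: np.ndarray, out: list):
    # "Compute" role: consume tiles as they become ready and do the math.
    while (tile := aref.get()) is not None:
        out.append(tile @ b)


if __name__ == "__main__":
    a, b = np.random.rand(256, 64), np.random.rand(64, 32)
    ref, results = Aref(depth=2), []
    t1 = threading.Thread(target=producer, args=(ref, a, 64))
    t2 = threading.Thread(target=consumer, args=(ref, b, results))
    t1.start(); t2.start(); t1.join(); t2.join()
    assert np.allclose(np.vstack(results), a @ b)
```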

Tawa Compiler Achieves GPU Speedup with Asynchronous References

The research team presents Tawa, an automated compiler designed to unlock the full potential of modern GPUs by generating highly efficient, warp-specialized code from high-level tile-based programs. Conventional programming models often misalign with the asynchronous, task-parallel hardware, creating a significant challenge for developers; Tawa addresses this by automatically managing complex dataflow pipelines, removing the need for manual kernel rewriting. Central to this achievement is a novel abstraction called asynchronous references, or aref, which expresses warp-level communication without exposing low-level hardware details, considerably simplifying the programming process. Experiments on H100 GPUs demonstrate that Tawa achieves a 1.1× speedup over highly optimized cuBLAS GEMM kernels, showcasing its ability to deliver substantial performance gains. For attention workloads, the compiler attains a 1.2× speedup over Triton while matching the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel, a significant accomplishment given the extensive manual effort typically required for such optimization. The system achieves this performance by systematically lowering high-level asynchronous references into explicit synchronization and memory-transfer instructions that execute directly on the GPU. Further optimizations integrated into the compilation flow include cooperative compute warp groups, which let multiple warps collaboratively compute the same tile, and persistent kernels, which reduce launch overhead by keeping CTAs resident throughout execution. These advances show that Tawa can benefit programs automatically, without manual restructuring, representing a significant step forward in GPU programming and unlocking new levels of performance for demanding workloads.
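As a rough model of what that lowering produces, the sketch below spells out the handshake an aref is conceptually compiled into: a small ring of buffer slots, each guarded by an "empty" and a "full" flag that the producer and consumer roles alternately wait on and signal. On a Hopper GPU these would be shared-memory tiles, asynchronous copies, and barrier arrive/wait instructions; the Python threading version is only an illustration of the protocol, and the names, tile shapes, and pipeline depth are assumptions made for the example.

```python
# Sketch of the synchronization protocol an aref is conceptually lowered
# to: a fixed ring of buffer slots, each guarded by an "empty" and a
# "full" flag. On a real GPU these would be shared-memory tiles,
# asynchronous copies, and barrier arrive/wait instructions; the
# threading.Event objects below only model the handshake.
import threading
import numpy as np

DEPTH = 2                                    # number of in-flight stages
slots = [None] * DEPTH                       # stand-in for shared-memory tiles
empty = [threading.Event() for _ in range(DEPTH)]
full = [threading.Event() for _ in range(DEPTH)]
for e in empty:
    e.set()                                  # every slot starts out empty


def producer_warp(tiles):
    # Load role: wait for a slot to be empty, fill it, signal it is full.
    for i, tile in enumerate(tiles):
        s = i % DEPTH
        empty[s].wait(); empty[s].clear()    # ~ wait on the "empty" barrier
        slots[s] = tile                      # ~ async copy into shared memory
        full[s].set()                        # ~ arrive on the "full" barrier


def consumer_warp(num_tiles, b, out):
    # Compute role: wait for a slot to be full, compute, release the slot.
    for i in range(num_tiles):
        s = i % DEPTH
        full[s].wait(); full[s].clear()      # ~ wait on the "full" barrier
        out.append(slots[s] @ b)             # tensor-core work would go here
        empty[s].set()                       # ~ arrive on the "empty" barrier


if __name__ == "__main__":
    a, b = np.random.rand(8, 16, 16), np.random.rand(16, 16)
    results = []
    t1 = threading.Thread(target=producer_warp, args=(list(a),))
    t2 = threading.Thread(target=consumer_warp, args=(len(a), b, results))
    t1.start(); t2.start(); t1.join(); t2.join()
    assert np.allclose(np.stack(results), a @ b)
```

A persistent-kernel variant would simply keep this loop alive across many output tiles per CTA instead of relaunching, which is how the paper describes launch overhead being reduced.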

Tawa Achieves Warp Specialization and Speedups

Tawa represents a significant advance in GPU programming by automating the generation of high-performance, warp-specialized code from high-level tile-based programs. The research demonstrates that Tawa effectively bridges the gap between the task-parallel hardware of modern GPUs and the conventional SIMT programming model, which often fails to fully utilize available resources. Central to this achievement is the introduction of asynchronous references, or aref, a novel abstraction that manages warp-level communication without exposing low-level hardware details, thereby simplifying the development process. Evaluations across a range of large language model kernels show that Tawa delivers substantial performance gains, achieving speedups of up to 1.1× compared to highly optimized cuBLAS GEMM kernels and matching the performance of hand-optimized CUTLASS FlashAttention-3 with considerably less programming effort. These benefits extend across precision levels, from FP16 to FP8, and across both noncausal and causal attention semantics, indicating the generality of Tawa’s automatic warp-specialization policies. Further investigation reveals that larger asynchronous reference sizes and moderate pipeline depths consistently improve performance, while persistent kernels enhance stability and maximize cache reuse.

👉 More information
🗞 Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References
🧠 ArXiv: https://arxiv.org/abs/2510.14719
