Modern graphics processing units (GPUs) are becoming increasingly complex, incorporating powerful dedicated units alongside their parallel cores, and realizing their full potential requires sophisticated software scheduling. Rupanshu Soi, Rohan Yadav, and Fredrik Kjolstad, from Stanford University, along with colleagues including Alex Aiken from Stanford and Maryam Mehri Dehnavi and Michael Garland from NVIDIA, address the challenge of optimally combining software pipelining and warp specialization, two techniques commonly used to maximize GPU performance. Their work formulates both scheduling problems as a single, holistic optimization, so that optimal schedules can be derived automatically with existing constraint solvers. The result is Twill, a system that matches manually tuned schedules for demanding applications like Flash Attention on the latest Hopper and Blackwell GPU architectures and, in doing so, proves those schedules optimal. The authors present this as a significant step towards fully automated GPU performance optimization.

Software Pipelining Optimizes GPU Deep Learning

This research details a system for optimizing deep learning performance on modern GPUs through software pipelining. Existing compilers often struggle to exploit the full potential of these GPUs, particularly for complex operations, and manual optimization is time-consuming and requires specialized expertise. The researchers introduce a system that explicitly controls the execution stages of a computation on the GPU, using a new intermediate representation to expose parallelism and data dependencies. The system automatically schedules operations to maximize throughput and minimize latency, and it uses warp specialization to further improve performance and to support modern GPU features such as Tensor Cores and asynchronous memory access.
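To make the benefit of software pipelining concrete, the following is a minimal timing sketch, not code from the paper. It assumes a loop with one load unit and one compute unit; the cycle counts `load` and `compute` and the `buffers` parameter are hypothetical values chosen for illustration.

```python
def sequential_cycles(n_iters, load, compute):
    """Each iteration loads its tile and then computes on it, with no overlap."""
    return n_iters * (load + compute)


def pipelined_cycles(n_iters, load, compute, buffers=2):
    """Overlap the load of a later iteration with the compute of an earlier one.

    Assumptions (illustrative only): one load unit, one compute unit, and
    `buffers` tile slots; a slot can be refilled only after the compute
    step that read it has finished.
    """
    compute_done = []  # finish time of the compute step of each iteration
    load_free = 0      # time at which the load unit becomes idle
    for i in range(n_iters):
        # Slot i % buffers is reusable once iteration i - buffers has computed.
        slot_free = compute_done[i - buffers] if i >= buffers else 0
        load_end = max(load_free, slot_free) + load
        load_free = load_end
        prev_compute = compute_done[-1] if compute_done else 0
        compute_done.append(max(load_end, prev_compute) + compute)
    return compute_done[-1] if compute_done else 0


if __name__ == "__main__":
    print(sequential_cycles(4, load=3, compute=5))  # 32
    print(pipelined_cycles(4, load=3, compute=5))   # 23
```

With two buffers the loads hide behind the compute steps, so four iterations take 23 cycles instead of 32. With a single buffer the model falls back to the sequential time, which is why buffer capacity enters the scheduling problem.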

The system achieves significant speedups on various deep learning benchmarks, including attention mechanisms, and offers increased flexibility for adapting to different GPU architectures and workloads. It reduces the need for manual optimization by automating scheduling and warp specialization, and it introduces novel techniques for representing and scheduling computations on GPUs. The system builds upon existing compiler infrastructure and employs static analysis and runtime information to optimize the pipeline, supporting techniques such as loop tiling, data layout transformations, and memory access coalescing.

Twill Optimizes GPU Schedules, Matches Experts

The researchers have developed Twill, a system that automatically optimizes software schedules for modern GPUs running computationally intensive tasks. Twill formulates software pipelining and warp specialization as a joint optimization problem and solves it with standard constraint solvers, which guarantees that the resulting schedules are optimal. The system analyzes straight-line code derived from modulo scheduling and builds a system of constraints that ensures both data dependencies and functional unit limits are respected. Twill also addresses the challenges of fitting large working sets into the GPU’s register file and of managing variable-latency operations, which lets developers reach high performance with minimal manual effort.
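The kind of constraints described above can be illustrated with a toy modulo scheduler. This is a sketch under stated assumptions, not Twill's formulation: the loop body in `OPS`, the latencies, and the unit names are hypothetical, and an exhaustive search stands in for the constraint solver the paper uses.

```python
from itertools import product

# Hypothetical loop body: name -> (functional unit, result latency, cycles the unit is busy)
OPS = {
    "load": ("tma", 4, 1),
    "mma1": ("tensor", 3, 2),
    "exp": ("alu", 2, 1),
    "mma2": ("tensor", 3, 2),
}
# (producer, consumer, iteration distance); distance 0 means same iteration
DEPS = [("load", "mma1", 0), ("mma1", "exp", 0), ("exp", "mma2", 0)]


def is_valid(start, ii):
    """Check one candidate schedule against both constraint families."""
    # Data dependencies: a consumer may not start before its input is ready.
    for src, dst, dist in DEPS:
        if start[dst] + ii * dist < start[src] + OPS[src][1]:
            return False
    # Functional units: busy slots, taken modulo ii, must not collide.
    used = set()
    for name, (unit, _latency, busy) in OPS.items():
        for k in range(busy):
            slot = (unit, (start[name] + k) % ii)
            if slot in used:
                return False
            used.add(slot)
    return True


def find_schedule(max_ii=8, horizon=12):
    """Return the smallest feasible initiation interval and a shortest schedule for it."""
    names = list(OPS)
    for ii in range(1, max_ii + 1):
        best = None
        for times in product(range(horizon), repeat=len(names)):
            start = dict(zip(names, times))
            if is_valid(start, ii):
                length = max(start[n] + OPS[n][1] for n in names)
                if best is None or length < best[0]:
                    best = (length, start)
        if best is not None:
            return ii, best[1]
    return None


if __name__ == "__main__":
    print(find_schedule())
```

In this toy instance the two matrix multiplies keep the tensor unit busy for four cycles per iteration, so no initiation interval below 4 is feasible and the search returns 4. Because the search tries every candidate, the answer is optimal for the model by construction, which mirrors in miniature the guarantee a constraint solver provides.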

Twill Optimizes GPU Pipelining and Specialization

By treating software pipelining and warp specialization, traditionally separate optimization techniques, as a single unified problem, Twill leverages established constraint solvers to derive complex schedules that maximize hardware utilization. Twill’s heuristic-free design and extensibility promise to simplify the process of adapting software to new GPU architectures, reducing reliance on expert intuition and manual tuning.
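A small sketch can show why warp assignment and pipelining interact. It is a deliberately simplified model, not the paper's: the work items and cycle counts in `WORK` are hypothetical, each item is assumed to occupy its issuing warp for the stated number of cycles, and cross-warp dependencies are assumed to be hidden by pipelining one iteration ahead. Under those assumptions the steady-state interval per iteration equals the load of the busiest warp, so the choice of assignment directly sets the achievable interval.

```python
from itertools import product

# Hypothetical per-iteration work items: name -> cycles the issuing warp is occupied
WORK = {"load_k": 2, "load_v": 2, "qk_mma": 6, "softmax": 5, "pv_mma": 6}


def best_assignment(num_warps):
    """Try every mapping of work items to warps and keep the best one.

    Returns (interval, assignment), where interval is the load of the
    busiest warp, which is the steady-state cycles per iteration in this model.
    """
    names = list(WORK)
    best = None
    for choice in product(range(num_warps), repeat=len(names)):
        load = [0] * num_warps
        for name, warp in zip(names, choice):
            load[warp] += WORK[name]
        interval = max(load)
        if best is None or interval < best[0]:
            best = (interval, dict(zip(names, choice)))
    return best


if __name__ == "__main__":
    for warps in (1, 2, 3):
        interval, assignment = best_assignment(warps)
        print(warps, interval, assignment)
```

In this model one warp needs 21 cycles per iteration, two specialized warps need 11, and three need 8. A real formulation must also account for dependencies, buffering, and register pressure, which is where a constraint solver becomes necessary.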

👉 More information
🗞 Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs
🧠 ArXiv: https://arxiv.org/abs/2512.18134

Tensor Core GPUs Unlock Performance Gains with Advanced Software Pipelining and Warp Specialization

Rohail T.
