Researchers are tackling a critical challenge in large language model (LLM) training: maintaining reproducibility without sacrificing performance. Xinwei Qiang, Hongmin Chen, and Shixuan Sun, from Shanghai Jiao Tong University and ByteDance Seed, together with Jingwen Leng, Xin Liu, Minyi Guo, and colleagues, demonstrate that deterministic attention, essential for consistent results, can reduce training throughput by as much as 37.9% in some implementations. Their work introduces DASH (Deterministic Attention Scheduling for High-Throughput), a novel approach that formulates deterministic attention as a scheduling problem and derives schedules that minimise computational bottlenecks. By optimising the order of operations, DASH improves throughput by up to 1.28×, representing a substantial step towards efficient and reproducible LLM training.
The team addressed a critical challenge in LLM training: the substantial performance cost associated with ensuring deterministic, bitwise identical results across multiple runs. In widely used attention implementations like FlashAttention-3, enabling deterministic backward passes can reduce throughput by up to 37.9% due to the serialisation of gradient accumulation operations required for numerical consistency. This performance loss arises from inefficient scheduling of compute and gradient-reduction phases, leading to underutilisation of hardware resources.
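To see why determinism forces a fixed accumulation order in the first place, recall that floating-point addition is not associative: summing the same partial gradients in a different order can change the low-order bits of the result. The minimal Python sketch below (illustrative only, not taken from the paper) shows how a shuffled accumulation order, like the one produced by non-deterministic atomic adds, can yield a bitwise-different float32 sum.

```python
# Illustrative sketch: floating-point addition is non-associative, so the
# order in which per-KV-block partial gradients are accumulated changes the
# bitwise result. Deterministic kernels therefore fix this order, which is
# what forces the serialisation discussed above. (Values are synthetic.)
import numpy as np

rng = np.random.default_rng(0)
partials = rng.standard_normal(1024).astype(np.float32)  # stand-in for per-block dQ contributions

fixed_order = np.float32(0.0)
for p in partials:            # fixed traversal: block 0, 1, 2, ...
    fixed_order += p

shuffled = partials.copy()
rng.shuffle(shuffled)         # models a different arrival order of atomic adds
other_order = np.float32(0.0)
for p in shuffled:
    other_order += p

print(fixed_order == other_order)                    # typically False in float32
print(abs(float(fixed_order) - float(other_order)))  # small but non-zero discrepancy
```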
Researchers formulated the deterministic attention backward pass as a scheduling problem on a Directed Acyclic Graph (DAG), deriving schedules designed to minimise the critical path length. Building upon this formulation, they present DASH (Deterministic Attention Scheduling for High-Throughput), a framework incorporating two complementary scheduling strategies. The first, Descending Q-Tile Iteration, employs a reversed query-block traversal to reduce pipeline stalls specifically in causal attention. The second, Shift Scheduling, is a theoretically optimal schedule within the DAG model, reducing pipeline stalls for both full and causal attention masks.
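To make the scheduling framing concrete, the sketch below builds a tiny dependency graph by hand and computes its critical path length, the quantity such schedules aim to minimise. The node names, unit costs, and graph construction here are illustrative assumptions, not the paper's actual cost model.

```python
# Toy DAG model (illustrative; node names and costs are assumptions, not the
# paper's cost model). Compute tiles and serial reduction steps are nodes,
# edges mean "must finish before", and schedule quality is bounded below by
# the longest (critical) path through the graph.
from collections import defaultdict

def critical_path_length(edges, cost):
    """Length of the longest path through a DAG, given per-node costs."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    order = [n for n in cost if indeg[n] == 0]   # Kahn's algorithm
    frontier = list(order)
    while frontier:
        n = frontier.pop()
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                order.append(m)
                frontier.append(m)
    finish = {}
    for n in order:
        start = max((finish[u] for u, v in edges if v == n), default=0)
        finish[n] = start + cost[n]
    return max(finish.values())

# Two KV tiles each compute a partial dQ for one Q tile; the two reductions
# into dQ must then happen in a fixed order to stay deterministic.
cost  = {"compute_kv0": 2, "compute_kv1": 2, "reduce_kv0": 1, "reduce_kv1": 1}
edges = [("compute_kv0", "reduce_kv0"),
         ("compute_kv1", "reduce_kv1"),
         ("reduce_kv0", "reduce_kv1")]    # fixed accumulation order
print(critical_path_length(edges, cost))  # -> 4 with these toy costs
```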
This approach tackles the misalignment between tile execution and accumulation ordering, identified as the primary cause of the performance degradation. Empirical evaluations conducted on NVIDIA H800 GPUs demonstrate that DASH effectively narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28× compared to the baseline, a significant advancement in the efficiency of reproducible LLM training. The improvement comes from parallelising the reduction process: CTAs begin reduction on different tiles concurrently, avoiding the bottlenecks caused by strictly sequential reductions.
The work establishes a new benchmark for deterministic LLM training, offering a pathway to more efficient and reliable large-scale model development. The research identifies that the performance gap isn’t inherent to serialisation, but rather a consequence of suboptimal tile scheduling and a rigid accumulation order. By modelling the deterministic backward pass as a DAG, the team was able to design strategies that optimise the critical path length, ensuring a more balanced workload and reducing contention during serial reduction operations. This DAG-based formalisation represents a novel contribution, enabling principled optimisation of the scheduling process. The open-sourcing of the DASH code at https://github.com/SJTU-Liquid/deterministic-FA3 facilitates further research and adoption within the LLM community.
DAG Scheduling Optimises Deterministic Attention Backward Passes
Scientists addressed the performance limitations of deterministic attention in large language model training, a crucial aspect for reproducibility. The research team identified that deterministic backward passes in attention implementations like FlashAttention-3 can experience up to a 37.9% throughput reduction compared to non-deterministic counterparts due to serialisation of gradient accumulation. To overcome this, they formulated the deterministic attention backward pass as a scheduling problem on a Directed Acyclic Graph (DAG), aiming to minimise the critical path length. This innovative approach enabled the development of DASH (Deterministic Attention Scheduling for High-Throughput), incorporating two complementary strategies.
The core of DASH lies in Descending Q-Tile Iteration, a reversed query-block traversal designed to reduce pipeline stalls specifically in causal attention. By processing query blocks in reverse order, the method improves data flow and reduces the time CTAs spend waiting on cross-block dependencies. Complementing this, Shift Scheduling provides a theoretically optimal schedule within the DAG model, reducing pipeline stalls for both full and causal attention masks. Experiments employed NVIDIA H800 GPUs, CUDA 12.6, and Triton 3.4, with all kernels implemented as extensions of FlashAttention-3.
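The effect of the traversal direction can be illustrated with a toy step model (an assumption-laden sketch, not the paper's performance model): each CTA owns one KV block, visits the query tiles the causal mask allows, and may commit its contribution to a dQ tile only after the CTA owning the previous KV block has committed its own, so the accumulation order stays fixed. Under these unit-cost assumptions, the reversed traversal finishes the sweep in far fewer synchronous steps than the natural ascending one.

```python
# Toy model of the causal backward sweep (unit-cost tiles, synchronous steps;
# a simplification, not the paper's model). CTA j owns KV block j and must
# commit dQ contributions for Q tiles j..T-1; to keep the accumulation order
# fixed, CTA j may commit dQ[i] only after CTA j-1 has committed dQ[i].
def simulate(order_fn, T=8):
    todo = {j: list(order_fn(j, T)) for j in range(T)}   # per-CTA Q-tile visit order
    done = [set() for _ in range(T)]                     # Q tiles each CTA has committed
    steps = 0
    while any(todo.values()):
        steps += 1
        ready = []
        for j in range(T):
            if todo[j]:
                i = todo[j][0]
                if j == 0 or i in done[j - 1]:           # accumulation-order constraint
                    ready.append((j, i))
        for j, i in ready:                               # commit synchronously
            todo[j].pop(0)
            done[j].add(i)
    return steps

ascending  = lambda j, T: range(j, T)               # natural traversal: Q_j, Q_{j+1}, ...
descending = lambda j, T: range(T - 1, j - 1, -1)   # reversed traversal: Q_{T-1}, ..., Q_j

print("ascending steps :", simulate(ascending))     # 2T - 1 = 15 in this toy model
print("descending steps:", simulate(descending))    # T     = 8
```

In this simplified model, the reversed order helps because each CTA's first target is exactly the tile its predecessor visits first, so every CTA trails its predecessor by only one step instead of waiting for it to sweep past.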
Baseline comparisons were conducted against the deterministic backward pass of FlashAttention-3 and the Triton tutorial’s causal attention implementation. The study benchmarked performance with a fixed total of 16,384 tokens, varying sequence lengths from 512 to 16,384, and tested a hidden dimension of 2,048 with head dimensions of 64 and 128, all using random BF16 inputs. Results demonstrated that DASH improved the throughput of the attention backward pass by up to 1.28× compared to the baseline, significantly enhancing the efficiency of reproducible LLM training. Detailed analysis revealed that at a sequence length of 16,384 and a KV block size of 128, computation was distributed across 128 Streaming Multiprocessors (SMs).
The team observed that inter-SM communication latency, particularly for accesses to remote L2 cache segments (200 to over 500 cycles), became a limiting factor. While Shift Scheduling’s intricate dependency graph offered computational benefits, it proved more sensitive to this communication overhead at extreme parallelism, occasionally exhibiting slight performance degradation. For causal attention masks, both Descending Q-Tile Iteration and Symmetric Shift Scheduling consistently improved throughput, with Symmetric Shift Scheduling demonstrating superior workload balancing at a head dimension of 64.
DASH scheduling improves deterministic LLM training throughput
Scientists have developed a new scheduling framework, DASH (Deterministic Attention Scheduling for High-Throughput), to address performance bottlenecks in deterministic large language model (LLM) training. Deterministic training, crucial for reproducibility, often incurs a significant performance cost, with throughput reductions of up to 37.9% observed in widely used attention implementations like FlashAttention-3. This loss arises from serialising gradient accumulation operations to ensure numerical consistency, leading to hardware underutilisation. Researchers formulated the deterministic attention backward pass as a scheduling problem on a Directed Acyclic Graph (DAG) to minimise the critical path length.
Experiments revealed that DASH narrows the performance gap of deterministic attention, improving throughput by up to 1.28× compared to the baseline. The team measured this improvement on NVIDIA H800 GPUs, demonstrating a substantial advancement in the efficiency of reproducible LLM training. The speedup of the deterministic attention backward pass addresses a key challenge in scaling LLMs across thousands of GPUs. The data show that the performance degradation in deterministic FlashAttention stems from a misalignment between tile execution and accumulation ordering, creating bottlenecks in the reduction process.
The work introduces two complementary scheduling strategies: Descending Q-Tile Iteration and Shift Scheduling. Descending Q-Tile Iteration, a reversed query-block traversal, shrinks pipeline stalls in causal attention, while Shift Scheduling, a theoretically optimal schedule within the DAG model, reduces pipeline stalls for both full and causal masks. Measurements confirm that Shift Scheduling employs a phase-shifted assignment of computational tasks, creating a perfectly staggered execution pattern that approaches the model’s theoretical utilisation bound. Tests show that DASH effectively parallelises the reduction process, allowing CTAs to begin reduction on different tiles concurrently, unlike the naive schedule, which forces sequential reductions. The principal source of performance degradation in deterministic attention is the misalignment between tile execution and accumulation ordering. The study therefore provides a DAG-based formalisation of deterministic attention backward scheduling, enabling principled optimisation of the critical path length and ultimately improving the efficiency of LLM training.
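A small sketch of the staggered idea for the full (unmasked) case follows; it is a reconstruction of the phase-shifting concept under simplifying assumptions, not the paper's exact Shift Scheduling algorithm. With T query tiles and T KV-block CTAs, letting CTA j work on Q tile (j + r) mod T in round r means every round touches T distinct dQ accumulators, yet each dQ tile still receives its contributions in one fixed, run-independent order, so determinism is preserved.

```python
# Toy illustration of a phase-shifted ("staggered") schedule for the full
# mask. A reconstruction under simplifying assumptions, not the paper's
# exact Shift Scheduling algorithm.
T = 4  # number of Q tiles == number of KV-block CTAs in this toy setup

naive   = [[r for r in range(T)] for _ in range(T)]            # every CTA: Q0, Q1, Q2, ...
shifted = [[(j + r) % T for r in range(T)] for j in range(T)]  # CTA j starts at Q tile j

def conflicts(schedule):
    """Count (round, CTA) pairs that contend for an already-claimed dQ tile."""
    clashes = 0
    for r in range(T):
        targets = [schedule[j][r] for j in range(T)]
        clashes += T - len(set(targets))
    return clashes

print("naive conflicts per sweep  :", conflicts(naive))    # all CTAs collide every round
print("shifted conflicts per sweep:", conflicts(shifted))  # 0: perfectly staggered

# Each dQ tile still sees its KV-block contributions in one fixed order,
# so the accumulation remains deterministic across runs.
for i in range(T):
    order = sorted(range(T), key=lambda j: (i - j) % T)
    print(f"dQ[{i}] accumulates KV blocks in order {order}")
```

Under the shifted assignment, dQ[i] is accumulated in the order KV block i, i-1, ..., 0, T-1, ..., i+1, which is fixed ahead of time and therefore reproducible, while no two CTAs ever reduce into the same tile in the same round.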
DASH framework accelerates deterministic attention on GPUs
Scientists have addressed the performance penalty associated with deterministic backward passes in attention mechanisms, a crucial aspect of large language model (LLM) training. They formulated the computation as a scheduling problem on a Directed Acyclic Graph (DAG), introducing DASH, a framework with two scheduling strategies: Descending Q-Tile Iteration and Shift Scheduling. Descending Q-Tile Iteration accelerates causal attention through a reversed query-block traversal, while Shift Scheduling provides a theoretically optimal schedule within the DAG model to reduce pipeline stalls for both full and causal masks. Empirical evaluations on NVIDIA H800 GPUs demonstrated that DASH significantly narrows the performance gap of deterministic attention, improving throughput by up to 1.28× compared to baseline methods.
This advancement contributes to more efficient and reproducible LLM training. The authors acknowledge that theoretical optimality does not always translate to practical superiority, identifying hardware limitations like register pressure and inter-SM communication latency as critical factors influencing performance. Future work could explore further optimisation considering these hardware realities. The research highlights the importance of co-optimising execution and accumulation order, rather than solely focusing on bandwidth or memory. By providing a suite of solutions, DASH enables practitioners to achieve high throughput attention while maintaining reproducibility in LLM training. The findings suggest that a nuanced approach, balancing theoretical optimality with practical hardware constraints, is essential for maximising performance in this domain.
👉 More information
🗞 DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
🧠 ArXiv: https://arxiv.org/abs/2601.21824
