Diffusion LLM Sampling Achieves 70% Latency Reduction with Novel NPU Design

Researchers are tackling a significant bottleneck in the rapidly evolving field of artificial intelligence: the inefficient sampling phase of Diffusion Large Language Models (dLLMs). Binglei Lou from Imperial College London, alongside Haoran Wu and Yao Lai from the University of Cambridge, and colleagues, demonstrate that sampling can contribute up to 70% of total inference latency, largely due to memory access challenges. Their work identifies key instructions for Neural Processing Unit (NPU) optimisation and presents a novel design utilising lightweight vector primitives and in-place memory reuse, achieving up to a 2.53x speedup compared to a high-end NVIDIA GPU. This research is significant because it moves beyond traditional GEMM-centric NPU designs, paving the way for more efficient hardware acceleration of dLLMs and ultimately, faster and more accessible AI applications.

The team's central insight is the structural mismatch between the non-GEMM sampling workload and the GEMM-centric execution pipelines prevalent in modern NPUs.

The study reveals that existing NPUs, optimised for dense matrix computations, struggle with the control-intensive, reduction-heavy, and memory-irregular operations inherent in diffusion sampling. To overcome this limitation, the researchers developed d-PLENA, a vector-scalar-centric architectural extension designed for efficient on-NPU execution of dLLM sampling. This innovative design employs lightweight non-GEMM vector primitives, enabling in-place computation and phased memory reuse, while maintaining numerical equivalence to standard implementations. Furthermore, d-PLENA incorporates a decoupled mixed-precision memory hierarchy, separating floating-point and integer data domains to reduce memory fragmentation and control-path interference.
Experiments show that these combined optimisations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent technology node. The research establishes a hardware-friendly execution flow for softmax-based diffusion sampling, streamlining the process and minimising computational overhead. A key innovation lies in the proposed set of ISA primitives specifically designed to accelerate ArgMax, Top-k selection, and masked token updates, essential operations for efficient diffusion sampling. The team also open-sourced their cycle-accurate simulation and post-synthesis RTL verification code, ensuring reproducibility and facilitating further research in this area.

This work opens new avenues for accelerating dLLMs, potentially enabling real-time applications and reducing the computational cost of large language model inference. By addressing the long-tail bottleneck of the sampling stage, the researchers have paved the way for more efficient and scalable dLLM deployments. The decoupling of memory domains and the use of lightweight vector primitives represent a significant departure from traditional NPU designs, offering a promising path towards hardware acceleration tailored to the unique demands of diffusion-based language models.

Profiling revealed that sampling accounts for up to 70% of total inference latency, stemming from vocabulary-wide logit loads, reduction-based token selection, and iterative masked updates. To overcome these limitations, the research team developed a multi-domain storage hierarchy, utilising High Bandwidth Memory (HBM) for large tensors in MX format and decoupling on-chip storage into Vector, Floating-Point (FP), and Integer (Int) SRAM. Logits are streamed from HBM into Vector SRAM via a dedicated Dequantizer, converting MX-encoded data into a configurable floating-point format, specifically BF16.
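To make the Dequantizer's role concrete, the following minimal PyTorch sketch performs a block-scaled, MX-style dequantization into BF16; the block size, the int8 element type, and the power-of-two scale encoding are assumptions for illustration, not the paper's exact configuration.

import torch

def dequantize_mx_block(elements_int8: torch.Tensor, block_exponents: torch.Tensor,
                        block_size: int = 32) -> torch.Tensor:
    # Hypothetical MX-style dequantization: each block of `block_size` int8 elements
    # shares one power-of-two scale; the output is BF16, matching the configurable
    # floating-point format the Dequantizer emits into Vector SRAM.
    values = elements_int8.to(torch.float32).view(-1, block_size)
    scales = torch.pow(2.0, block_exponents.to(torch.float32)).unsqueeze(-1)
    return (values * scales).view(-1).to(torch.bfloat16)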
The system's execution core comprises specialized compute units coordinated by an instruction decoder, featuring a Vector Unit with a Reduction Unit and an Elementwise Unit. The Reduction Unit processes data in chunks of V_LEN, performing operations like Max and Sum to generate scalar outputs forwarded to the FP or Int units. The Elementwise Unit maintains vector dimensionality, supporting multi-operand vector operations. To accelerate non-linear kernels, the FP Unit provides hardware support for transcendental functions, including exponential and reciprocal calculations, with results broadcast back to the Vector Unit or buffered in FP SRAM.
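The division of labour between these units can be sketched in a few lines of PyTorch; V_LEN, the tensor shapes, and the variable names below are illustrative assumptions rather than the paper's RTL.

import torch

V_LEN = 64  # assumed lane width; the paper evaluates several vector lengths
chunk = torch.randn(V_LEN)

# Reduction Unit: consumes a V_LEN chunk and emits scalars (Max, Sum)
# that are forwarded to the FP or Int units.
chunk_max, chunk_sum = chunk.max(), chunk.sum()

# Elementwise Unit: multi-operand vector operations that preserve the V_LEN shape.
shifted = chunk - chunk_max

# FP Unit: hardware-native transcendentals (exponential, reciprocal), with results
# broadcast back to the Vector Unit or buffered in FP SRAM.
exp_shifted = torch.exp(shifted)
inv_sum = torch.reciprocal(exp_shifted.sum())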

Researchers streamlined the sampling flow into a hardware-friendly Stable-Max formulation, decomposing computations into atomic primitives mapped to dedicated hardware modules. This optimized method replaces conventional softmax with a process that extracts the maximum value and accumulates shifted exponentials, writing intermediate values in-place within the Vector SRAM to maximise memory utilisation. The study pioneered a set of ISA extensions, summarised in Table I of the work, to facilitate dLLM execution and phased sampling. Algorithm 2 details the intra-block sampling flow, beginning with a blocked slice of the prompt and mask tokens as input.
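A minimal PyTorch sketch of this Stable-Max selection for a single position is shown below; the function name is illustrative, and the sketch only claims numerical equivalence to softmax followed by max/argmax, as described above.

import torch

def stable_max_sample(logits: torch.Tensor):
    # Stable-Max style selection: extract the maximum, accumulate the sum of
    # shifted exponentials, and normalise only the winning entry, rather than
    # materialising a full softmax distribution.
    m, idx = logits.max(dim=-1, keepdim=True)         # max value and argmax
    exp_shifted = torch.exp(logits - m)               # on d-PLENA this overwrites the logits buffer in Vector SRAM
    sum_exp = exp_shifted.sum(dim=-1, keepdim=True)   # accumulated by the Reduction Unit
    confidence = exp_shifted.gather(-1, idx) / sum_exp
    return idx.squeeze(-1), confidence.squeeze(-1)

Because the shifted exponential of the winning entry is exactly one, the confidence reduces to the reciprocal of sum_exp, which fits naturally with an FP Unit that exposes exponential and reciprocal as its transcendental primitives.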

Within each diffusion step, the model generates logits, and a mask identifies positions for token updates. The team implemented four hardware-visible execution phases: HBM to Vector/Scalar transfer, Scalar processing, Scalar to Vector/Scalar transfer, and final token updates. The first phase preloads data chunks from HBM into Vector SRAM, processing V_chunk-sized slices in segments of V_LEN and applying the Stable-Max method to calculate probabilities and identify the most likely token. The second phase stores the resulting scalar probability values in FP and Int SRAM. The third phase maps these values back to the Vector Unit and sorts the top-k candidates, while the final phase updates the token sequence. This phased approach delivers up to a 2.53x speedup over the GPU baseline.
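Since logits are streamed chunk by chunk, per-chunk maxima and partial sums must be merged into a single vocabulary-wide result. The sketch below uses the standard online-softmax rescaling to do that merge; the article does not spell out exactly how d-PLENA combines chunks, so treat this as an illustrative reconstruction with assumed names and chunk width.

import math
import torch

V_LEN = 64  # assumed chunk width

def streaming_stable_max(logits: torch.Tensor):
    # Walk the vocabulary-sized logit vector in V_LEN chunks, keeping a running
    # max/argmax and a running sum of shifted exponentials; the running sum is
    # rescaled whenever a new maximum appears (online-softmax style combine).
    run_max, run_idx, run_sum = -math.inf, -1, 0.0
    for start in range(0, logits.numel(), V_LEN):
        chunk = logits[start:start + V_LEN]
        c_max, c_off = chunk.max(dim=0)
        c_sum = torch.exp(chunk - c_max).sum().item()
        if c_max.item() > run_max:
            run_sum = run_sum * math.exp(run_max - c_max.item()) + c_sum
            run_max, run_idx = c_max.item(), start + c_off.item()
        else:
            run_sum += c_sum * math.exp(c_max.item() - run_max)
    return run_idx, 1.0 / run_sum   # confidence of the most likely token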

On d-PLENA, the Reduction Unit extracts the maximum value m and accumulates the sum of shifted exponentials sum_exp, while the FP Unit evaluates the transcendental functions. Intermediate exp_shifted values are written in place, overwriting the original logits buffer in Vector SRAM and maintaining high memory utilisation. Measurements confirm that this approach avoids the multiple memory passes and global normalisation required by conventional softmax. The four execution phases are detailed in Algorithm 2. In PHASE ❶, logits are streamed from HBM into Vector SRAM in chunks using PLENA's H_PREFETCH_V instruction.

The confidence scalar x0_p_scalar and token index x0_scalar are computed via the Stable-Max reduction and the V_RED_MAX_IDX instruction, which returns the max() and argmax() results in a single instruction. PHASE ❷ writes these scalar values to FP and Int SRAM using decoupled store instructions. PHASE ❸ reconstructs dense vectors and performs the Top-k comparison using S_MAP_V_FP and V_TOPK_MASK, processing L elements. Finally, PHASE ❹ applies masked token updates in the integer domain with V_SELECT_INT, functionally equivalent to a torch.where() operation. The team also open-sourced cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
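As a point of reference for PHASE ❸ and PHASE ❹, the fragment below reproduces their effect in plain PyTorch: rank the still-masked positions by confidence, keep the k most confident, and commit their candidate tokens via torch.where. The tensor names and the use of a single block of length L are assumptions; only the torch.where equivalence of the final step is stated in the work.

import torch

def topk_masked_update(tokens: torch.Tensor, candidates: torch.Tensor,
                       confidences: torch.Tensor, mask: torch.Tensor, k: int) -> torch.Tensor:
    # Only still-masked positions compete for selection (PHASE 3, cf. V_TOPK_MASK).
    scores = torch.where(mask, confidences, torch.full_like(confidences, float('-inf')))
    select = torch.zeros_like(mask)
    select[scores.topk(k).indices] = True
    # Commit the chosen candidate tokens and leave everything else unchanged
    # (PHASE 4, functionally a torch.where, cf. V_SELECT_INT).
    return torch.where(select, candidates, tokens)

The final write touches only integer data, which lines up with the decoupled FP and Int SRAM domains described earlier.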

NPU design accelerates diffusion model sampling significantly

Researchers have identified significant architectural challenges in the sampling phase of diffusion large language models (dLLMs), demonstrating that it can account for up to 70% of total inference latency. Profiling revealed that substantial memory loads and writes, alongside irregular memory accesses, are primary bottlenecks for conventional neural processing units (NPUs). This work addresses these limitations by proposing a novel NPU design incorporating lightweight vector primitives, in-place memory reuse, and a decoupled mixed-precision memory hierarchy. The proposed design achieves a speedup of up to 2.53x compared to an NVIDIA RTX A6000 GPU under equivalent technology node conditions.

Functional equivalence with existing dLLM PyTorch implementations has been verified through cycle-accurate simulation and post-synthesis RTL verification, alongside detailed area and power evaluations for varying vector lengths. The authors acknowledge that as model-side kernels become increasingly optimised and quantized, the sampling phase will remain a critical factor in end-to-end latency. Future research should explore the broader applicability of these primitives to other reduction and selection-heavy workloads, potentially integrating them as standard features in future NPUs. The findings highlight the need for architectural innovations specifically tailored to the unique demands of dLLM sampling, paving the way for more efficient and scalable diffusion-based language models. The authors suggest that these improvements are particularly important as dLLM adoption increases and optimisation efforts continue.

👉 More information
🗞 Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling
🧠 ArXiv: https://arxiv.org/abs/2601.20706

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
