Diffusion Large Language Models (DLLMs) present a significant computational challenge at decoding time, hindering their widespread application despite their potential. Researchers Kaihua Liang, Xin Tan, and An Zhong, alongside colleagues from King Abdullah University of Science and Technology and The Chinese University of Hong Kong, have identified a core inefficiency in how DLLMs process information during decoding: substantial computation is currently wasted on tokens that are not immediately decodable. To address this, they introduce FOCUS, a novel inference system that dynamically prioritises computation on the most promising tokens, increasing processing speed and scalability. Empirical results demonstrate that FOCUS achieves a throughput improvement of up to 3.52x over existing systems such as LMDeploy without compromising the quality of generated text, a substantial step towards more efficient and accessible DLLM deployment.
This work addresses a key limitation hindering the deployment of DLLMs: the high cost of decoding, driven by inefficient computation at each step.
Researchers identified that while computation is parallelized across token blocks, only a small fraction of tokens are actually decodable at each step, leading to wasted computational resources on non-decodable tokens. The team observed a strong correlation between attention-derived token importance and the probability of a token being decoded, forming the basis for their innovative approach.
FOCUS dynamically concentrates computation on decodable tokens and evicts non-decodable ones during inference, effectively increasing the batch size and alleviating compute limitations. This system enables scalable throughput, a crucial factor for real-world applications of DLLMs. Empirical evaluations reveal that FOCUS achieves up to a 3.52x improvement in throughput compared to the production-grade engine LMDeploy, all while maintaining or even improving the quality of generated text across multiple benchmarks.
The researchers highlight that DLLM inference is fundamentally compute-bound, unlike traditional Auto-Regressive LLMs, which are often memory-bound. The study establishes that DLLMs compute over entire blocks of tokens, yet typically decode only approximately 10% of them at each diffusion step. This inefficiency stems from the need to compute attention for all block-wise query tokens, drastically increasing computational intensity.
FOCUS tackles this by reducing the number of processed tokens per step by 65-80%, thereby mitigating the compute bottleneck. By identifying promising candidate tokens in the early layers and dynamically evicting the rest, FOCUS concentrates resources where they are most effective. Researchers observed a strong correlation between attention-derived token importance and the probability of a token being decoded.
Based on this, the study pioneered FOCUS, an inference system designed to dynamically focus computation on decodable tokens and evict non-decodable ones during inference. This approach increases the effective batch size, alleviating compute limitations and enabling scalable throughput. The team engineered a system that predicts decodable tokens to eliminate computational redundancy, in contrast with existing methods such as LLaDA and Fast-dLLM.
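The article describes FOCUS at a high level rather than giving its interfaces, so the following Python sketch is only one way the reported behaviour could look: predict likely-decodable tokens from early-layer attention, evict the rest for the current step, and commit only high-confidence predictions. The `model.early_layer_importance` and `model.forward_selected` helpers, `keep_fraction`, and `conf_threshold` are hypothetical placeholders, not the authors' API.

```python
# Conceptual sketch of FOCUS-style selective block decoding (not the authors' code).
# Assumes a hypothetical `model` that exposes early-layer attention importance and
# can run the remaining layers on a selected subset of block positions.
import torch

def focus_decode_block(model, block_tokens, keep_fraction=0.3, conf_threshold=0.9):
    """Decode one block while concentrating compute on likely-decodable tokens."""
    active = torch.ones(block_tokens.shape[-1], dtype=torch.bool)  # still-masked positions
    while active.any():
        # 1. Estimate per-position importance from early-layer attention (see the
        #    importance formula later in the article); ignore already-decoded slots.
        importance = model.early_layer_importance(block_tokens, active)
        importance = importance.masked_fill(~active, float("-inf"))

        # 2. Keep only the most promising candidates and evict the rest for this step,
        #    shrinking the per-step workload (the article reports a 65-80% reduction).
        k = max(1, int(keep_fraction * int(active.sum())))
        kept = torch.topk(importance, k).indices

        # 3. Run the remaining layers only on the kept positions and commit those
        #    whose predicted confidence clears the threshold (at least one per step).
        logits = model.forward_selected(block_tokens, kept)   # [k, vocab]
        probs, preds = logits.softmax(-1).max(-1)
        mask = probs > conf_threshold
        if not mask.any():
            mask = torch.zeros_like(mask)
            mask[probs.argmax()] = True
        block_tokens[kept[mask]] = preds[mask]
        active[kept[mask]] = False
    return block_tokens
```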
Experiments employed a Block-Diffusion paradigm, processing tokens in segments while treating preceding sequences as fixed context, enabling exact Key-Value cache reuse similar to Auto-Regressive LLMs. FOCUS further enhances efficiency by predicting decodable tokens, unlike SDAR and LLaDA 2.0, which leverage Block-Diffusion Continual Pre-Training.
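To show where the KV-cache reuse fits, here is a minimal sketch of the outer block-diffusion loop under the same assumptions as above. It reuses the hypothetical `focus_decode_block` from the previous sketch; `MASK_ID`, `BLOCK_SIZE`, and the cache methods are likewise assumptions.

```python
# Hedged sketch of block-diffusion generation with exact prefix KV-cache reuse.
# Reuses focus_decode_block from the sketch above; the model/cache interface is assumed.
import torch

MASK_ID = 0        # hypothetical mask-token id
BLOCK_SIZE = 32    # block size quoted later in the article

def generate(model, prompt_ids, num_blocks):
    kv_cache = model.prefill(prompt_ids)   # fixed context: cached once, never recomputed
    output = prompt_ids
    for _ in range(num_blocks):
        # Start each block fully masked; bi-directional attention is confined to the
        # block, while the preceding sequence is read only through the KV cache.
        block = torch.full((BLOCK_SIZE,), MASK_ID, dtype=torch.long)
        block = focus_decode_block(model.with_cache(kv_cache), block)
        # The finalised block becomes fixed context, so its keys/values are appended
        # to the cache exactly once, with no periodic KV recomputation.
        kv_cache = model.append_to_cache(kv_cache, block)
        output = torch.cat([output, block])
    return output
```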
The research demonstrated that FOCUS achieves up to a 3.52x throughput improvement over the production-grade engine LMDeploy, while maintaining or improving generation quality across multiple benchmarks. The system operates by confining bi-directional attention within block-wise structures, eliminating periodic KV cache recomputation.
Researchers analysed the arithmetic intensity of DLLMs, revealing a shift from memory-bound to compute-bound regimes. This analysis showed that while prior optimizations reduce latency in low-concurrency settings, they hit a ceiling in production-scale scenarios because computation is still performed across the entire block even though only around 10% of its tokens are decodable.
FOCUS reduces the number of processed tokens per step by 65-80%, successfully addressing the compute bottleneck. The study highlights that FOCUS concentrates computational resources on tokens with high decoding probabilities, identified in the early layers, and dynamically evicts the rest. The research identifies a key inefficiency in DLLM decoding: computation is parallelized across the block, but only a small subset of tokens is actually decodable at each step, so compute is wasted on non-decodable tokens.
Experiments revealed that, on average, only 2.24 tokens are decoded per step (standard deviation 3.34, median 1.00) across benchmarks such as HumanEval and MBPP. The data show that approximately 90% of block-wise computation is redundant, as the system computes the entire block despite successfully decoding only around 10% of its tokens.
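As a quick sanity check on the redundancy figure, assuming the 32-token block size mentioned in the next paragraph:

```python
# Back-of-the-envelope check of the ~90% redundancy claim, assuming a 32-token block
# (the block size quoted in the arithmetic-intensity discussion below).
block_size = 32
mean_decoded = 2.24      # reported mean tokens decoded per step
median_decoded = 1.00    # reported median

print(f"mean useful fraction:   {mean_decoded / block_size:.1%}")    # ~7.0%
print(f"median useful fraction: {median_decoded / block_size:.1%}")  # ~3.1%
# Both figures are consistent with roughly 90% of block-wise compute being redundant.
```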
Tests show that processing a block of 32 tokens causes a surge in arithmetic intensity, with FLOPs scaling linearly with the query size Q, approximately Q · (8h² + 4·d_ff·h + 4·h·L), where h is the hidden size, d_ff is the intermediate MLP size, and L is the context length. The team measured that, unlike Auto-Regressive LLMs (ARLLMs), DLLMs yield diminishing returns as batch sizes increase due to a lack of idle compute cycles.
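Plugging illustrative sizes into the quoted FLOPs estimate makes the shift concrete; the 7B-class dimensions below are assumptions, not values from the paper.

```python
# Evaluating the article's FLOPs estimate Q * (8*h^2 + 4*d_ff*h + 4*h*L) for one step.
# The model dimensions are illustrative 7B-class values, not taken from the paper.

def flops_estimate(q, h, d_ff, ctx_len):
    """Approximate FLOPs for q query tokens, per the formula quoted above."""
    return q * (8 * h**2 + 4 * d_ff * h + 4 * h * ctx_len)

h, d_ff, ctx_len = 4096, 11008, 2048
ar_like = flops_estimate(1, h, d_ff, ctx_len)     # one query token per step (AR-style)
block = flops_estimate(32, h, d_ff, ctx_len)      # a 32-token block of query tokens

print(f"block/AR FLOPs ratio: {block / ar_like:.0f}x")
# Compute grows 32x with Q while per-step weight traffic is roughly unchanged,
# which is why arithmetic intensity surges and DLLM decoding becomes compute-bound.
```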
Researchers observed a strong correlation between attention-derived token importance and token-wise decoding probability, particularly in the early layers of the model. By dynamically focusing computation on decodable tokens and evicting non-decodable ones, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput.
The study defines token importance I_j by aggregating attention weights from all query tokens within a block: I_j = Σ_{i,h} Softmax(MaxPool1D(S_{i,j}^{(h)})), where S_{i,j}^{(h)} is the attention score from query token i to token j in head h. Measurements confirm a clear differentiation between decodable and non-decodable tokens starting from Layer 1, with decoded tokens dominating the attention mass.
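The article gives the aggregation formula but not its implementation details, so the following PyTorch sketch is one plausible reading of it; the pooling window, score layout, and softmax axis are assumptions.

```python
# One plausible implementation of I_j = sum over i,h of Softmax(MaxPool1D(S_ij^(h))).
# The pooling kernel and tensor layout are assumptions, not specified in the article.
import torch
import torch.nn.functional as F

def token_importance(scores, pool_kernel=4):
    """scores: [heads, queries, block_len] attention scores within one block."""
    heads, queries, block_len = scores.shape
    # Smooth scores along the block axis with a 1-D max pool, normalise each query's
    # row with a softmax, then sum over heads and query tokens to get I_j.
    pooled = F.max_pool1d(scores.reshape(heads * queries, 1, block_len),
                          kernel_size=pool_kernel, stride=1,
                          padding=pool_kernel // 2)[..., :block_len]
    weights = F.softmax(pooled.reshape(heads, queries, block_len), dim=-1)
    return weights.sum(dim=(0, 1))   # importance per block position j
```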
The importance delta, defined as the difference in importance between Layer 1 and Layer 0 (ΔI_j = I_j^(Layer 1) − I_j^(Layer 0)), serves as a robust predictor of decodability, acting as a common-mode-rejection mechanism: differencing suppresses the attention component shared by all tokens. Their research demonstrates that standard block-wise processing allocates approximately 90% of its compute to non-decodable tokens, hindering scalable throughput.
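Under the same assumptions, the delta-based predictor is a short wrapper on top of the previous sketch; the top-k selection policy is an assumption, as the article only states that the delta predicts decodability.

```python
# Decodability prediction from the Layer 1 vs Layer 0 importance delta.
# Reuses token_importance from the sketch above; the top-k selection is an assumption.
import torch

def predict_decodable(scores_layer0, scores_layer1, keep_k):
    """Return the keep_k block positions whose importance grows most from Layer 0 to 1."""
    delta = token_importance(scores_layer1) - token_importance(scores_layer0)
    return torch.topk(delta, keep_k).indices
```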
To address this, the researchers introduced FOCUS, a novel inference system designed to dynamically focus computation on decodable tokens and evict those that are not, thereby increasing the effective batch size and alleviating computational limitations. Empirical evaluations confirm that FOCUS achieves throughput improvements of up to 3.52x over the LMDeploy production engine, while maintaining or even enhancing generation quality across various benchmarks.
The system’s proactive filtering of high-confidence, yet incorrect, tokens further contributes to its efficiency and reliability. The authors acknowledge a limitation in that the decodability prediction relies on early-layer attention patterns, and future work could explore more sophisticated predictors to further unlock the efficiency potential of DLLMs.
This work establishes a robust, training-free baseline for efficient DLLM inference, shifting the paradigm from redundant computation to proactive, predictive decoding. By demonstrating substantial gains in throughput without compromising quality, FOCUS offers a practical solution to a key bottleneck in deploying these powerful language models. The findings suggest a promising research direction towards more intelligent and resource-conscious language model architectures, potentially enabling wider accessibility and application of DLLMs.
👉 More information
🗞 FOCUS: DLLMs Know How to Tame Their Compute Bound
🧠 ArXiv: https://arxiv.org/abs/2601.23278
