DSD: Distributed Speculative Decoding Achieves 1.1x Speedup with 9.7% Higher Throughput for Edge-Cloud Large Models

Large language model inference is often slow and difficult to scale across heterogeneous computing environments, from powerful data centres to mobile devices. To address this, Fengze Yu, Leshu Li, and Brad McDanel from Franklin and Marshall College, together with Saiqian Zhang from New York University, present DSD, a distributed speculative decoding framework that significantly accelerates text generation. The approach extends existing techniques by coordinating processing across multiple devices, predicting and pre-computing likely token sequences. Through detailed simulations and an adaptive control policy, the team demonstrates that DSD delivers up to a 1.1x speedup and 9.7% higher throughput compared to current methods, paving the way for more responsive and scalable large language model applications.

Distributed Speculative Decoding Simulation and Adaptation

The research team developed a distributed speculative decoding (DSD) framework to accelerate large language model (LLM) inference across edge and cloud environments, overcoming limitations in both decoding latency and scalability. Recognizing the need for dedicated simulation tools for this distributed approach, the scientists engineered DSD-Sim, a discrete-event simulator that captures the network dynamics, batching processes, and scheduling complexities inherent in multi-device LLM deployments. The simulator models interactions between devices during decoding, exposing performance bottlenecks and optimization opportunities. Building on these insights, the researchers designed an Adaptive Window Control (AWC) policy, a data-driven approach that dynamically adjusts the speculation window size during inference. The policy optimizes throughput by balancing the benefit of a larger speculation window against the cost of rejected predictions, preserving both performance and stability.
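The article does not detail DSD-Sim's internals, but the discrete-event style it describes can be illustrated in a few lines. The sketch below is an assumption-laden toy, not the actual simulator: the `simulate_round_trips` helper, event names, and timing parameters are invented for illustration, modeling each speculation round as draft time on the edge device, half a network round trip, verification on the cloud target, and half a round trip back.

```python
import heapq

def simulate_round_trips(num_rounds, draft_time, verify_time, rtt):
    """Toy discrete-event loop for edge-cloud speculative decoding.

    Each round: the edge drafts tokens (draft_time), ships them to the
    cloud (rtt/2), the target verifies (verify_time), and the verdict
    returns (rtt/2). Returns total simulated wall-clock time.
    """
    clock = 0.0
    events = [(0.0, "draft_start")]  # (timestamp, event kind) min-heap
    rounds_done = 0
    while events and rounds_done < num_rounds:
        clock, kind = heapq.heappop(events)
        if kind == "draft_start":
            # drafting on the edge, then uplink to the cloud target
            heapq.heappush(events, (clock + draft_time + rtt / 2, "verify"))
        elif kind == "verify":
            # verification on the cloud, then downlink back to the edge
            heapq.heappush(events, (clock + verify_time + rtt / 2, "round_done"))
        elif kind == "round_done":
            rounds_done += 1
            if rounds_done < num_rounds:
                heapq.heappush(events, (clock, "draft_start"))
    return clock
```

Even a toy loop like this makes the trade-off visible: with a long round-trip time, fewer, larger speculation rounds amortize the network cost better.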

To rigorously evaluate DSD and AWC, the scientists conducted extensive experiments across diverse workloads, demonstrating that DSD achieves up to a 1.1x speedup and 9.7% higher throughput, significantly improving both latency and scalability compared to conventional methods. The results confirm that combining DSD-Sim for performance modeling with AWC for dynamic control enables agile, scalable LLM serving across heterogeneous edge and cloud infrastructures.

Distributed Speculative Decoding Accelerates Language Models

Scientists developed a distributed speculative decoding (DSD) framework that accelerates large language model (LLM) inference across edge and cloud environments. The work overcomes the single-node limitation of existing techniques by coordinating draft-target execution across multiple devices. Experiments demonstrate that DSD achieves up to a 1.1x speedup and 9.7% higher throughput compared to existing speculative decoding baselines, enabling more agile and scalable LLM serving.
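The coordinated draft-target execution described above follows the general speculative decoding pattern: a small draft model proposes a window of tokens, and the large target model verifies them, keeping the longest agreeing prefix. A minimal greedy-verification sketch follows; the `speculative_step` helper and the `target_next` callback interface are assumptions for illustration, not the paper's API.

```python
def speculative_step(prefix, draft_tokens, target_next):
    """One speculation round under greedy verification.

    The draft model proposes draft_tokens; the target model is queried via
    target_next(sequence) -> next token. Draft tokens are accepted
    left-to-right while the target agrees; on the first mismatch, the
    target's own token is kept and the rest of the window is discarded.
    """
    out = list(prefix)
    for tok in draft_tokens:
        t = target_next(out)
        out.append(t)          # the target's token is always correct to keep
        if t != tok:
            break              # rejection: drop the remaining draft tokens
    return out
```

For example, with a toy target that always emits the next integer, a draft window of `[2, 3, 9, 4]` after prefix `[0, 1]` yields three accepted tokens: the target agrees on 2 and 3, then corrects 9 to 4. One verification round can thus commit several tokens, which is where the speedup comes from.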

To simulate this distributed paradigm, researchers introduced DSD-Sim, a discrete-event simulator that captures network dynamics, batching processes, and scheduling considerations. Building on insights from the simulator, the team designed an Adaptive Window Control (AWC) policy that dynamically adjusts the speculation window size to optimize throughput. The AWC policy uses a Window Control DNN (WC-DNN) that receives, from a performance analyzer, a feature vector encoding system state: queue depth utilization, recent token acceptance rate, per-link round-trip time statistics, time per output token, and the prior speculation window size. The WC-DNN predicts the optimal speculation window size and is trained with supervised regression using an L1 loss and the AdamW optimizer for 100 epochs. To ensure stable execution, the team clamps window-size predictions, applies exponential smoothing with a smoothing factor of 0.4, and introduces hysteresis for mode switching, preventing rapid fluctuations in the predicted window size.
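The three stabilization steps named above (clamping, exponential smoothing with factor 0.4, and hysteresis) can be sketched directly. The function name, window bounds, and hysteresis threshold below are illustrative assumptions rather than values from the paper; only the smoothing factor of 0.4 comes from the text.

```python
def stabilize_window(raw_pred, prev_window, w_min=1, w_max=8,
                     alpha=0.4, hysteresis=0.5):
    """Post-process a raw WC-DNN window prediction.

    1. Clamp the prediction to a valid window range [w_min, w_max].
    2. Exponentially smooth it against the previous window
       (alpha = 0.4, the smoothing factor cited in the article).
    3. Apply hysteresis: ignore changes smaller than the threshold so
       noisy predictions do not flip the window back and forth.
    """
    clamped = min(max(raw_pred, w_min), w_max)
    smoothed = alpha * clamped + (1 - alpha) * prev_window
    if abs(smoothed - prev_window) < hysteresis:
        return prev_window          # change too small: hold the current window
    return int(round(smoothed))     # window sizes are integral token counts
```

For instance, an outlier prediction of 8 while the current window is 4 moves the window only to 6 after smoothing, while a prediction of 4.5 is absorbed entirely by the hysteresis band and leaves the window at 4.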


👉 More information
🗞 DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
🧠 ArXiv: https://arxiv.org/abs/2511.21669

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
