RAPID-Serve Achieves 4.1x LLM Inference Speedup with Intra-GPU Disaggregation

Researchers are tackling a critical bottleneck in large language model (LLM) inference: efficiently balancing throughput, latency, and resource utilisation. Amna Masood, Pratishtha Gaur, and Nuwan Jayasena, all from Advanced Micro Devices, present RAPID-Serve, a novel technique designed to concurrently execute the compute-intensive prefill and bandwidth-limited decode stages of LLM serving on the same GPU. This approach overcomes the limitations of existing hybrid batching and disaggregated serving methods, delivering up to a 4.1x unconstrained throughput improvement and, when operating under strict service level objectives, gains of up to 32x. RAPID-Serve's adaptive resource management, which can optionally use CU masking on Instinct GPUs, promises a substantial leap forward for resource-constrained LLM deployment scenarios.

Concurrent LLM Inference on Single GPUs

Scientists have unveiled RAPID-Serve, a novel technique for accelerating large language model (LLM) inference by concurrently executing the prefill and decode phases on the same GPU(s). This breakthrough addresses limitations found in current hybrid batching and disaggregated serving methods, which often struggle to balance low latency with high throughput and efficient resource utilisation. The research team achieved this by enabling controlled intra-GPU concurrency, allowing prefill and decode to progress simultaneously without the overheads associated with traditional disaggregation or the latency inflation of hybrid batching. Experiments demonstrate that RAPID-Serve significantly improves performance, particularly in resource-constrained environments, by intelligently allocating compute resources to each phase based on workload demands.
The core innovation lies in RAPID-Serve’s ability to break the strict phase coupling inherent in hybrid batching, thereby reducing inter-token latency. Unlike existing systems, this approach avoids combining prefill and decode tokens into a single batch, allowing each phase to advance independently. Furthermore, the team developed Adaptive Resource Management, a dynamic allocation system that can optionally leverage Compute Unit (CU) masking on AMD Instinct™ GPUs for fine-grained resource partitioning. This allows the system to tailor compute resources to the specific needs of each phase, minimising interference and ensuring adherence to strict latency Service Level Objectives (SLOs).
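The paper itself does not ship code, but the core scheduling idea can be sketched in a few lines: issue the prefill batch and the decode batch on separate GPU streams so that neither phase waits for the other's iteration, rather than fusing their tokens into one hybrid batch. The sketch below uses PyTorch (which also targets ROCm); `model`, `prefill_batch`, and `decode_batch` and their HuggingFace-style call signature are assumptions for illustration, not RAPID-Serve's implementation, and the adaptive CU partitioning described above would sit underneath this loop.

```python
# Minimal sketch of intra-GPU prefill/decode overlap: two streams let the
# compute-bound prefill and bandwidth-bound decode kernels run concurrently.
# `model`, `prefill_batch`, and `decode_batch` are hypothetical placeholders.
import torch

prefill_stream = torch.cuda.Stream()
decode_stream = torch.cuda.Stream()

def overlapped_step(model, prefill_batch, decode_batch):
    with torch.cuda.stream(prefill_stream):
        # Compute-bound: build the KV-cache for newly admitted requests.
        prefill_out = model(prefill_batch.input_ids, use_cache=True)
    with torch.cuda.stream(decode_stream):
        # Bandwidth-bound: generate one token for every in-flight request.
        decode_out = model(decode_batch.next_ids,
                           past_key_values=decode_batch.kv_cache,
                           use_cache=True)
    torch.cuda.synchronize()  # join both phases before scheduling the next step
    return prefill_out, decode_out
```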

The result is a system that delivers both high throughput and predictable low latency, even under demanding workloads. Detailed analysis reveals that RAPID-Serve provides up to a 4.1x (with an average of 1.7x) unconstrained throughput improvement compared to state-of-the-art approaches. More impressively, under SLO constraints, the system achieves throughput improvements of 32x or more in the best case, averaging 4.9x. These gains are particularly significant for interactive LLM applications, such as coding assistants and chatbots, where maintaining low inter-token latency is crucial for a seamless user experience.

The research establishes that by overlapping prefill and decode, and dynamically managing compute resources, it is possible to overcome the limitations of existing LLM serving systems and unlock substantial performance improvements. This work goes beyond simply improving throughput; it also focuses on efficient resource utilisation. By eliminating KV-cache transfers and maintaining batching efficiency, RAPID-Serve minimises memory pressure and maximises the use of available GPU resources. The team conducted a thorough analysis of the performance characteristics of both prefill and decode phases, identifying their distinct resource demands and tailoring the resource allocation strategy accordingly. This profiling-driven approach ensures that compute resources are allocated optimally, resulting in a system that delivers high performance with minimal overhead. The research opens exciting possibilities for deploying LLMs in a wider range of applications and environments, particularly where resource constraints are a concern.

The Scientists' Method

Scientists developed RAPID-Serve, a novel technique for large language model (LLM) inference serving that concurrently executes prefill and decode phases on the same GPU(s) to optimise both latency and throughput. The research addresses limitations found in hybrid batching and disaggregated serving, two currently widespread approaches, by enabling parallel processing without the drawbacks of increased latency or resource underutilisation. Experiments employed a system where prefill and decode phases of different requests run simultaneously, avoiding batching into a single execution unit, thus allowing independent progress for each phase. The study pioneered an adaptive resource management system, dynamically allocating compute resources to prefill and decode based on workload intensity.
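The allocation policy itself is not reproduced in this article, so the following is only an illustrative heuristic for splitting compute units (CUs) between the two phases according to workload intensity; the backlog-proportional weighting, the per-request decode weight of 64 tokens, and the reserved-CU floor are assumed values, not RAPID-Serve's actual policy.

```python
# Illustrative heuristic (not the paper's policy): give prefill a CU share
# proportional to its token backlog, while reserving a floor of CUs so that
# decode can keep meeting its inter-token-latency budget.
def split_cus(total_cus, pending_prefill_tokens, decode_batch_size,
              min_decode_cus=16):
    if pending_prefill_tokens == 0:
        return 0, total_cus                       # decode gets every CU
    # Weight prefill backlog against in-flight decode work (64 is an assumed
    # per-request decode weight, chosen only for illustration).
    prefill_weight = pending_prefill_tokens / (
        pending_prefill_tokens + 64 * decode_batch_size)
    prefill_cus = min(int(prefill_weight * total_cus),
                      total_cus - min_decode_cus)
    return prefill_cus, total_cus - prefill_cus

# An AMD Instinct MI300X exposes 304 CUs; a 4K-token prefill backlog against
# 32 decoding requests yields roughly a 2:1 split in favour of prefill.
print(split_cus(total_cus=304, pending_prefill_tokens=4096, decode_batch_size=32))
```

On Instinct-class GPUs such a split could be enforced with the CU masking mentioned above; on other hardware it could simply bound how much work each phase is allowed to launch per iteration.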

This allocation model circumvents the rigid constraints of traditional disaggregation, eliminating KV-cache transfer overheads and maintaining batching efficiency, ultimately improving serving performance and resource utilisation. The researchers analysed the performance characteristics of existing serving solutions, quantifying the overheads stemming from prefill-decode coupling and KV-cache transfers to establish a baseline for comparison. This analysis revealed distinct performance characteristics and requirements for each phase of LLM inference, informing the design of RAPID-Serve. The system delivers up to a 4.1x unconstrained throughput improvement (1.7x on average) and, under SLO constraints, throughput gains of 32x or more in the best case (4.9x on average), a significant advancement in LLM serving, particularly in resource-constrained environments.

The work also details how LLM inference consists of two primary phases: prefill, which initialises the KV-cache, and decode, which generates output tokens auto-regressively. Prefill is compute-intensive and benefits from GPU parallelism, while decode is more sensitive to memory bandwidth and cache reuse, operating sequentially over the KV-cache. The per-request KV-cache size, crucial for decode, is calculated as 2·L·S·H·D·E, where L is the number of layers, S the sequence length, H the number of heads, D the per-head dimension, and E the size of each element in bytes; total memory consumption therefore grows linearly with both sequence length and batch size.
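Plugging representative numbers into that formula makes the decode-phase memory pressure concrete. The Llama-2-7B-style dimensions and FP16 element size in the sketch below are illustrative assumptions, not figures from the paper.

```python
# Worked example of the per-request KV-cache size 2*L*S*H*D*E, with L layers,
# sequence length S, H heads, per-head dimension D, and E bytes per element.
# The model dimensions are illustrative (roughly Llama-2-7B, FP16 activations).
def kv_cache_bytes(layers, seq_len, heads, head_dim, bytes_per_elem=2):
    return 2 * layers * seq_len * heads * head_dim * bytes_per_elem

per_request = kv_cache_bytes(layers=32, seq_len=4096, heads=32, head_dim=128)
print(f"{per_request / 2**30:.1f} GiB per 4K-token request")                # 2.0 GiB
print(f"{64 * per_request / 2**30:.0f} GiB for a 64-request decode batch")  # 128 GiB
```

This linear scaling with sequence length and batch size is why decode becomes bound by memory capacity and bandwidth long before it saturates the GPU's compute units.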

RAPID-Serve significantly boosts LLM inference throughput and cuts latency

Scientists have developed RAPID-Serve, a novel technique achieving up to 4.1x unconstrained throughput improvement in LLM inference serving systems, with an average gain of 1.7x. The research addresses limitations in existing methods like hybrid batching and disaggregated serving, demonstrating a significant advancement in resource utilisation and latency reduction. Experiments revealed that RAPID-Serve concurrently executes prefill and decode phases on the same GPU(s), successfully meeting stringent latency Service Level Objectives (SLOs) while maintaining high throughput. Adaptive Resource Management, optionally utilising Compute Unit (CU) masking on Instinct™ GPUs, further enhances performance.

The team measured substantial throughput improvements under SLO constraints, recording gains of 32x and higher, averaging 4.9x, compared to state-of-the-art approaches. This breakthrough is particularly impactful in resource-constrained environments, where efficient resource allocation is critical. Analysis of hybrid batching showed a 20% throughput improvement with a 1K-token chunk size, although this came at the cost of a 30% increase in inter-token latency (ITL) compared to a 512-token chunk size. These results demonstrate the inherent trade-off between throughput and latency in hybrid batching systems, necessitating careful orchestration of chunk sizes.
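That trade-off can be made concrete with a toy cost model of chunked-prefill hybrid batching: each hybrid iteration serves one decode step plus one prefill chunk, so a larger chunk drains a new prompt in fewer iterations (better throughput) but stretches every iteration (worse ITL). The constants below are assumed purely for illustration and are not measurements from the paper.

```python
# Toy cost model of chunked-prefill hybrid batching (assumed constants, not
# measured values): iteration time = fixed decode cost + per-token prefill cost.
def hybrid_iteration_ms(chunk_tokens, decode_ms=20.0, prefill_us_per_token=15.0):
    return decode_ms + chunk_tokens * prefill_us_per_token / 1000.0

for chunk in (512, 1024):
    itl = hybrid_iteration_ms(chunk)
    iterations = 4096 / chunk  # iterations needed to prefill a 4K-token prompt
    print(f"chunk={chunk:4d}: ITL ≈ {itl:.1f} ms, "
          f"4K-token prompt admitted over {iterations:.0f} iterations")
```

With these placeholder constants, doubling the chunk size inflates ITL by roughly 28% while halving the iterations needed to admit a new prompt, mirroring the direction of the reported trade-off.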

Further investigations into disaggregated serving revealed KV cache transfer overheads of 1.4x for throughput and 1.9x for Time-To-First-Token (TTFT) when evaluated on a node with 8 AMD Instinct™ MI300X GPUs. The study highlighted that these overheads are more pronounced for smaller prompts, limiting the potential benefits of disaggregation in certain scenarios. Moreover, the research quantified memory underutilisation in disaggregated systems, showing up to 50% of memory capacity remaining unused, potentially leading to a 50% throughput cost in memory-capacity limited workloads. RAPID-Serve’s ability to break the lock-step dependence between prefill and decode minimises ITL while sustaining high throughput, offering a compelling alternative to both hybrid batching and disaggregated serving. The work establishes a new benchmark for LLM inference, paving the way for more efficient and responsive AI applications. This innovative approach promises to unlock significant performance gains and reduce resource consumption in a wide range of deployment scenarios.

RAPID-Serve boosts LLM inference performance

Scientists have developed RAPID-Serve, a novel technique for large language model (LLM) inference serving that concurrently executes prefill and decode operations on the same GPU(s). This approach aims to overcome limitations found in existing methods like hybrid batching and disaggregated serving, offering improvements in both throughput and latency. Unlike hybrid batching, which combines requests and risks exceeding latency targets, RAPID-Serve overlaps prefill and decode without merging them into a single phase, maintaining high throughput while adhering to service level objectives (SLOs). Furthermore, RAPID-Serve avoids the key-value (KV) cache transfer overhead and resource underutilisation associated with disaggregated serving, a significant advantage in resource-constrained environments.

Evaluations demonstrate substantial performance gains, with up to 4.1x improvement in unconstrained throughput and up to 32x improvement in throughput under SLO constraints, indicating the effectiveness of concurrent execution for efficient GPU resource utilisation. Adaptive Resource Management, optionally utilising Compute Unit masking, further balances prefill and decode to meet both time-to-first-token (TTFT) and inter-token latency (ITL) requirements. The key achievement of this research is the demonstration of a serving technique that effectively balances throughput and latency for LLM inference. By concurrently executing prefill and decode, RAPID-Serve offers a compelling alternative to current state-of-the-art methods, particularly where resources are limited. The authors acknowledge that performance gains may vary depending on the specific model and hardware configuration. Future work could explore the application of RAPID-Serve to even larger models and investigate the potential for further optimisation through advanced resource scheduling algorithms.

👉 More information
🗞 RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation
🧠 ArXiv: https://arxiv.org/abs/2601.11822

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
