Hybrid CPU-GPU Scheduling Boosts Large Language Model Inference Speed

APEX, a new scheduling strategy, speeds up large language model inference by optimising how compute is distributed between CPUs and GPUs. Evaluations with LLaMa models show throughput improvements of 11% to 96% on T4 and A10 GPUs while maintaining comparable latency, and APEX outperforms existing hybrid schedulers, particularly for long output sequences.

The increasing demand for large language model (LLM) applications necessitates efficient deployment on hardware with limited resources. A significant constraint is the substantial memory requirement of the ‘KV cache’ – the stored attention keys and values from previously processed tokens that auto-regressive decoding reuses at every step – which often exceeds the capacity of available graphics processing units (GPUs). Researchers are exploring hybrid CPU-GPU execution, offloading part of the computation to the central processing unit, but effective scheduling to maximise parallelism remains a challenge. Now, Jiakun Fan, Yanglin Zhang, Xiangchen Li, and Dimitrios S. Nikolopoulos present APEX, a profiling-informed scheduling strategy designed to optimise CPU-GPU overlap during LLM inference. Their work, titled ‘Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs’, details a dynamic dispatching system that predicts execution times to enhance throughput and maintain latency, particularly on lower-cost hardware.

Optimised Hybrid Scheduling for Large Language Model Inference

Deploying large language models (LLMs) for real-time applications presents significant challenges, frequently constrained by the limited memory capacity of graphics processing units (GPUs). During auto-regressive decoding – the process by which LLMs generate text one token at a time – the key-value (KV) cache, which stores the attention keys and values computed for previous tokens, grows with every token generated, exacerbating this limitation. Hybrid execution, which offloads KV cache management and part of the attention computation to the central processing unit (CPU), offers a potential solution. However, existing scheduling strategies often fail to effectively parallelise CPU and GPU tasks during decoding.
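For a sense of scale, the KV cache footprint can be estimated directly from a model’s dimensions. The sketch below assumes LLaMa-2-7B-style dimensions and fp16 storage; the exact figures are illustrative rather than taken from the paper.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Two tensors (keys and values) are cached per layer for every token."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Assumed LLaMa-2-7B-style dimensions: 32 layers, 32 KV heads of size 128, fp16 values.
gib = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32,
                     n_kv_heads=32, head_dim=128) / 2**30
print(f"~{gib:.0f} GiB of KV cache")  # ~16 GiB, before counting the model weights
```

At that size the cache alone rivals the entire 16 GB of a T4, which is why spilling it to host memory becomes attractive despite the slower CPU.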

Researchers have introduced APEX, a novel scheduling strategy designed to maximise CPU-GPU parallelism during hybrid LLM inference. Unlike systems relying on static rules or heuristics, APEX dynamically distributes computation across heterogeneous resources by predicting the execution times of both CPU and GPU subtasks. This enables optimal overlap and minimises scheduling overhead.
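The paper’s scheduler is not reproduced here, but the underlying idea can be sketched: offline profiling yields simple cost models for each device, and at every decoding step the dispatcher picks the CPU/GPU split that minimises the longer of the two predicted times, since the two parts run in parallel. All names and coefficients below are illustrative assumptions, not APEX’s actual interfaces.

```python
# Illustrative profiling-informed dispatch (a sketch, not the APEX implementation).

def predict_gpu_ms(tokens_on_gpu: int, a: float = 0.004, b: float = 1.2) -> float:
    """Linear cost model fitted from offline GPU profiles (coefficients assumed)."""
    return a * tokens_on_gpu + b

def predict_cpu_ms(tokens_on_cpu: int, a: float = 0.020, b: float = 0.8) -> float:
    """Linear cost model fitted from offline CPU profiles (coefficients assumed)."""
    return a * tokens_on_cpu + b

def choose_split(total_cached_tokens: int, step: int = 256) -> int:
    """Return how many cached tokens' attention to keep on the GPU this step."""
    best_share, best_makespan = total_cached_tokens, float("inf")
    for gpu_share in range(0, total_cached_tokens + 1, step):
        # The two partial attentions overlap, so the step takes max(cpu, gpu) time.
        makespan = max(predict_gpu_ms(gpu_share),
                       predict_cpu_ms(total_cached_tokens - gpu_share))
        if makespan < best_makespan:
            best_share, best_makespan = gpu_share, makespan
    return best_share

print(choose_split(8192))  # with these assumed coefficients, most tokens stay on the GPU
```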

Evaluations using LLaMa-2-7B and LLaMa-3-8B models on Nvidia T4 and A10 GPUs demonstrate substantial performance gains. APEX achieves throughput improvements ranging from 11% to 96% compared with GPU-only schedulers such as vLLM, while maintaining comparable latency. Throughput here is the number of tokens (units of text) the system generates per second.

APEX surpasses the performance of existing hybrid schedulers, delivering up to 49% (T4) and 37% (A10) higher throughput in scenarios requiring extended output sequences. This improvement stems from APEX’s profiling-informed approach, which accurately predicts the execution times of CPU and GPU subtasks, and enables a more granular and efficient allocation of resources. The system’s dynamic scheduling minimises overhead and ensures optimal overlap between heterogeneous compute elements.

Current approaches often struggle to effectively overlap CPU-offloaded tasks with GPU processing during the bandwidth-constrained decoding phase. APEX dynamically dispatches compute across heterogeneous resources, maximising parallelism and mitigating performance bottlenecks, particularly in scenarios involving long-output generation or memory pressure.
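Once the split is chosen, the overlap itself is conceptually simple: the GPU kernels for one part of attention are launched asynchronously while a host thread works through the offloaded part. A minimal PyTorch-flavoured sketch follows, in which gpu_attention, cpu_attention and merge_partial_attention are hypothetical placeholders rather than APEX or vLLM APIs.

```python
# Sketch of per-step CPU-GPU overlap (placeholder functions, not real APEX/vLLM APIs).
from concurrent.futures import ThreadPoolExecutor
import torch

cpu_pool = ThreadPoolExecutor(max_workers=1)

def decode_step(gpu_inputs, cpu_inputs):
    # Start the CPU share of attention on a worker thread...
    cpu_future = cpu_pool.submit(cpu_attention, cpu_inputs)  # host-side attention over offloaded KV
    # ...while the GPU share is enqueued; CUDA launches return immediately,
    # so the host stays free while the kernels run.
    gpu_out = gpu_attention(gpu_inputs)
    cpu_out = cpu_future.result()    # join the CPU share
    torch.cuda.synchronize()         # make sure the GPU share has finished
    return merge_partial_attention(gpu_out, cpu_out)  # combine the partial attention outputs
```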

Future work will focus on integrating APEX with various LLM frameworks and investigating its performance on different hardware platforms. Developing adaptive scheduling strategies that dynamically adjust to changing workload characteristics is also planned, alongside exploration of reinforcement learning to optimise scheduling decisions and further improve performance.

👉 More information
🗞 Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs
🧠 DOI: https://doi.org/10.48550/arXiv.2506.03296
