The increasing prevalence of large language models (LLMs), exemplified by applications such as ChatGPT, necessitates efficient systems for their deployment. These models process requests in an autoregressive manner, generating output sequentially based on previous tokens, which presents unique challenges for high-throughput and low-latency inference. Recent years have seen the development of specialised inference systems designed to address these challenges, employing techniques ranging from optimised kernel design to sophisticated memory management. A comprehensive analysis of these systems, however, has been lacking. James Pan and Guoliang Li present “A Survey of LLM Inference Systems”, a detailed review of the operators, algorithms, and techniques used in modern LLM inference, examining how these components combine to form both single-replica and multi-replica deployments, including disaggregated and serverless architectures, and outlining areas for future research.
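To make the autoregressive loop concrete, here is a minimal decoding sketch in Python. It is an illustration only, not code from the survey: `forward_fn` stands in for a real LLM forward pass that returns next-token logits, and greedy selection stands in for a real sampling strategy.

```python
# Minimal sketch of autoregressive decoding (illustrative only; `forward_fn`
# is a hypothetical stand-in for an LLM forward pass returning logits).
def generate(forward_fn, prompt_tokens, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = forward_fn(tokens)                                    # logits over the vocabulary
        next_token = max(range(len(logits)), key=logits.__getitem__)   # greedy pick
        tokens.append(next_token)                                      # output feeds back as input
        if next_token == eos_id:                                       # stop at end-of-sequence
            break
    return tokens

# Toy usage with a fake "model" that always favours token 2 (the EOS id here).
print(generate(lambda toks: [0.1, 0.2, 0.7], [5, 7], max_new_tokens=8, eos_id=2))
```

Because each step depends on all previously generated tokens, requests cannot be parallelised along the output dimension, which is why the systems surveyed invest so heavily in batching, scheduling, and cache management.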
Recent years have seen a pronounced increase in publications on Large Language Model (LLM) inference systems, signalling accelerating research activity in this critical area of computer science. The survey's literature review records a single publication in 2020, rising to six in 2022, nine in 2023, and thirteen in 2024, with two more already reported for 2025, out of thirty-three publications examined in total. This trajectory points to a rapidly maturing field, driven both by academic inquiry and by the growing demands of practical deployment across diverse industries.
The surveyed literature consistently focuses on the unique challenges presented by the autoregressive nature of LLM request processing. Autoregressive models generate output sequentially, predicting each subsequent ‘token’ (the basic unit of text) from the tokens that precede it. This necessitates efficient techniques for request handling, optimisation, and execution to achieve acceptable performance and scalability. Researchers address kernel design, batching, and scheduling to improve performance, alongside sophisticated memory-management strategies such as paged memory and eviction policies, which shrink the memory footprint and reduce access times.
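The paged-memory idea can be illustrated with a small sketch: the key-value cache of each request is stored in fixed-size blocks drawn from a shared pool, and a per-request block table maps the request's tokens to physical blocks, so memory is allocated on demand rather than reserved for the maximum sequence length. The class below is a hypothetical illustration inspired by paged-attention designs, not code from any surveyed system.

```python
# Sketch of paged KV-cache bookkeeping (hypothetical): tokens are grouped into
# fixed-size blocks allocated from a shared pool and freed when a request ends.
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # request id -> list of block ids

    def append_token(self, request_id, token_index):
        """Ensure the request has a block for its `token_index`-th token."""
        table = self.block_tables.setdefault(request_id, [])
        if token_index % self.block_size == 0:       # current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; evict or preempt a request")
            table.append(self.free_blocks.pop())     # allocate one more block

    def release(self, request_id):
        """Return all blocks of a finished (or evicted) request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

cache = PagedKVCache(num_blocks=4, block_size=16)
for i in range(40):                                  # 40 tokens -> only 3 blocks of 16
    cache.append_token("req-1", i)
cache.release("req-1")
```

An eviction policy would decide which request's blocks to reclaim when `free_blocks` runs dry; the sketch simply raises an error at that point.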
A central theme emerging from the reviewed publications is the reliance on load prediction, adaptive mechanisms, and cost reduction as foundational principles for overcoming the inherent difficulties of autoregressive generation. Effective LLM inference systems depend on forecasting workload demands accurately, adjusting resources dynamically as needs fluctuate, and minimising computational expense so that deployments scale. These elements are crucial for achieving high throughput, low latency, and good resource utilisation in demanding real-world applications.
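As a hedged illustration of how load prediction can drive an adaptive decision, the sketch below admits requests into a batch only while a predicted token load fits within a memory budget. The predictor, the budget, and all numbers are invented for the example.

```python
# Hypothetical sketch: admit requests into a batch while the *predicted* total
# token load stays under a memory budget (predictor and numbers are illustrative).
def admit_requests(queue, predict_output_len, token_budget):
    batch, used = [], 0
    for request in queue:
        predicted = len(request["prompt"]) + predict_output_len(request)
        if used + predicted > token_budget:   # adaptive cut-off based on the forecast
            break
        batch.append(request)
        used += predicted
    return batch

queue = [{"prompt": [1] * 100}, {"prompt": [1] * 300}, {"prompt": [1] * 500}]
batch = admit_requests(queue, predict_output_len=lambda r: 128, token_budget=800)
print(len(batch))  # -> 2: the third request's predicted load would exceed the budget
```

A mispredicted output length either wastes reserved memory or forces preemption later, which is why accurate forecasting features so prominently in the surveyed work.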
The survey details how individual techniques coalesce into complete inference systems, encompassing both single-replica and multi-replica configurations, each offering distinct advantages depending on the specific deployment scenario. Research also explores disaggregated systems, which offer granular control over resource allocation, and serverless architectures, leveraging shared infrastructure for efficient scaling.
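One common form of disaggregation separates the compute-heavy prefill phase from the memory-bound decode phase onto different worker pools. The routing sketch below is a simplified, hypothetical illustration of that split; the worker names and round-robin policy are made up for the example.

```python
from collections import deque

# Hypothetical sketch of a disaggregated deployment: prefill (compute-bound) and
# decode (memory-bound) run on separate worker pools, and the request's KV cache
# is handed off between them after prefill. Worker names are invented.
prefill_workers = deque(["prefill-0", "prefill-1"])
decode_workers = deque(["decode-0", "decode-1", "decode-2"])

def pick(pool):
    worker = pool[0]
    pool.rotate(-1)                            # simple round-robin over the pool
    return worker

def handle_request(prompt):
    prefill_node = pick(prefill_workers)       # builds the KV cache for the prompt
    decode_node = pick(decode_workers)         # generates tokens from the transferred cache
    return prefill_node, decode_node

print(handle_request("Explain paged attention."))  # ('prefill-0', 'decode-0')
```

Sizing the two pools independently is what gives disaggregated designs their granular control over resource allocation, at the cost of transferring the cache between nodes.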
Researchers investigate improved kernel designs that process LLM operations faster and reduce latency. Batching strategies group multiple requests together, increasing throughput and improving resource utilisation, while scheduling algorithms prioritise and manage requests so that performance holds up under varying workloads.
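A minimal sketch of how batching and scheduling interact: a server loop that refills the running batch from a waiting queue after every decode step (often called continuous or iteration-level batching), so finished requests free their slots immediately. The `decode_step` callback and the toy requests are hypothetical.

```python
from collections import deque

# Hypothetical sketch of iteration-level ("continuous") batching: after each
# decode step, finished requests leave the batch and waiting ones are admitted.
def serve(waiting, decode_step, max_batch_size):
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())              # fill free batch slots at once
        running = [req for req in decode_step(running) if not req["done"]]
        steps += 1
    return steps

# Toy decode step: each request needs a fixed number of iterations to finish.
def decode_step(batch):
    for req in batch:
        req["remaining"] -= 1
        req["done"] = req["remaining"] <= 0
    return batch

waiting = deque({"id": i, "remaining": i + 1, "done": False} for i in range(5))
print(serve(waiting, decode_step, max_batch_size=2))       # number of decode iterations
```

Compared with static batching, which waits for the whole batch to finish before admitting new work, this keeps the accelerator busy when requests have very different output lengths.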
The surveyed literature also highlights the importance of hardware acceleration, using specialised processors such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) to speed up LLM computations. Researchers explore quantization techniques, which reduce the precision of model parameters to shrink memory usage and improve inference speed. Model parallelism and data parallelism distribute the model and the data, respectively, across multiple devices, enabling inference at scales a single device cannot support. Pruning removes redundant parameters from the model, reducing its size and complexity.
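To make the quantization point concrete, here is a small, self-contained sketch of symmetric 8-bit weight quantization with one scale per tensor. This is a generic illustration, not the specific scheme of any surveyed system.

```python
# Generic sketch of symmetric int8 quantization: store weights as 8-bit integer
# codes plus one float scale, and dequantize on the fly at inference time.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0     # guard against all-zero weights
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.0, 0.98]
codes, scale = quantize_int8(weights)
print(codes)                            # 8-bit codes, roughly 4x smaller than float32
print(dequantize_int8(codes, scale))    # close to the originals, within one quantization step
```

The trade-off is a small approximation error in exchange for a much smaller memory footprint and faster memory-bound decode steps.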
Future research directions focus on developing more efficient and scalable inference algorithms, exploring novel hardware architectures, and optimising the entire LLM pipeline. Researchers aim to reduce the computational cost and memory footprint of LLMs, enabling their deployment on resource-constrained devices. They also investigate techniques for improving the robustness and reliability of LLMs, ensuring consistent performance under varying conditions. The development of automated optimisation tools and frameworks will further accelerate the deployment and scaling of LLMs in real-world applications.
The increasing demand for LLMs across various industries drives the need for continuous innovation in inference techniques and hardware acceleration. Researchers and engineers collaborate to overcome the challenges and unlock the full potential of LLMs, enabling them to power a wide range of applications, from natural language processing and machine translation to image recognition and robotics. The ongoing research and development efforts promise to deliver more efficient, scalable, and reliable LLM solutions, transforming the way we interact with technology and solve complex problems.
👉 More information
🗞 A Survey of LLM Inference Systems
🧠 DOI: https://doi.org/10.48550/arXiv.2506.21901
