Heterogeneous SoCs Effectively Schedule Real-Time Generative AI Workloads

The increasing demand for real-time generative AI, such as in video conferencing and gaming, presents significant challenges for modern computing systems, requiring both immense processing power and strict adherence to timing constraints. Rachid Karami of the University of California, Irvine, Rajeev Patwari of Advanced Micro Devices Inc., and Hyoukjun Kwon, also of the University of California, Irvine, along with their colleagues, investigate how best to manage these demanding workloads on the latest generation of computer chips. Their research focuses on heterogeneous systems, which combine different types of processors (CPUs, GPUs, and NPUs) to maximize performance, and explores how scheduling tasks across these diverse components impacts real-time responsiveness and the quality of AI-generated content. The team’s comprehensive analysis, conducted on AMD’s Ryzen AI platform, reveals that scheduling decisions dramatically affect performance, with observed differences of over 40% in critical metrics, and demonstrates the need for intelligent, dynamic scheduling strategies to unlock the full potential of on-device, real-time generative AI applications.

Large Language Model Inference Acceleration Techniques

This document explores techniques for accelerating and optimizing the performance of large language models (LLMs) and other machine learning models, particularly for real-time and interactive applications. The central motivation is to overcome the significant computational demands of these models, enabling their deployment in resource-constrained environments like mobile devices and for applications requiring minimal delay, such as video conferencing and virtual reality. Researchers address key challenges: the immense computational cost of LLMs, which stems from their billions of parameters; the need for low latency in real-time applications; and the limited resources of edge devices. Effectively utilizing diverse hardware components, such as CPUs, GPUs, and specialized neural processing units (NPUs), adds further complexity, demanding careful orchestration and optimization. The inference phase, where a trained model processes new data, is often the primary bottleneck, as it requires repeated calculations for each input token or data point.

Reducing model size and precision through techniques like quantization is crucial, but maintaining accuracy remains paramount. Quantization reduces the number of bits used to represent model weights and activations, thereby decreasing memory usage and computational cost. For example, transitioning from 32-bit floating-point numbers to 8-bit integers can significantly reduce model size with minimal accuracy loss, provided the quantization process is carefully calibrated. Techniques like Activation-Aware Weight Quantization (AWQ) show particular promise, as they focus on preserving the most important activations during quantization, mitigating potential accuracy degradation. Post-training quantization is simpler to implement but may yield lower accuracy than quantization-aware training, where the model is trained with quantized weights and activations from the outset. The choice of quantization strategy depends on the specific model, dataset, and performance requirements.
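As a concrete illustration, the sketch below (plain NumPy, not tied to any particular framework or to the paper's setup) shows the arithmetic behind symmetric post-training quantization: a float32 weight tensor is mapped to int8 through a single scale factor, and the round-trip error approximates the accuracy cost that careful calibration and methods like AWQ aim to minimize.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(weights).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for one layer of a model.
w = np.random.randn(512, 512).astype(np.float32)
q, scale = quantize_int8(w)

# The int8 tensor is 4x smaller; the round-trip error is the accuracy cost.
err = np.abs(w - dequantize(q, scale)).mean()
print(f"scale={scale:.5f}, mean abs error={err:.5f}")
```

Real deployments typically refine this with per-channel scales and calibration data, but the size/accuracy trade-off is already visible in this minimal form.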

Model compression methods, including pruning and knowledge distillation, further reduce model size. Pruning removes unimportant connections or neurons from the network, reducing the number of parameters without significantly impacting performance. Knowledge distillation transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). This allows the student model to achieve comparable performance to the teacher model with fewer parameters.

Heterogeneous execution strategically offloads different parts of the model to the most suitable processor, maximizing performance. For instance, computationally intensive layers might be executed on a GPU, while less demanding layers run on a CPU. The ONNX Runtime framework plays a key role in enabling this flexible execution across various hardware platforms. ONNX (Open Neural Network Exchange) provides a standardized format for representing machine learning models, facilitating interoperability between different frameworks and hardware. Specific execution providers, such as DirectML for Windows and Vitis AI for Xilinx/AMD devices, further enhance performance by leveraging the specific capabilities of the underlying hardware. Dynamic batching adjusts the number of processed items to optimize throughput and latency, balancing the trade-off between processing multiple items simultaneously and minimizing the delay for each individual item.
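A minimal sketch of how this looks in practice with ONNX Runtime is shown below. The model file and input shape are assumptions for illustration; the provider names are ONNX Runtime's real identifiers for the Vitis AI, DirectML, and CPU execution providers, listed in priority order.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical ONNX model file; replace with a real exported model.
MODEL_PATH = "model.onnx"

# Provider order expresses priority: ONNX Runtime assigns each graph node to the
# first listed provider that supports it and falls back to the CPU otherwise.
preferred = ["VitisAIExecutionProvider",  # AMD/Xilinx NPUs and adaptive SoCs
             "DmlExecutionProvider",      # DirectML (GPU acceleration on Windows)
             "CPUExecutionProvider"]      # always-available fallback
available = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession(MODEL_PATH, providers=available)

# Run one inference; partitioning across NPU/GPU/CPU is handled internally.
# The image-style input shape here is an assumption about the toy model.
name = session.get_inputs()[0].name
outputs = session.run(None, {name: np.random.randn(1, 3, 224, 224).astype(np.float32)})
```

The priority list is the developer's main scheduling lever at this level: reordering it changes which processor handles the supported portions of the graph.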

Speculative decoding reduces generation time by letting a small draft model propose a sequence of future tokens, which the large target model then verifies in a single parallel pass; tokens the target accepts no longer cost one full forward pass each, as strictly sequential decoding would require. Optimizing the storage and retrieval of key-value caches, using techniques like InfiniGen, also improves efficiency. InfiniGen utilizes a caching strategy that prioritizes the most critical cache entries, reducing memory access latency. Software-hardware co-design focuses on optimizing both the algorithms and the underlying hardware for peak performance. This involves tailoring the algorithms to exploit the specific architectural features of the hardware, such as the number of cores, memory bandwidth, and specialized instructions. Researchers also explore optimizations for the attention mechanism and activation functions, alongside streaming and pipelining techniques for continuous data processing. The attention mechanism, crucial for LLMs, can be computationally expensive; approximations and sparse attention patterns are actively researched. Streaming and pipelining allow for overlapping computation and data transfer, improving overall throughput. These advancements support a variety of applications, demanding increasingly sophisticated optimization strategies.
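Returning to speculative decoding, the toy sketch below shows only its accept/verify control flow. The functions draft_next and target_predictions are hypothetical RNG-based stand-ins for real small and large models, and verification here is greedy: a draft token is kept only while the target's own prediction agrees with it.

```python
import numpy as np

def draft_next(tokens, k):
    """Stand-in for a cheap draft model: propose k continuation tokens."""
    rng = np.random.default_rng(sum(tokens))
    return [int(t) for t in rng.integers(0, 100, size=k)]

def target_predictions(tokens):
    """Stand-in for the expensive target model: one parallel pass predicts
    the next token at every position of the sequence."""
    rng = np.random.default_rng(sum(tokens) + 1)
    return rng.integers(0, 100, size=len(tokens))

def speculative_step(tokens, k=4):
    draft = draft_next(tokens, k)                 # k cheap sequential guesses
    preds = target_predictions(tokens + draft)    # ONE target pass scores them all
    out = list(tokens)
    for i, tok in enumerate(draft):
        target_tok = int(preds[len(tokens) + i - 1])  # target's prediction here
        out.append(target_tok)
        if target_tok != tok:                     # a disagreement invalidates the
            break                                 # rest of the draft; keep the fix
    return out

print(speculative_step([1, 2, 3]))
```

With real models, the draft's agreement rate is high enough that several tokens are usually accepted per target pass, which is where the speedup comes from.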

Video conferencing benefits from improved video quality, reduced latency, and features like virtual backgrounds and AI-powered assistants. Lower latency is critical for natural and engaging conversations, while AI-powered features enhance the user experience. Virtual and extended reality experiences become more immersive with reduced latency and improved performance. High frame rates and low motion-to-photon latency are essential for preventing motion sickness and creating a realistic sense of presence. Gaming experiences are enhanced with reduced lag and improved graphics. Real-time rendering and physics simulations demand significant computational resources, which can be alleviated through model optimization and hardware acceleration. AI-powered virtual assistants provide more natural and intelligent interactions. Natural language understanding and speech recognition require efficient LLMs and optimized inference pipelines. Streaming services deliver higher quality and more reliable live content. Adaptive bitrate streaming and content caching are crucial for ensuring a smooth viewing experience, even under varying network conditions.

Further applications include image and video enhancement, gesture recognition, and real-time object detection for robotics and autonomous systems. Image and video enhancement techniques, such as super-resolution and denoising, rely on computationally intensive deep learning models. Gesture recognition and object detection are essential for human-robot interaction and autonomous navigation. Several frameworks and tools facilitate these advancements. ONNX Runtime serves as a cross-platform inference engine, supporting a wide range of hardware and software. DirectML provides machine learning capabilities for Windows, leveraging the power of DirectX. Vitis AI offers an AI development platform for Xilinx/AMD devices, enabling developers to deploy AI models on FPGAs and adaptive SoCs. Popular machine learning frameworks like TensorFlow and PyTorch are also essential, providing the foundation for model development and training.

MediaPipe provides a framework for building perception pipelines, simplifying the development of real-time computer vision applications.

Emerging trends are shaping the future of this field. AI PCs, equipped with dedicated neural processing units (NPUs), promise improved performance and efficiency. NPUs are designed specifically to accelerate deep learning workloads, offering significant efficiency gains over general-purpose CPUs and GPUs for these tasks. Edge computing moves computation closer to the data source, reducing latency and enhancing privacy. Processing data on the edge eliminates the need to transmit data to the cloud, reducing network congestion and improving responsiveness. Heterogeneous computing combines different types of processors for optimal performance, leveraging the strengths of each processor type. Generative AI, powered by LLMs, is driving innovation in content creation, including text, images, and videos. The ability to generate high-quality content on demand has numerous applications, from marketing and advertising to entertainment and education.

In essence, this document describes a rapidly evolving landscape where researchers and engineers are actively addressing the computational challenges of LLMs and other AI models. The overarching goal is to make AI more accessible, efficient, and responsive, enabling its widespread adoption across a diverse range of real-world applications.

👉 More information
🗞 Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems
🧠 DOI: https://doi.org/10.48550/arXiv.2507.14715
