The increasing deployment of artificial intelligence necessitates a move beyond centralised cloud computing for many applications, particularly those demanding low latency or enhanced data privacy. Consequently, researchers are actively investigating the viability of running complex models, such as Large Language Models (LLMs), on edge devices – compact, self-contained computing systems located closer to the data source. A study by Arya, Simmhan, and colleagues from the Indian Institute of Science examines the performance characteristics of LLM inference, the process of using a trained model to generate outputs, on the Jetson Orin AGX, a prevalent edge accelerator.
Their work, entitled ‘Understanding the Performance and Power of LLM Inferencing on Edge Accelerators’, details a comprehensive evaluation of four state-of-the-art LLMs, ranging in size from 2.7 to 32.8 billion parameters, assessing the impact of factors like batch size, sequence length and model quantisation on both computational efficiency and energy consumption. The findings provide valuable data for optimising LLM deployment on resource-constrained edge devices.
This study examines the performance characteristics of LLMs deployed on an edge accelerator, the Nvidia Jetson Orin AGX, and details the impact of quantization techniques on energy consumption, inference speed, and model accuracy. The researchers tested four state-of-the-art models (Meta Llama3, Microsoft Phi-2, Deepseek-R1-Qwen, and Mistral) to determine optimal configurations for resource-constrained environments. The core of the investigation centers on quantization, a process that reduces the numerical precision of model weights and activations, thereby decreasing model size and computational demands.
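The paper itself does not include code, but the basic mechanics of quantization are easy to illustrate. The sketch below applies simple symmetric per-tensor INT8 quantization to a weight matrix in NumPy; real on-device schemes (per-channel or group-wise, as in llama.cpp-style runtimes) are more elaborate, so treat this as a conceptual example only.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0                  # one scale per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floating-point values for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)        # hypothetical weight matrix
q, scale = quantize_int8(w)
print(f"FP32: {w.nbytes / 2**20:.0f} MiB -> INT8: {q.nbytes / 2**20:.0f} MiB")
print(f"max absolute error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```

Shrinking each weight from two bytes (FP16) to one (INT8) or roughly half a byte (INT4) is where the memory and bandwidth savings discussed below come from.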
Results consistently demonstrate that INT8 quantization reduces power usage across all tested models relative to the FP16 baseline (16-bit floating point, the highest precision evaluated), and INT4 quantization lowers power consumption further, although often at a cost to performance. Notably, Deepseek-R1-Qwen could not execute in FP16 at all due to memory constraints, underscoring quantization's critical role in making LLM deployment feasible on resource-limited edge hardware.
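A useful way to compare such configurations is energy per generated token: average power divided by token throughput. The figures below are purely illustrative, not the paper's measurements; they show why a lower-power setting can still lose on energy if it slows generation enough.

```python
def energy_per_token(avg_power_w: float, throughput_tok_s: float) -> float:
    """Joules per generated token: average board power / token rate."""
    return avg_power_w / throughput_tok_s

# Illustrative numbers only, NOT measurements from the paper.
configs = {"FP16": (30.0, 30.0), "INT8": (22.0, 26.0), "INT4": (18.0, 15.0)}
for name, (watts, tps) in configs.items():
    print(f"{name}: {energy_per_token(watts, tps):.2f} J/token")
```

In this made-up example INT4 draws the least power yet costs the most energy per token, because the per-token slowdown outweighs the power saving.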
As precision decreases, throughput (tokens generated per second) generally falls and per-token latency rises. The Mistral model, however, proves resilient: under INT8 it maintains throughput and latency comparable to FP16, suggesting that some architectures are inherently more amenable to reduced precision than others. Llama3, Phi-2, and Deepseek-R1-Qwen all draw less power and energy under INT8 and INT4, but at the expense of higher latency and lower throughput, so the right operating point depends on application-specific requirements.
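Both metrics are straightforward to define operationally: time a generation call and count the tokens it produces. A backend-agnostic measurement sketch follows; `fake_generate` is a hypothetical stand-in for a real llama.cpp or Hugging Face `generate` call.

```python
import time

def measure_decode(generate_fn, prompt, max_new_tokens=128):
    """Time one generation call; return (tokens/s, mean ms per token).

    `generate_fn(prompt, max_new_tokens)` must return the number of
    tokens actually produced by whatever backend is under test.
    """
    t0 = time.perf_counter()
    n_tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - t0
    return n_tokens / elapsed, 1000.0 * elapsed / n_tokens

def fake_generate(prompt, max_new_tokens):   # dummy backend for demonstration
    time.sleep(0.01 * max_new_tokens)        # pretend 10 ms per token
    return max_new_tokens

tps, ms_per_tok = measure_decode(fake_generate, "hello", 64)
print(f"{tps:.1f} tok/s, {ms_per_tok:.1f} ms/token")
```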
Memory footprint consistently decreases at lower precision, which is particularly important for edge devices with limited memory capacity. The study also confirms that increasing sequence length reduces token throughput, and that quantization can actually slow down smaller LLMs. Selecting the optimal quantization level therefore requires weighing the specific model, the application's requirements, and the trade-off between energy efficiency, inference speed, and resource utilization; the researchers emphasize a holistic approach, tailoring LLM configurations to the unique constraints of each edge deployment.
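The memory effect follows directly from bytes per weight. A back-of-the-envelope estimate (weights only, ignoring KV cache, activations, and quantization metadata such as per-group scales) shows why the largest model in the study cannot run at FP16 on this class of device:

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Weights-only estimate; ignores KV cache, activations, and scales."""
    return n_params * bits_per_weight / 8 / 2**30

for name, params in [("Phi-2 (2.7B)", 2.7e9), ("Deepseek-R1-Qwen (32.8B)", 32.8e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_footprint_gib(params, bits):.1f} GiB")
```

At 16 bits, the 32.8-billion-parameter model's weights alone come to roughly 61 GiB before any KV cache or runtime overhead, which is consistent with the reported FP16 failure of Deepseek-R1-Qwen.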
The degree of degradation, however, is model-specific. Whereas Mistral tolerates INT8 with almost no loss relative to FP16, Deepseek-R1-Qwen benefits most clearly from INT8, achieving substantially lower power consumption and energy usage than under INT4, which underlines the importance of model-specific optimization strategies.
The findings also highlight the importance of sequence length and batch size when optimizing LLM inference: increasing sequence length demonstrably reduces token throughput, while careful selection of batch size can mitigate this effect. Together with the quantization level, these parameters offer a multi-faceted way to tailor LLM performance to specific application requirements (a sweep over them is sketched below), and the researchers emphasize the need for a comprehensive optimization framework.
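Such a sweep is simple to express. The harness below is hypothetical, in the spirit of the study's methodology, with `run_inference` standing in for whatever quantized backend is under test.

```python
import itertools
import time

BATCH_SIZES = (1, 2, 4, 8)            # assumed grid, not the paper's exact values
SEQ_LENGTHS = (128, 512, 1024, 2048)

def sweep(run_inference):
    """`run_inference(batch, seq_len)` runs one configuration and returns the
    total number of tokens generated; we record tokens/s for each grid cell."""
    results = {}
    for batch, seq_len in itertools.product(BATCH_SIZES, SEQ_LENGTHS):
        t0 = time.perf_counter()
        tokens = run_inference(batch, seq_len)
        results[(batch, seq_len)] = tokens / (time.perf_counter() - t0)
    return results
```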
Future research should explore techniques for mitigating the performance degradation associated with quantization, such as quantization-aware training and mixed-precision quantization, and should investigate model pruning and knowledge distillation as ways to further reduce model size and computational complexity. Additionally, specialized hardware accelerators designed for quantized models could significantly improve inference performance and energy efficiency. Ultimately, a combination of algorithmic optimization and hardware acceleration will be crucial for unlocking the full potential of LLMs on edge devices.
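Quantization-aware training, one of the mitigation techniques mentioned, simulates low-precision rounding during training while letting gradients bypass the non-differentiable rounding step (the straight-through estimator). A minimal PyTorch sketch of that core idea, not a production QAT recipe:

```python
import torch

class FakeQuantINT8(torch.autograd.Function):
    """Forward: simulate symmetric INT8 rounding. Backward: pass gradients
    straight through, ignoring the non-differentiable round/clamp (STE)."""
    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max() / 127.0
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                      # straight-through estimator

# Weights pass through fake quantization during training, so the model
# learns parameters that survive the eventual INT8 conversion.
w = torch.randn(16, 16, requires_grad=True)
loss = FakeQuantINT8.apply(w).sum()
loss.backward()                                 # gradients flow despite rounding
print(w.grad.abs().sum())
```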
👉 More information
🗞 Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
🧠 DOI: https://doi.org/10.48550/arXiv.2506.09554
