Researchers are tackling the considerable computational burden of deploying large AI models (LAIMs) in embodied AI systems, where resource limitations present a significant obstacle. Zhonghao Lyu, Ming Xiao and Mikael Skoglund from KTH Royal Institute of Technology, in collaboration with Merouane Debbah from Khalifa University and CentraleSupelec, University Paris-Saclay, and H. Vincent Poor from Princeton University, present a novel approach to quantization-aware collaborative inference, termed ‘co-inference’. Their work develops a tractable approximation to understand how reducing model precision impacts performance, subsequently deriving fundamental bounds on the trade-off between compression and accuracy. This research is significant because it formulates a joint design for optimising both model compression and computational frequency, ultimately balancing inference quality, latency and energy consumption for embodied AI agents operating at the edge.
Researchers are tackling a critical bottleneck in the development of truly intelligent robots and autonomous systems: the immense computational demands of large artificial intelligence models (LAIMs). These models, increasingly vital for tasks requiring visual understanding, language processing, and complex reasoning, often overwhelm the limited processing power, memory, and battery life of robots operating in the real world.
This work introduces a novel approach to “collaborative inference,” or co-inference, which allows LAIMs to be deployed across multiple devices (the robot itself, nearby edge servers, and the cloud) to balance performance and efficiency. The study centres on a method for intelligently distributing the computational load of these large models while simultaneously reducing their size through quantization, a process that lowers the precision of the numbers used to represent the model’s parameters.
A key innovation is a new, mathematically tractable approximation that accurately predicts how much performance is lost when a model is quantized, enabling precise control over the trade-off between model size and inference accuracy. This approximation allows researchers to establish clear upper and lower limits on the potential distortion introduced by quantization, dependent on the specific characteristics of the LAIM itself and the chosen bit-width (the number of bits used to represent each parameter).
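As a concrete illustration (not the paper’s exact scheme), a generic sign-preserving uniform quantizer shows how bit-width controls parameter-level distortion; the `quantize` helper, the Laplace-distributed toy weights, and all constants below are illustrative assumptions:

```python
import numpy as np

def quantize(theta, b, theta_max=None):
    # Sign-preserving uniform quantizer: 1 sign bit, (b - 1) magnitude bits
    # spread uniformly over [0, theta_max].
    if theta_max is None:
        theta_max = np.abs(theta).max()
    levels = 2 ** (b - 1) - 1                 # number of magnitude levels
    step = theta_max / levels
    mag = np.minimum(np.abs(theta), theta_max)
    return np.sign(theta) * np.round(mag / step) * step

rng = np.random.default_rng(0)
theta = rng.laplace(scale=0.02, size=100_000)  # toy "model weights"
d4 = np.mean((theta - quantize(theta, 4)) ** 2)  # parameter-level MSE
d8 = np.mean((theta - quantize(theta, 8)) ** 2)
print(f"MSE at 4 bits: {d4:.2e}, at 8 bits: {d8:.2e}")
```

Doubling the bit-width shrinks the quantization step and hence the mean-squared distortion, which is the trade-off the paper’s bounds formalise.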
Building on this foundation, the team formulated a sophisticated design problem, aiming to jointly optimise both the level of quantization applied to the model and the frequency with which computations are performed. The goal is to minimise the upper bound on inference distortion, while ensuring the lower bound remains close, indicating a reliable and accurate estimate of performance loss, all within constraints on latency and energy consumption.
Extensive simulations and real-world experiments using a dedicated testbed validate the accuracy of the distortion approximation and the effectiveness of the proposed joint design in achieving a harmonious balance between inference quality, speed, and energy efficiency for embodied AI systems. Initial evaluations reveal that the proposed distortion approximation closely aligns with the derived rate-distortion bounds, demonstrating a high degree of accuracy in modelling quantization-induced inference errors.
Simulations and real-world testbed experiments consistently demonstrate gains over benchmark schemes in balancing inference quality, latency, and energy consumption within edge embodied AI systems. The study models the statistical distribution of LAIM parameters, preserving sign bits and quantizing magnitudes, and assumes an exponential distribution for parameter magnitude with probability density function pΘ(θ) = λe^(−λθ) for θ ≥ 0, where λ > 0 is the distribution parameter.
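A minimal numerical check of this modelling choice: if weights follow a zero-mean Laplace distribution, their magnitudes are exponentially distributed with the same rate, and the maximum-likelihood estimate of λ is simply one over the mean magnitude. The rate value and sample size below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true = 50.0                          # hypothetical rate parameter
theta = rng.laplace(scale=1 / lam_true, size=200_000)  # toy weights

mags = np.abs(theta)                     # magnitudes should be Exp(lam_true)
lam_hat = 1.0 / mags.mean()              # MLE for the exponential rate
print(f"true lambda: {lam_true}, fitted: {lam_hat:.1f}")
```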
This assumption is empirically supported across diverse models including ResNet-152, VideoMAE, BERT, BLIP-2, GIT, and GPT-3, with empirical distributions closely matching the exponential model and exhibiting a sharp peak of weight values around zero. The analysis of inference delay reveals that on-agent inference time is calculated as t(b, f) = b·N_FLOP/(b_full·c·f), where N_FLOP represents the number of floating-point operations for full-precision inference, b is the quantization bit-width, b_full is the full-precision bit-width, f is the clock frequency, and c is the number of FLOPs per CPU cycle.
Furthermore, the on-server inference delay is determined by t(f) = N_FLOP/(c·f), utilising analogous parameters for the server processor. Energy consumption calculations show that on-agent inference requires e(b, f) = η·ψ·f²·b·N_FLOP/(b_full·c), where η is the power usage effectiveness and ψ is a chip-dependent power coefficient (chip power scales as ψf³). Correspondingly, on-server inference consumes e(f) = η·ψ·f²·N_FLOP/c.
Total inference delay is expressed as T(b, f, f′) = t(b, f) + t(f′), and total energy consumption as E(b, f, f′) = e(b, f) + e(f′), where f and f′ denote the agent and server clock frequencies, respectively. The work highlights that even with coarse-grained frequency configurations, jointly optimising computation and quantization remains crucial for improving LAIM co-inference performance in practical edge embodied AI systems. A tractable approximation for quantization-induced inference distortion underpinned the methodological approach to this work.
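These delay and energy expressions can be turned into a small cost calculator. The sketch below assumes a 32-bit full-precision baseline and reuses the simulation parameters quoted later in the article (a 2 GHz agent at 32 FLOPs per cycle with PUE 1 and ψ = 2 × 10⁻²⁹, a 10 GHz server at 128 FLOPs per cycle with PUE 2 and ψ = 1 × 10⁻²⁸); the FLOP counts for the agent and server model partitions are hypothetical choices for illustration:

```python
B_FULL = 32  # assumed full-precision bit-width

def agent_cost(b, f, n_flop, c=32, eta=1.0, psi=2e-29):
    # cycles needed for the quantized agent-side partition
    cycles = b * n_flop / (B_FULL * c)
    t = cycles / f                    # t(b, f)
    e = eta * psi * f ** 2 * cycles   # e(b, f) = eta * (psi f^3) * t
    return t, e

def server_cost(f, n_flop, c=128, eta=2.0, psi=1e-28):
    cycles = n_flop / c               # server runs at full precision
    t = cycles / f                    # t(f)
    e = eta * psi * f ** 2 * cycles   # e(f)
    return t, e

# Hypothetical split: 10 GFLOPs on the agent at 8 bits and 2 GHz,
# 100 GFLOPs on the server at 10 GHz.
ta, ea = agent_cost(b=8, f=2e9, n_flop=10e9)
ts, es = server_cost(f=10e9, n_flop=100e9)
print(f"T = {ta + ts:.4f} s, E = {ea + es:.4f} J")
```

Under these assumed numbers, on-server energy dominates the total while the two delay terms are of the same order, which is why the joint design must weigh both resources at once.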
This approximation, based on parameter-level perturbations, allowed for the derivation of both lower and upper bounds on the quantization rate-distortion function, thereby characterising its dependence on large model (LAIM) statistics and quantization bit-width. The research then formulated a joint optimisation problem, designing bit-width and computation frequency under both delay and energy constraints to minimise the upper bound on distortion while ensuring the tightness of the corresponding lower bound.
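The flavour of this joint design can be sketched as a brute-force grid search: choose the (bit-width, frequency) pair that minimises a distortion bound subject to delay and energy budgets. Here D(b) = 2^(−2b) is a hypothetical surrogate for the paper’s distortion upper bound, and the cost model, budgets, and grids are all illustrative assumptions:

```python
# Agent-side cost model: cycles = b * N_FLOP / (32 * c), t = cycles / f,
# e = eta * psi * f^2 * cycles (assumed 32-bit full-precision baseline).
N_FLOP, C, ETA, PSI = 10e9, 32, 1.0, 2e-29
T_MAX, E_MAX = 0.05, 0.01            # hypothetical delay (s) / energy (J) budgets

best = None
for b in (2, 4, 6, 8, 16):                   # candidate bit-widths
    for f in (0.5e9, 1e9, 1.5e9, 2e9):       # coarse frequency grid (Hz)
        cycles = b * N_FLOP / (32 * C)
        t, e = cycles / f, ETA * PSI * f ** 2 * cycles
        if t <= T_MAX and e <= E_MAX:        # feasibility check
            d = 2.0 ** (-2 * b)              # surrogate distortion bound D(b)
            if best is None or d < best[0]:
                best = (d, b, f)

print("best (distortion, bit-width, frequency):", best)
```

Even this crude search reproduces the qualitative finding: the budgets cap how high the bit-width can go, and the best feasible point pairs the largest admissible bit-width with the frequency that just meets the delay constraint.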
To validate these theoretical developments, extensive simulations were conducted using a setup mirroring that of prior work, employing two Nvidia RTX 3090 GPUs. Maximum clock frequencies were set to 2 GHz and 10 GHz, with 32 and 128 FLOPs per cycle respectively, while power usage effectiveness (PUE) values of 1 and 2 were tested alongside power coefficients of 2 × 10⁻²⁹ W/(cycle/s)³ and 1 × 10⁻²⁸ W/(cycle/s)³.
The proposed design was rigorously compared against three benchmark schemes: a Proximal Policy Optimisation (PPO)-based design utilising reinforcement learning, a fixed-frequency design optimising only bit-width, and a feasible random design sampling bit-widths and checking for feasibility. Further validation extended to a real-world testbed comprising an NVIDIA Jetson AGX Orin 64GB edge device and a Dell PowerEdge R740 server, connected via a stable 5GHz WLAN network.
Recognising the challenges of precise frequency control on the edge device, three accessible operating profiles (low, medium, and high frequency) were adopted, each representing a feasible setting supported by the Jetson AGX Orin. This allowed for evaluation under delay or energy constraints, with co-inference performance reported in Table I, demonstrating the practicality of the proposed joint quantization and computation design in edge environments.
The relentless pursuit of artificial intelligence that can genuinely interact with the physical world has always been hampered by a simple truth: brains are remarkably energy-efficient, and current AI systems are not. This work offers a practical step towards bridging that gap by tackling the computational burden of large AI models deployed on robots and other ‘embodied’ agents.
The core insight, optimising how these models are compressed for use on limited hardware, is noteworthy. For years, researchers have focused on simply shrinking models, often sacrificing accuracy in the process. This approach demonstrates a more nuanced strategy, explicitly modelling the trade-off between reducing a model’s size and the resulting loss of information.
By carefully adjusting the precision of the model’s calculations and how often those calculations are performed, it’s possible to achieve significant energy savings without catastrophic performance drops, particularly crucial for edge devices where power and processing are constrained. However, the theoretical bounds established here rely on accurate characterisation of the model’s internal statistics.
Real-world models are complex and constantly evolving, meaning these approximations may not always hold. Furthermore, the experiments, while demonstrating success in simulated and controlled environments, need to be replicated across a wider range of robotic platforms and tasks to confirm the approach’s robustness. Looking ahead, this work could inspire new hardware architectures specifically designed to accelerate these ‘co-inference’ strategies. Beyond robotics, the principles of balancing precision and frequency could also be applied to other areas, such as on-device natural language processing and computer vision, bringing more powerful AI capabilities to mobile phones and wearable devices.
👉 More information
🗞 Quantization-Aware Collaborative Inference for Large Embodied AI Models
🧠 ArXiv: https://arxiv.org/abs/2602.13052
