Operational Intensity and Capacity Footprint Unlock AI Agent Inference Performance

Researchers are increasingly focused on the infrastructural challenges posed by the growing demands of AI agent inference, particularly memory capacity and bandwidth limitations within datacentres. The work discussed here is by Yiren Zhao and Junyi Liu, whose institutional affiliations are not stated in this publication.

AI Inference Bottlenecks: Operational Intensity and Capacity

Scientists have identified two crucial metrics, Operational Intensity (OI) and Capacity Footprint (CF), to better understand the evolving demands of AI agent inference and overcome limitations of traditional performance analysis methods. This research addresses a critical challenge in the rapidly expanding field of AI: sustaining efficiency and capability as models grow and applications diversify. The team demonstrated that existing roofline analysis, commonly used to assess system performance, fails to fully capture the complexities introduced by memory capacity limitations, a phenomenon they term the “memory capacity wall”. Their work reveals that both OI and CF fluctuate dramatically across different agentic workflows (chat, coding, web use, and computer use) and base model choices such as GQA/MLA, MoE, and quantization.

The study meticulously examines how long context Key-Value (KV) caching significantly increases memory demands during the decoding process, pushing systems towards being heavily memory-bound. Researchers quantified these effects using the newly defined OI and CF metrics, providing a more granular understanding of workload characteristics than previously available. This detailed analysis motivates a shift towards disaggregated serving and system-level heterogeneity, advocating for specialized accelerators for prefill and decode stages, enhanced scale-up networking, and the decoupling of memory from compute. The work establishes that these advancements are essential to address the increasing pressures on accelerators, interconnects, and memory systems imposed by large-scale AI agent inference.
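To make the memory-bound decode argument concrete, here is a minimal roofline-style sketch in Python. The hardware figures (1 PFLOP/s of compute, 3.35 TB/s of HBM bandwidth), the model size, and the per-token KV cost are illustrative assumptions, not numbers from the paper:

```python
# A minimal roofline-style sketch of one decode step at batch size 1.
# All hardware and model numbers below are illustrative assumptions.

PEAK_FLOPS = 1.0e15   # 1 PFLOP/s dense compute (assumption)
PEAK_BW = 3.35e12     # 3.35 TB/s HBM bandwidth (assumption)

def decode_step_times(weight_bytes, kv_bytes_per_token, context_len, flops_per_token):
    """Return (compute_time_s, memory_time_s) for generating one token."""
    bytes_read = weight_bytes + kv_bytes_per_token * context_len
    return flops_per_token / PEAK_FLOPS, bytes_read / PEAK_BW

# Hypothetical 70B-parameter dense model served in FP8 (~1 byte/param),
# with roughly 1 MB of KV cache per token of context (assumption).
weights = 70e9
kv_per_token = 1.0e6
flops = 2 * 70e9  # ~2 FLOPs per parameter per generated token

for ctx in (1_000, 100_000):
    t_compute, t_memory = decode_step_times(weights, kv_per_token, ctx, flops)
    print(f"context {ctx:>7}: compute {t_compute*1e3:6.2f} ms, memory {t_memory*1e3:6.2f} ms")
# Memory time dominates and grows with context: decode is heavily memory-bound.
```

At short contexts the weight read already dominates; at long contexts the KV read grows without bound, which is exactly the pressure that motivates moving decode onto bandwidth-rich, capacity-rich hardware.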

Furthermore, the research hypothesizes that co-designing AI agents with hardware, employing multiple inference accelerators within a single system, and disaggregating high-bandwidth, large-capacity memory are foundational steps for adapting to the changing OI/CF profiles of future workloads. Experiments show that the traditional roofline model and metrics like Model FLOPs Utilization (MFU) and Memory Bandwidth Utilization (MBU) often fail to accurately reflect performance bottlenecks when memory capacity is the limiting factor. By introducing OI and CF, the team provides a more complete picture of the AI agent inference system, enabling a more targeted approach to hardware and software optimization. The team’s findings suggest that future AI data centers will require a cohesive integration of heterogeneous components at the datacenter scale to efficiently support the demands of large-scale agentic AI inference. Specifically, the research highlights the importance of considering factors like agentic workflows, base model architectures, and system-level optimizations, including quantization and prefill-decode disaggregation, when designing and scaling AI infrastructure. This work opens new avenues for research into adaptive systems that can dynamically adjust to evolving OI/CF characteristics, ultimately paving the way for more powerful and efficient AI agents.

OI and CF Metrics for AI Inference

Scientists investigated the evolving demands of AI agent inference, identifying critical bottlenecks in memory capacity, bandwidth, and interconnect speed. The research team developed Operational Intensity (OI) and Capacity Footprint (CF) metrics to characterise inference regimes, revealing limitations beyond those captured by traditional roofline analysis, including the memory capacity wall. Experiments employed diverse agentic workflows (chat, coding, and web use) and varied base model choices, such as GQA/MLA, MoE, and different quantization levels, to demonstrate how OI and CF can shift dramatically, particularly with long-context KV caches driving memory-bound decode operations. The study pioneered a detailed analysis of OI and decode time, comparing dense and sparse MoE models at batch sizes of 1 and 16, with model weights shown as shaded regions in the accompanying figure.
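The intuition behind the dense-versus-MoE comparison can be captured in a few lines. The sketch below is our own simplification (the parameter counts and the fraction of experts touched per step are assumptions, not the paper's measurements): at a given batch size, an MoE step performs fewer FLOPs per token while still reading a large share of the expert weights, which pushes its OI down.

```python
# Sketch of decode-time operational intensity, dense vs. sparse MoE.
# Counts elements rather than bytes; all model numbers are assumptions.

def dense_decode_oi(batch, n_params):
    # Each token does ~2 FLOPs/param; the full weights are read once per
    # step and shared across the whole batch.
    return (batch * 2 * n_params) / n_params

def moe_decode_oi(batch, active_params, total_params, experts_touched_frac):
    # Each token computes with only its active experts, but the step must
    # still read whatever fraction of all expert weights the batch routes to.
    flops = batch * 2 * active_params
    weights_read = experts_touched_frac * total_params
    return flops / weights_read

print(dense_decode_oi(batch=16, n_params=70e9))                       # 32.0
print(moe_decode_oi(batch=16, active_params=13e9,
                    total_params=400e9, experts_touched_frac=0.5))    # ~2.1
# At the same batch size the MoE step sits far deeper in memory-bound territory.
```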

Researchers observed that MoE models exhibit significantly more memory-bound behaviour when considering OI, and actively explored model sparsity across the token sequence dimension to reduce KV cache read sizes. Furthermore, the work quantified the impact of quantization, confirming its effectiveness in reducing compute requirements, memory capacity, and bandwidth needs, directly lowering CF, and noting Nvidia’s NVFP4 and GPT-OSS’s support for this format. To understand system-level implications, scientists analysed the differing operational intensities of prefill and decode phases, advocating for disaggregated serving with specialised accelerators for each. The team demonstrated that agentic AI requires increased prefill tokens and supports larger context lengths, as evidenced by the complexity of agentic tool definitions and multi-step interactions.
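To make the quantization point concrete, a back-of-the-envelope sketch: the bits-per-parameter figures below are common values, and the ~4.5-bit effective cost for NVFP4 (4-bit values plus a shared FP8 scale per 16-element block) is our approximation, not a number from the paper.

```python
# Back-of-the-envelope weight footprint of a hypothetical 70B-param model
# under different precisions. NVFP4's ~4.5 effective bits/param (4-bit
# values plus a shared FP8 scale per 16-element block) is our approximation.

BITS_PER_PARAM = {"FP16": 16.0, "FP8": 8.0, "NVFP4": 4.0 + 8.0 / 16}

def weight_footprint_gib(n_params, fmt):
    return n_params * BITS_PER_PARAM[fmt] / 8 / 2**30

for fmt in BITS_PER_PARAM:
    print(f"{fmt:>6}: {weight_footprint_gib(70e9, fmt):6.1f} GiB")
# FP16 -> NVFP4 cuts weight bytes (hence CF, and bandwidth per token) ~3.6x.
```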

They highlighted that the KV cache has become a primary challenge for both memory capacity and bandwidth, necessitating efficient system-level parallelism and heterogeneous compute, memory, and networking architectures. Researchers further hypothesised that advanced packaging and DRAM die stacking alone will not sustain performance scaling, given the asymmetric growth of compute FLOPs versus memory and interconnect. The study proposes disaggregating compute and memory via future optical I/O technologies that could potentially deliver D2D-scale bandwidth.

Operational Intensity and Capacity Footprint for AI

Scientists introduced Operational Intensity (OI) and Capacity Footprint (CF) as novel metrics to characterise AI agent inference, revealing limitations beyond those identified by traditional roofline analysis. The team measured OI as the number of operations performed per byte of data moved from DRAM, a classic metric used in roofline models, while CF quantified the number of bytes needed in DRAM per agent request for LLM generation. Experiments demonstrated that the product of batch size and CF directly gives the capacity requirement on DRAM, providing a crucial understanding of memory demands. Researchers calculated OI for a simple matrix multiplication, Y = WX, where W is an m×d weight matrix and X a d×L activation matrix, yielding OI = 2mdL / (md + dL + mL), with m the number of output rows, d the hidden dimension, and L the sequence length.
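As a worked illustration of that formula, a minimal Python sketch (counting elements moved rather than bytes, as the formula above does; the dimensions are arbitrary):

```python
def matmul_oi(m, d, L):
    """Operational intensity of Y = W @ X with W (m x d) and X (d x L),
    counting elements moved rather than bytes, as in the formula above."""
    flops = 2 * m * d * L                  # one multiply + one add per term
    elements_moved = m*d + d*L + m*L       # read W and X, write Y
    return flops / elements_moved

# Decode-like case: L = 1 token, so OI ~ 2 and the weight read dominates.
print(matmul_oi(m=4096, d=4096, L=1))      # ~2.0
# Prefill-like case: a long sequence amortises the weight read.
print(matmul_oi(m=4096, d=4096, L=4096))   # ~2730.7
```

The two extremes show why prefill tends to be compute-bound while decode is memory-bound: the same multiplication lands on opposite sides of the roofline depending on L.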

Measurements confirm that the capacity footprint, considering only the storage of W, is md/B, where B is the batch size, since the weights are shared across all requests in the batch. In the case of KV caching, the team modified the equation to account for storing K and V, changing CF to 2dL + md/B, thereby accurately reflecting the per-request KV cache, which, unlike the weights, cannot be shared. These calculations provide a precise quantification of resource usage during inference. Data shows that OI and CF together offer a more complete picture of AI agent inference systems than existing models, addressing limitations in capturing memory capacity constraints.
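The same bookkeeping can be written out directly. In the sketch below, the byte width and dimensions are illustrative assumptions; the point is that multiplying CF by the batch size recovers the total DRAM requirement, as the definition above intends:

```python
def cf_weights_only(m, d, batch, bytes_per_elem=2):
    """CF when only W is stored: the weights are shared, so each of the
    `batch` requests accounts for md/B of them."""
    return m * d * bytes_per_elem / batch

def cf_with_kv(m, d, L, batch, bytes_per_elem=2):
    """With KV caching each request also holds its own K and V (2dL
    elements), which cannot be shared: CF = 2dL + md/B."""
    return (2 * d * L + m * d / batch) * bytes_per_elem

B, m, d, L = 16, 4096, 4096, 32768
cf = cf_with_kv(m, d, L, B)
total_dram = B * cf  # equals (B * 2dL + md) * bytes_per_elem
print(f"CF per request: {cf / 2**30:.2f} GiB; total DRAM: {total_dram / 2**30:.2f} GiB")
```

At long contexts the 2dL term dwarfs md/B, so per-request KV state, not the shared weights, sets the DRAM capacity requirement.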

The study highlights that while the traditional roofline model focuses on compute-bound and memory-bandwidth-bound regions, it fails to adequately explain under-utilisation arising from memory capacity limitations, termed the “memory capacity wall”. Results demonstrate that both Model FLOPs Utilization (MFU) and Memory Bandwidth Utilization (MBU) can be low simultaneously, indicating a bottleneck beyond compute and bandwidth. Tests show that several factors, including agentic workflows, model design, and optimizations, significantly influence OI and CF, potentially shifting workload characteristics between regimes. The research recorded token utilisation characteristics across chatbot, coding agent, web-use agent, and computer-use scenarios, revealing substantial variations in OI and CF. These observations motivate disaggregated serving and system-level heterogeneity, including specialised prefill and decode accelerators, broader scale-up networking, and decoupled memory enabled by optical I/O, paving the way for efficient large-scale agentic AI inference.
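A small numerical sketch of the capacity wall (all hardware and model figures are our illustrative assumptions): once weights and per-request KV caches exhaust DRAM, the batch size, and with it MFU, is capped regardless of available FLOPs or bandwidth.

```python
# All hardware figures below are illustrative assumptions.
DRAM_CAPACITY = 192e9   # bytes of accelerator DRAM
PEAK_FLOPS = 1.0e15     # 1 PFLOP/s

def max_batch(weight_bytes, kv_bytes_per_request):
    """Largest batch whose weights plus per-request KV caches fit in DRAM."""
    return int((DRAM_CAPACITY - weight_bytes) // kv_bytes_per_request)

def mfu(batch, flops_per_token, step_time_s):
    """Achieved fraction of peak FLOPs for one decode step."""
    return batch * flops_per_token / (step_time_s * PEAK_FLOPS)

b = max_batch(weight_bytes=70e9, kv_bytes_per_request=4e9)  # 4 GB KV/request
print(b)                                                    # -> 30
print(mfu(b, flops_per_token=140e9, step_time_s=0.05))      # ~0.084
# Capacity, not FLOPs or bandwidth, dictates the feasible batch and hence MFU.
```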

OI and CF Reveal AI Scaling Limits

Scientists have identified Operational Intensity (OI) and Capacity Footprint (CF) as key metrics for understanding bottlenecks in AI agent inference workloads. These metrics offer a more detailed analysis than traditional roofline models, particularly regarding the impact of memory capacity limitations encountered in long-context agentic tasks. The research demonstrates how choices in model architecture, workload scenarios, and optimisation techniques can significantly alter the balance between OI and CF, potentially shifting workloads into different system bottleneck regimes. This analysis highlights the need to reconsider current hardware and system designs to sustain the scaling of AI agent systems.

Researchers suggest disaggregated compute, heterogeneous architectures, and workload-aware co-design as crucial pathways for overcoming existing limitations and achieving improved efficiency and capability. The authors acknowledge that their work focuses on current agentic workflows and may not fully capture the complexities of future, yet-to-emerge applications. Future research should explore agent-hardware co-design, incorporating multiple inference accelerators and disaggregated memory systems to adapt to evolving OI/CF characteristics and support the increasing demands of agent memory usage.

👉 More information
🗞 Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
🧠 ArXiv: https://arxiv.org/abs/2601.22001

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
