NoI Topology Synthesis Tames Tail Latency for Chiplet-Based Accelerators and Mixed Workloads

Modern computer systems increasingly rely on disaggregated chiplets to boost performance, but this approach introduces communication delays within the on-package network, known as the Network-on-Interposer (NoI). Arnav Shukla from the Indraprastha Institute of Information Technology Delhi, Harsh Sharma from Washington State University, Srikant Bharadwaj from Microsoft Research, and Vinayak Abrol and Sujay Deb from the Indraprastha Institute of Information Technology Delhi demonstrate that large-scale model inference generates substantial data movement that significantly inflates these delays, potentially violating performance agreements. Their research addresses this challenge by introducing a new method for designing these networks, optimizing for both speed and efficiency, and a novel scoring system to predict performance under heavy load. The team’s topology generator, PARL, reduces worst-case slowdown by a factor of 1.2, a substantial improvement over existing designs that redefines how these networks are created for advanced chiplet-based accelerators.

However, this on-package disaggregation introduces latency within the Network-on-Interposer (NoI). Observations reveal that modern large-model inference generates significant parameter and activation movement between HBM and DRAM, creating large, bursty flows that impact the interposer. These memory-driven transfers inflate tail latency and threaten Service Level Agreements (SLAs) across conventional k-ary n-cube NoI topologies. To address this, researchers introduce an Interference Score (IS) that quantifies worst-case interference between memory and compute traffic within the NoI.
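The paper's exact IS definition is not reproduced in this summary, but a minimal sketch of the idea, assuming IS captures the worst-case slowdown any compute flow suffers when memory traffic shares the fabric, could look like the following (the latency figures are invented for illustration):

```python
def interference_score(isolated_latency, contended_latency):
    """Worst-case slowdown across flows: the maximum ratio of a flow's
    latency with co-running memory traffic to its latency when running
    alone. An IS of 1.0 means perfect isolation; larger values mean
    memory traffic is inflating some flow's tail latency."""
    assert len(isolated_latency) == len(contended_latency)
    return max(c / i for c, i in zip(contended_latency, isolated_latency))

# Hypothetical per-flow latencies (in cycles) for three compute flows.
isolated = [100.0, 120.0, 90.0]
contended = [110.0, 180.0, 95.0]
print(interference_score(isolated, contended))  # → 1.5 (the second flow)
```

Taking the maximum rather than the mean is what makes the metric tail-focused: a single badly interfered flow is enough to raise the score.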

Sparse MoE Workloads and NoI Interconnect Design

This research addresses the limitations of current Network-on-Interposer (NoI) interconnect designs for large-scale AI models, particularly those employing Mixture-of-Experts (MoE) architectures. The authors demonstrate that traditional, uniform NoI topologies struggle with the irregular communication patterns generated by MoE workloads and the high bandwidth demands of off-chip High Bandwidth Memory (HBM); simply increasing bandwidth proves insufficient. The team formalizes this problem by introducing an Interference Score to quantify the trade-off between throughput and partition isolation, framing NoI design as a multi-objective optimization problem.
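Framing the design task as multi-objective optimization means no single topology is simply "best": candidates trade throughput against isolation, and the interesting designs are the non-dominated ones. A small sketch of Pareto filtering over hypothetical (throughput, isolation) scores, with all names and numbers invented, illustrates the framing:

```python
def pareto_front(designs):
    """Keep only designs not dominated on (throughput, isolation).
    A design is dominated if another is at least as good on both
    objectives and strictly better on at least one."""
    front = []
    for d in designs:
        dominated = any(
            o is not d
            and o[1] >= d[1] and o[2] >= d[2]
            and (o[1] > d[1] or o[2] > d[2])
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front

# (name, normalized throughput, normalized isolation) — made-up examples.
designs = [
    ("mesh",      0.95, 0.40),  # high throughput, weak isolation
    ("clustered", 0.88, 0.85),  # balanced
    ("isolated",  0.60, 0.95),  # strong isolation, lower throughput
    ("dominated", 0.55, 0.35),  # worse than "clustered" on both axes
]
print([d[0] for d in pareto_front(designs)])  # → ['mesh', 'clustered', 'isolated']
```

The dominated candidate drops out; everything that survives represents a genuinely different trade-off for a designer (or a learned agent) to choose among.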

They propose a shift towards non-uniform NoI topologies designed to cluster experts and provide dedicated bandwidth, improving isolation and reducing interference. A Reinforcement Learning (RL) agent, named PARL, explores the design space and discovers topologies that balance throughput and isolation. Simulations demonstrate that PARL-generated topologies outperform traditional uniform fabrics on both interference score and overall performance under mixed MoE workloads. This work contributes a theoretical framework for understanding interconnect challenges in sparse MoE models, a novel Interference Score metric, an RL-driven NoI design methodology, and experimental validation of improved performance and isolation. In essence, the paper advocates for specialized, partition-aware topologies that handle the distinctive communication patterns of large-scale sparse MoE models, weighing throughput and isolation together rather than optimizing either alone.
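PARL's actual agent, reward function, and simulator are not described in this summary. As a toy stand-in for that search loop, the sketch below greedily flips links in a random topology to maximize a made-up reward: total link count as a throughput proxy, minus a penalty for links crossing the memory/compute partition as an interference proxy. Everything here (the reward, the weight, the search strategy) is an assumption for illustration only.

```python
import random

def evaluate(topology):
    """Stand-in objective: reward link count (throughput proxy) and
    penalize links crossing the memory/compute partition (interference
    proxy). The real PARL agent would query a network simulator here."""
    n = len(topology)
    half = n // 2  # assume the first half of nodes are memory-side
    links = sum(topology[i][j] for i in range(n) for j in range(i + 1, n))
    cross = sum(topology[i][j] for i in range(half) for j in range(half, n))
    lam = 2.0  # hypothetical isolation weight
    return links - lam * cross

def random_topology(n, p=0.3, rng=None):
    """Symmetric random adjacency matrix with link probability p."""
    rng = rng or random.Random(0)
    t = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            t[i][j] = t[j][i] = int(rng.random() < p)
    return t

def search(n=8, steps=200, seed=0):
    """Greedy single-link-flip search: flip a random link, keep the
    flip only if the objective does not get worse."""
    rng = random.Random(seed)
    best = random_topology(n, rng=rng)
    best_score = evaluate(best)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        best[i][j] ^= 1
        best[j][i] ^= 1
        s = evaluate(best)
        if s >= best_score:
            best_score = s
        else:  # revert the flip
            best[i][j] ^= 1
            best[j][i] ^= 1
    return best, best_score

topo, score = search()
print(score)
```

A real RL formulation would replace the greedy rule with a learned policy and the closed-form objective with simulated latency and throughput, but the structure of the loop (propose a topology edit, score it, keep what helps) is the same.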

NoI Synthesis Optimizes Memory Traffic and Latency

Researchers have achieved a significant breakthrough in the design of chiplet-based systems, addressing latency issues within Network-on-Interposer (NoI) architectures. The work demonstrates that modern large-model inference generates substantial memory traffic between High Bandwidth Memory (HBM) and Dynamic Random Access Memory (DRAM), creating contention that negatively impacts tail latency and violates Service Level Agreements (SLAs). To quantify this contention, the team developed an Interference Score (IS) that accurately measures worst-case slowdown under these conditions. The core of this advancement lies in a novel approach to NoI synthesis, formulated as a multi-objective optimization (MOO) problem.

This allows for systematic exploration of the complex trade-offs between performance isolation and maximizing overall system throughput. A topology generator, named PARL (Partition-Aware Reinforcement Learner), navigates this optimization landscape, creating configurations that achieve robust performance isolation without compromising efficiency. Experiments reveal that PARL-generated topologies reduce contention at the memory cut, successfully meeting SLAs while simultaneously reducing worst-case slowdown. Importantly, this performance was achieved while maintaining competitive mean throughput relative to link-rich mesh topologies. These results demonstrate a substantial improvement in NoI design, paving the way for future accelerator designs that move beyond regular fabric approaches and embrace workload-aware, non-uniform architectures.
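The "memory cut" can be pictured as the set of links separating memory-side chiplets from compute chiplets; summing link bandwidth across that cut gives a crude view of how much of the fabric is exposed to bursty HBM/DRAM flows. A sketch with an invented 4-node topology and a uniform 32 GB/s per-link bandwidth (both assumptions, not the paper's model):

```python
def memory_cut_bandwidth(adj, bw, memory_nodes):
    """Sum per-link bandwidth over links crossing the cut between
    memory chiplets and compute chiplets. adj is a symmetric 0/1
    adjacency matrix, bw a matching per-link bandwidth matrix."""
    mem = set(memory_nodes)
    n = len(adj)
    return sum(
        bw[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if adj[i][j] and ((i in mem) != (j in mem))
    )

# Toy system: nodes 0-1 are the HBM/DRAM side, nodes 2-3 are compute.
adj = [[0, 1, 1, 0],
       [1, 0, 0, 1],
       [1, 0, 0, 1],
       [0, 1, 1, 0]]
bw = [[32] * 4 for _ in range(4)]  # uniform 32 GB/s per link
print(memory_cut_bandwidth(adj, bw, {0, 1}))  # → 64 (two crossing links)
```

A synthesized topology can then be checked for whether the cut is provisioned to absorb the memory bursts the workload actually generates, rather than uniformly over-provisioning every link.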

NoI Design for Sparse AI Workloads

This research demonstrates the limitations of conventional network-on-interposer (NoI) designs when applied to modern, large-scale AI workloads, particularly those involving heterogeneous chiplets and sparse models. The team identified that substantial memory traffic, combined with irregular communication patterns, creates performance bottlenecks that degrade system throughput and increase latency. These limitations stem from the inherent inability of uniform NoI topologies to efficiently manage the demands of these emerging workloads. To address these challenges, researchers developed a novel approach to NoI design, framing it as a multi-objective optimization problem.

This enabled the systematic exploration of trade-offs between performance isolation and overall system efficiency. The resulting topology generator, PARL, creates sparse, clustered NoI configurations that prioritize bandwidth for specific processing units while maintaining sufficient global connectivity. Results show that PARL-generated topologies reduce worst-case slowdown and meet service level agreements, although with a slight reduction in mean throughput compared to traditional link-rich mesh designs. The authors acknowledge that while PARL improves performance isolation, there is a trade-off with aggregate throughput, and future work may focus on further refining the optimization process to better balance these competing objectives. This research advances interconnect architectures for next-generation heterogeneous AI systems, demonstrating the need for workload-aware, non-uniform NoI designs that move beyond conventional, regular fabric approaches.
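As an illustration of "sparse, clustered" connectivity (the concrete structure below is assumed for illustration, not taken from the paper): dense links inside each expert cluster provide dedicated local bandwidth, while a thin ring between cluster heads preserves global reachability at low link cost.

```python
def clustered_topology(num_clusters=4, cluster_size=4):
    """Partition-aware sketch topology: fully connected inside each
    cluster, plus a single ring of global links between cluster heads
    so every chiplet remains reachable."""
    n = num_clusters * cluster_size
    adj = [[0] * n for _ in range(n)]
    for c in range(num_clusters):
        base = c * cluster_size
        # Dense intra-cluster links: dedicated local bandwidth.
        for i in range(base, base + cluster_size):
            for j in range(i + 1, base + cluster_size):
                adj[i][j] = adj[j][i] = 1
        # One global link to the next cluster's head: a sparse ring.
        nxt = ((c + 1) % num_clusters) * cluster_size
        adj[base][nxt] = adj[nxt][base] = 1
    return adj

adj = clustered_topology()
links = sum(adj[i][j] for i in range(16) for j in range(i + 1, 16))
print(links)  # → 28: 4 clusters × C(4,2) = 24 intra-links + 4 ring links
```

Local traffic never leaves its cluster, so an expert's burst cannot contend with another cluster's flows except on the thin global ring, which is the isolation property the article describes.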

👉 More information
🗞 Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
🧠 ArXiv: https://arxiv.org/abs/2510.24113

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
