Scientists are addressing the immense computational demands of large language models (LLMs) with a novel processing-in-memory (PIM) accelerator called PRIMAL. Yue Jiet Chong, Yimin Wang, and Zhen Wu from the National University of Singapore, together with Xuanyao Fong and colleagues, designed PRIMAL to improve both the speed and energy efficiency of LLM inference with low-rank adaptation (LoRA). The system combines heterogeneous PIM processing elements (PEs) with an SRAM reprogramming and power gating (SRPG) scheme, enabling pipelined LoRA updates while sharply reducing power consumption. By optimizing dataflow and minimizing communication overhead, PRIMAL outperforms the Nvidia H100 GPU on the Llama-13B model with LoRA rank 8, offering a promising pathway toward more sustainable and accessible AI applications.
PRIMAL achieves a 1.5× increase in throughput and a 25× improvement in energy efficiency compared to the Nvidia H100 when processing LoRA rank 8 (Q, V) on Llama-13B. The architecture features heterogeneous PEs interconnected via a 2D-mesh inter-PE computational network (IPCN), enabling efficient parallel processing and reduced data movement. This design represents a significant step forward in hardware acceleration for LLMs, addressing the growing need for fast, energy-efficient AI computation.
At the core of PRIMAL is a chiplet-based architecture composed of multiple compute tiles (CTs). Each CT houses the 2D-mesh IPCN, which dynamically orchestrates dataflow and executes data multiply-accumulate (DMAC) operations critical for attention score computations. These are complemented by static weight multiply-accumulate (SMAC) operations performed by the PEs, which leverage both non-volatile RRAM analog compute-in-memory (RRAM-ACIM) and volatile SRAM digital compute-in-memory (SRAM-DCIM) macros. The RRAM-ACIM stores large pre-trained weights, while the SRAM-DCIM efficiently handles smaller, frequently updated LoRA matrices, creating a synergistic compute system.
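To make the division of labor concrete, the sketch below shows how a LoRA-augmented projection y = (W + BA)x splits across the two macro types: the large pre-trained weight stays put, while the small adapter matrices are the only parts that ever need rewriting. This is an illustrative NumPy model, not the paper's kernel; all names and shapes are our own.

```python
import numpy as np

# Minimal sketch of the heterogeneous compute split described above.
# Names and shapes are illustrative, not taken from the paper.
d_model, rank = 1024, 8          # hidden size and LoRA rank (rank 8 for Q, V)

W = np.random.randn(d_model, d_model)   # pre-trained weight, held in RRAM-ACIM
A = np.random.randn(rank, d_model)      # LoRA down-projection, held in SRAM-DCIM
B = np.random.randn(d_model, rank)      # LoRA up-projection, held in SRAM-DCIM
x = np.random.randn(d_model)            # input activation

# SMAC path: static weights stay in the non-volatile analog macro.
y_static = W @ x

# LoRA path: the small, frequently updated matrices live in the digital macro,
# so switching adapters only rewrites A and B, never the large matrix W.
y_lora = B @ (A @ x)

y = y_static + y_lora                    # y = (W + B·A) x
```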
A key innovation is the SRAM reprogramming and power gating (SRPG) scheme, which enables pipelined LoRA updates and achieves sub-linear power scaling by intelligently overlapping reconfiguration with computation and selectively gating idle resources. This reduces reconfiguration overhead, maximizes resource utilization, and significantly boosts both performance and energy efficiency. Spatial mapping and dataflow orchestration were meticulously optimized to further minimize communication overhead, ensuring efficient data movement between PEs and across the IPCN.
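A simple latency model illustrates why hiding reconfiguration behind computation pays off. The numbers below are made-up placeholders, not measurements from the paper; the point is only that pipelined reprogramming removes the per-layer reconfiguration term from the critical path.

```python
# Toy latency model for pipelined LoRA reprogramming (illustrative only).
# Each layer needs its LoRA matrices written into SRAM-DCIM (t_reprog) and
# then a compute phase (t_compute). The values are placeholders.
t_reprog, t_compute, n_layers = 1.0, 4.0, 40

# Sequential baseline: reprogram, then compute, layer by layer.
t_sequential = n_layers * (t_reprog + t_compute)

# Pipelined (SRPG idea): while layer i computes, layer i+1's LoRA weights are
# written into an idle SRAM bank, so reprogramming hides behind compute.
t_pipelined = t_reprog + n_layers * max(t_reprog, t_compute)

print(t_sequential, t_pipelined)   # 200.0 vs 161.0 in this toy example
```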
Additionally, the researchers implemented a sophisticated mapping strategy that co-locates weight matrices with their corresponding intermediate data on the PE crossbar arrays, minimizing data transfer distances and maximizing computational throughput. PRIMAL is configured with a 64-bit data width, a 1 GHz clock, a 32×32 IPCN, 1024 PEs per compute tile, and 256×256 RRAM-ACIM arrays paired with 256×64 SRAM-DCIM arrays. This work paves the way for deploying LLMs on edge devices and in data centers with significantly reduced power consumption and enhanced performance.
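For reference, the reported configuration can be collected in one place. The per-tile cell counts below follow directly from those numbers; the assumption of one RRAM/SRAM macro pair per PE is ours, made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class PrimalConfig:
    # Parameters as reported in the article; field names are our own.
    data_width_bits: int = 64
    clock_hz: float = 1e9
    ipcn_dim: tuple = (32, 32)          # 2D-mesh IPCN per compute tile
    pes_per_tile: int = 1024            # = 32 * 32
    rram_acim_shape: tuple = (256, 256)
    sram_dcim_shape: tuple = (256, 64)

cfg = PrimalConfig()

# Derived per-tile storage, assuming one RRAM and one SRAM macro per PE
# (an assumption; the real macro-to-PE ratio may differ).
rram_cells = cfg.pes_per_tile * cfg.rram_acim_shape[0] * cfg.rram_acim_shape[1]
sram_cells = cfg.pes_per_tile * cfg.sram_dcim_shape[0] * cfg.sram_dcim_shape[1]
print(f"{rram_cells/1e6:.1f}M RRAM cells, {sram_cells/1e6:.1f}M SRAM cells per tile")
```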
PRIMAL PIM architecture significantly accelerates LoRA LLM inference
PRIMAL accelerator boosts LLM inference speed
Scientists have developed PRIMAL, a processing-in-memory (PIM) accelerator tailored for large language model (LLM) inference using low-rank adaptation (LoRA). The system integrates heterogeneous PIM processing elements (PEs) interconnected through a 2D-mesh inter-PE computational network (IPCN), enabling efficient computation and high throughput. A novel SRAM reprogramming and power gating (SRPG) scheme allows pipelined LoRA updates and sub-linear power scaling by overlapping reconfiguration with computation and gating idle resources, maximizing energy efficiency.
Experiments demonstrate PRIMAL’s impressive performance. For Llama-13B with 1024×1024 input/output matrices and LoRA rank 8 (Q,V), it achieves a throughput of 966.32 tokens per second and an average energy efficiency of 433.33 tokens/J. Scaling to larger 2048×2048 matrices reduces throughput to 565.46 tokens/s while maintaining strong energy efficiency at 253.57 tokens/J, highlighting the system’s scalability. PRIMAL’s optimized spatial mapping and dataflow orchestration minimize communication overhead, contributing to rapid processing. For Llama 3.2-1B, the Time-To-First-Token (TTFT) is 0.370 s (1024×1024) and 1.192 s (2048×2048), while the Inter-token Latency (ITL) is 1.708 ms, demonstrating fast decoding and low-latency inference. The cyclic placement strategy ensures balanced scratchpad usage, sustaining throughput even for long-context sequences.
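A quick sanity check on these figures (our own arithmetic, not reported by the authors): dividing throughput by energy efficiency gives the implied average power, which lands at roughly the same operating point for both matrix sizes, and the inter-token latency fixes the single-stream decode rate.

```python
# Back-of-the-envelope checks derived from the figures quoted above.
throughput_1024 = 966.32      # tokens/s, Llama-13B, 1024x1024
efficiency_1024 = 433.33      # tokens/J
throughput_2048 = 565.46      # tokens/s, 2048x2048
efficiency_2048 = 253.57      # tokens/J

# Average power = throughput / energy efficiency; both configurations
# imply roughly the same ~2.2 W operating point.
print(throughput_1024 / efficiency_1024)   # ~2.23 W
print(throughput_2048 / efficiency_2048)   # ~2.23 W

# For Llama 3.2-1B, the 1.708 ms inter-token latency implies a
# single-stream decode rate of about 1 / 0.001708 ≈ 585 tokens/s.
print(1 / 1.708e-3)
```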
Compared to the Nvidia H100, PRIMAL achieves a 1.5× throughput improvement and a 25× energy efficiency gain (9.85 tokens/J) for LoRA rank 8 on Llama-13B (2048×2048, batch 1). Hardware-software co-verification shows that the RRAM-ACIM macro consumes 120 mW and occupies 0.1442 mm², while the SRAM-DCIM macro consumes 950 mW and occupies 0.035 mm². The total power across all hardware macros is 1215 mW, with a total area of 0.2212 mm² per chiplet, demonstrating a compact and highly efficient accelerator design. PRIMAL establishes a promising pathway for scalable, energy-efficient, and high-throughput LLM inference.
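Subtracting the two itemized macros from the per-chiplet totals shows how much of the budget is left for everything else, such as the IPCN and peripheral logic (our own arithmetic on the reported figures):

```python
# Per-chiplet macro budget implied by the reported figures.
rram_power_mw, rram_area_mm2 = 120.0, 0.1442   # RRAM-ACIM macro
sram_power_mw, sram_area_mm2 = 950.0, 0.0350   # SRAM-DCIM macro
total_power_mw, total_area_mm2 = 1215.0, 0.2212

# The remainder is attributed to the macros not itemized in the article,
# e.g. interconnect, buffers and control logic.
other_power_mw = total_power_mw - rram_power_mw - sram_power_mw   # 145 mW
other_area_mm2 = total_area_mm2 - rram_area_mm2 - sram_area_mm2   # ~0.042 mm^2
print(other_power_mw, round(other_area_mm2, 4))
```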
PRIMAL accelerator boosts LLM inference efficiency significantly
Scientists have developed PRIMAL, a processing-in-memory (PIM) accelerator designed to run large language models (LLMs) efficiently using low-rank adaptation (LoRA). The architecture integrates heterogeneous PIM processing elements connected via a 2D-mesh inter-PE computational network, enabling faster and more energy-efficient inference. Key performance gains were achieved through a novel SRAM reprogramming and power gating (SRPG) scheme, which supports pipelined LoRA updates while optimising resource allocation to reduce power consumption. Evaluations show that PRIMAL delivers a 1.5× throughput improvement and a 25× increase in energy efficiency compared to the Nvidia H100 when processing Llama-13B with LoRA rank 8 (Q,V).
The design is highly scalable, with SRPG enabling support for even larger LLMs with minimal additional power requirements. However, the authors note that the RRAM-ACIM macro currently dominates the chip area due to the complexity of integrated analog operations and the inclusion of digital-to-analog and analog-to-digital converters. Future research will likely focus on optimising this macro to reduce its footprint and further enhance system efficiency. Overall, PRIMAL establishes a promising pathway for sustainable, scalable, and high-performance LLM inference, addressing the growing energy demands of modern large-scale language models.
👉 More information
🗞 PRIMAL: Processing-In-Memory Based Low-Rank Adaptation for LLM Inference Accelerator
🧠 ArXiv: https://arxiv.org/abs/2601.13628
