Scientists are addressing the immense computational demands of large language models (LLMs) with a novel processing-in-memory (PIM) accelerator called PRIMAL. Yue Jiet Chong, Yimin Wang, and Zhen Wu from the National University of Singapore, together with Xuanyao Fong and colleagues, designed PRIMAL to improve both the speed and energy efficiency of LLM inference with low-rank adaptation (LoRA). The system combines heterogeneous PIM processing elements (PEs) with an SRAM reprogramming and power gating (SRPG) scheme, enabling pipelined LoRA updates while sharply reducing power consumption. By optimizing dataflow and minimizing communication overhead, PRIMAL outperforms the Nvidia H100 GPU on the Llama-13B model with LoRA rank 8, offering a promising pathway toward more sustainable and accessible AI applications.
PRIMAL achieves a 1.5× increase in throughput and a 25× improvement in energy efficiency compared to the Nvidia H100 when processing LoRA rank 8 (Q, V) on Llama-13B. The architecture features heterogeneous PEs interconnected via a 2D-mesh inter-PE computational network (IPCN), enabling efficient parallel processing and reduced data movement. This design represents a significant step forward in hardware acceleration for LLMs, addressing the growing need for fast, energy-efficient AI computation.
At the core of PRIMAL is a chiplet-based architecture composed of multiple compute tiles (CTs). Each CT houses the 2D-mesh IPCN, which dynamically orchestrates dataflow and executes data multiply-accumulate (DMAC) operations critical for attention score computations. These are complemented by static weight multiply-accumulate (SMAC) operations performed by the PEs, which leverage both non-volatile RRAM analog compute-in-memory (RRAM-ACIM) and volatile SRAM digital compute-in-memory (SRAM-DCIM) macros. The RRAM-ACIM stores large pre-trained weights, while the SRAM-DCIM efficiently handles smaller, frequently updated LoRA matrices, creating a synergistic compute system.
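To make the division of labor concrete, the sketch below shows how a LoRA-augmented projection y = (W + BA)x splits across the two macro types: the large pre-trained weight stays put, while the small adapter matrices are the only parts that ever need rewriting. This is an illustrative NumPy model, not the paper's kernel; all names and shapes are our own.

```python
import numpy as np

# Minimal sketch of the heterogeneous compute split described above.
# Names and shapes are illustrative, not taken from the paper.
d_model, rank = 1024, 8          # hidden size and LoRA rank (rank 8 for Q, V)

W = np.random.randn(d_model, d_model)   # pre-trained weight, held in RRAM-ACIM
A = np.random.randn(rank, d_model)      # LoRA down-projection, held in SRAM-DCIM
B = np.random.randn(d_model, rank)      # LoRA up-projection, held in SRAM-DCIM
x = np.random.randn(d_model)            # input activation

# SMAC path: static weights stay in the non-volatile analog macro.
y_static = W @ x

# LoRA path: the small, frequently updated matrices live in the digital macro,
# so switching adapters only rewrites A and B, never the large matrix W.
y_lora = B @ (A @ x)

y = y_static + y_lora                    # y = (W + B·A) x
```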
A key innovation is the SRAM reprogramming and power gating (SRPG) scheme, which enables pipelined LoRA updates and achieves sub-linear power scaling by intelligently overlapping reconfiguration with computation and selectively gating idle resources. This reduces reconfiguration overhead, maximizes resource utilization, and significantly boosts both performance and energy efficiency. Spatial mapping and dataflow orchestration were meticulously optimized to further minimize communication overhead, ensuring efficient data movement between PEs and across the IPCN.
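A simple latency model illustrates why hiding reconfiguration behind computation pays off. The numbers below are made-up placeholders, not measurements from the paper; the point is only that pipelined reprogramming removes the per-layer reconfiguration term from the critical path.

```python
# Toy latency model for pipelined LoRA reprogramming (illustrative only).
# Each layer needs its LoRA matrices written into SRAM-DCIM (t_reprog) and
# then a compute phase (t_compute). The values are placeholders.
t_reprog, t_compute, n_layers = 1.0, 4.0, 40

# Sequential baseline: reprogram, then compute, layer by layer.
t_sequential = n_layers * (t_reprog + t_compute)

# Pipelined (SRPG idea): while layer i computes, layer i+1's LoRA weights are
# written into an idle SRAM bank, so reprogramming hides behind compute.
t_pipelined = t_reprog + n_layers * max(t_reprog, t_compute)

print(t_sequential, t_pipelined)   # 200.0 vs 161.0 in this toy example
```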
Additionally, the researchers implemented a sophisticated mapping strategy that co-locates weight matrices with their corresponding intermediate data on the PE crossbar arrays, minimizing data transfer distances and maximizing computational throughput. PRIMAL is configured with a 64-bit data width, a 1 GHz clock, a 32×32 IPCN, 1024 PEs per compute tile, and 256×256 RRAM-ACIM arrays paired with 256×64 SRAM-DCIM arrays. This work paves the way for deploying LLMs on edge devices and in data centers with significantly reduced power consumption and enhanced performance.
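For reference, the reported configuration can be collected in one place. The per-tile cell counts below follow directly from those numbers; the assumption of one RRAM/SRAM macro pair per PE is ours, made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class PrimalConfig:
    # Parameters as reported in the article; field names are our own.
    data_width_bits: int = 64
    clock_hz: float = 1e9
    ipcn_dim: tuple = (32, 32)          # 2D-mesh IPCN per compute tile
    pes_per_tile: int = 1024            # = 32 * 32
    rram_acim_shape: tuple = (256, 256)
    sram_dcim_shape: tuple = (256, 64)

cfg = PrimalConfig()

# Derived per-tile storage, assuming one RRAM and one SRAM macro per PE
# (an assumption; the real macro-to-PE ratio may differ).
rram_cells = cfg.pes_per_tile * cfg.rram_acim_shape[0] * cfg.rram_acim_shape[1]
sram_cells = cfg.pes_per_tile * cfg.sram_dcim_shape[0] * cfg.sram_dcim_shape[1]
print(f"{rram_cells/1e6:.1f}M RRAM cells, {sram_cells/1e6:.1f}M SRAM cells per tile")
```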
PRIMAL PIM architecture significantly accelerates LoRA LLM inference
PRIMAL accelerator boosts LLM inference speed
Scientists have developed PRIMAL, a processing-in-memory (PIM) accelerator tailored for large language model (LLM) inference using low-rank adaptation (LoRA). The system integrates heterogeneous PIM processing elements (PEs) interconnected through a 2D-mesh inter-PE computational network (IPCN), enabling efficient computation and high throughput. A novel SRAM reprogramming and power gating (SRPG) scheme allows pipelined LoRA updates and sub-linear power scaling by overlapping reconfiguration with computation and gating idle resources, maximizing energy efficiency.
Experiments demonstrate PRIMAL’s impressive performance. For Llama-13B with 1024×1024 input/output matrices and LoRA rank 8 (Q,V), it achieves a throughput of 966.32 tokens per second and an average energy efficiency of 433.33 tokens/J. Scaling to larger 2048×2048 matrices reduces throughput to 565.46 tokens/s while maintaining strong energy efficiency at 253.57 tokens/J, highlighting the system’s scalability. PRIMAL’s optimized spatial mapping and dataflow orchestration minimize communication overhead, contributing to rapid processing. For Llama 3.2-1B, the Time-To-First-Token (TTFT) is 0.370 s (1024×1024) and 1.192 s (2048×2048), while the Inter-token Latency (ITL) is 1.708 ms, demonstrating fast decoding and low-latency inference. The cyclic placement strategy ensures balanced scratchpad usage, sustaining throughput even for long-context sequences.
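A quick sanity check on these figures (our own arithmetic, not reported by the authors): dividing throughput by energy efficiency gives the implied average power, which lands at roughly the same operating point for both matrix sizes, and the inter-token latency fixes the single-stream decode rate.

```python
# Back-of-the-envelope checks derived from the figures quoted above.
throughput_1024 = 966.32      # tokens/s, Llama-13B, 1024x1024
efficiency_1024 = 433.33      # tokens/J
throughput_2048 = 565.46      # tokens/s, 2048x2048
efficiency_2048 = 253.57      # tokens/J

# Average power = throughput / energy efficiency; both configurations
# imply roughly the same ~2.2 W operating point.
print(throughput_1024 / efficiency_1024)   # ~2.23 W
print(throughput_2048 / efficiency_2048)   # ~2.23 W

# For Llama 3.2-1B, the 1.708 ms inter-token latency implies a
# single-stream decode rate of about 1 / 0.001708 ≈ 585 tokens/s.
print(1 / 1.708e-3)
```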
Compared to the Nvidia H100, PRIMAL achieves a 1.5× throughput improvement and a 25× energy efficiency gain (9.85 tokens/J) for LoRA rank 8 on Llama-13B (2048×2048, batch 1). Hardware-software co-verification shows that the RRAM-ACIM macro consumes 120 mW and occupies 0.1442 mm², while the SRAM-DCIM macro consumes 950 mW and occupies 0.035 mm². The total power across all hardware macros is 1215 mW, with a total area of 0.2212 mm² per chiplet, demonstrating a compact and highly efficient accelerator design. PRIMAL establishes a promising pathway for scalable, energy-efficient, and high-throughput LLM inference.
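Subtracting the two itemized macros from the per-chiplet totals shows how much of the budget is left for everything else, such as the IPCN and peripheral logic (our own arithmetic on the reported figures):

```python
# Per-chiplet macro budget implied by the reported figures.
rram_power_mw, rram_area_mm2 = 120.0, 0.1442   # RRAM-ACIM macro
sram_power_mw, sram_area_mm2 = 950.0, 0.0350   # SRAM-DCIM macro
total_power_mw, total_area_mm2 = 1215.0, 0.2212

# The remainder is attributed to the macros not itemized in the article,
# e.g. interconnect, buffers and control logic.
other_power_mw = total_power_mw - rram_power_mw - sram_power_mw   # 145 mW
other_area_mm2 = total_area_mm2 - rram_area_mm2 - sram_area_mm2   # ~0.042 mm^2
print(other_power_mw, round(other_area_mm2, 4))
```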
PRIMAL accelerator boosts LLM inference efficiency significantly
Scientists have developed PRIMAL, a processing-in-memory (PIM) accelerator designed to run large language models (LLMs) efficiently using low-rank adaptation (LoRA). The architecture integrates heterogeneous PIM processing elements connected via a 2D-mesh inter-PE computational network, enabling faster and more energy-efficient inference. Key performance gains were achieved through a novel SRAM reprogramming and power gating (SRPG) scheme, which supports pipelined LoRA updates while optimising resource allocation to reduce power consumption. Evaluations show that PRIMAL delivers a 1.5× throughput improvement and a 25× increase in energy efficiency compared to the Nvidia H100 when processing Llama-13B with LoRA rank 8 (Q,V).
The design is highly scalable, with SRPG enabling support for even larger LLMs with minimal additional power requirements. However, the authors note that the RRAM-ACIM macro currently dominates the chip area due to the complexity of integrated analog operations and the inclusion of digital-to-analog and analog-to-digital converters. Future research will likely focus on optimising this macro to reduce its footprint and further enhance system efficiency. Overall, PRIMAL establishes a promising pathway for sustainable, scalable, and high-performance LLM inference, addressing the growing energy demands of modern large-scale language models.
👉 More information
🗞 PRIMAL: Processing-In-Memory Based Low-Rank Adaptation for LLM Inference Accelerator
🧠 ArXiv: https://arxiv.org/abs/2601.13628
