Researchers are tackling the significant energy demands of large language model (LLM) inference by exploring spiking neural networks (SNNs). Zhanglu Yan, Kaiwen Tang, and Zixuan Zhu of the National University of Singapore and the Shanghai Advanced Research Institute, together with colleagues, present Matterhorn, a novel spiking transformer architecture designed to address limitations in current SNN energy evaluations. Their work moves beyond simply counting operations to account for the substantial energy costs of data movement in real-world hardware. Matterhorn integrates a masked time-to-first-spike (M-TTFS) encoding method and a memristive synapse unit to minimise spike communication and weight access, achieving a 2.31× improvement in energy efficiency and a new state-of-the-art result on the GLUE benchmark with a 1.42% accuracy increase over existing SNNs.
The research addresses a critical gap in current spiking neural network (SNN) evaluations, which often overlook the substantial energy costs associated with data movement, accounting for up to 80% of total energy consumption.
Matterhorn integrates a masked time-to-first-spike (M-TTFS) encoding method and a memristive synapse unit (MSU) to tackle these challenges. M-TTFS strategically reassigns the zero-energy silent state to the most frequent membrane potential, aligning the coding scheme with data distribution and minimising spike movement energy without compromising information.
The team achieved a reduction in spike movement by employing a masking strategy that inhibits spikes at the most frequent firing time, denoted I_max. This innovative approach reassigns the silent state to represent the most common activation values, optimising energy usage. Furthermore, a ‘dead zone’ strategy is proposed to maximise sparsity by mapping values within a defined range to the silent state, effectively reducing both spike rates and computational demands.
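To make the encoding concrete, below is a minimal sketch of how an M-TTFS-style remapping with a dead zone might look, assuming membrane potentials have already been quantised to integer TTFS levels in [0, T−1]; the function and variable names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def m_ttfs_encode(levels: np.ndarray, T: int, k: int = 0) -> np.ndarray:
    """Illustrative M-TTFS-style encoder (a sketch, not the paper's code).

    `levels` holds membrane potentials already quantised to integer TTFS
    levels in [0, T-1]. Standard TTFS emits one spike per value at time
    step `level`; here the most frequent level (I_max) -- and, with a dead
    zone of radius k, its neighbours -- are mapped to the all-zero silent
    train, so the commonest values cost no spike movement.
    """
    counts = np.bincount(levels.ravel(), minlength=T)
    i_max = int(counts.argmax())                 # most frequent firing time
    spikes = np.zeros(levels.shape + (T,), dtype=np.uint8)
    silent = np.abs(levels - i_max) <= k         # dead zone around I_max
    active = ~silent
    # one spike at the original firing time for everything outside the dead zone
    spikes[active, levels[active]] = 1
    return spikes

# toy usage: level 3 dominates, so most positions become silent
x = np.array([3, 3, 3, 1, 7, 3, 3, 5])
trains = m_ttfs_encode(x, T=8, k=0)
print(int(trains.sum()), "spikes instead of", len(x))   # 3 spikes instead of 8
```

Because the decoder knows that the silent train stands for I_max, the remapping itself loses no information; only the optional dead zone (k > 0) trades a small amount of precision for extra sparsity.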
At the hardware level, the MSU utilises compute-in-memory (CIM) technology, performing analog integration directly within memory and eliminating weight access costs. Experiments conducted on the GLUE benchmark demonstrate Matterhorn’s state-of-the-art performance, surpassing existing SNNs by 1.42% in average accuracy.
Crucially, the research delivers a 2.31× improvement in energy efficiency, highlighting the effectiveness of the combined M-TTFS encoding and MSU implementation. Unlike conventional methods that focus solely on reducing multiply-and-accumulate operations, Matterhorn directly addresses the dominant energy drain of data movement, paving the way for truly energy-efficient SNNs.
The study reveals that traditional TTFS encoding assigns the silent state to rare outliers, wasting energy on infrequent occurrences. Matterhorn’s M-TTFS method, however, optimises the spike order for hardware efficiency without sacrificing information, reducing spike rates from 4.07% to 2.77% on the SST-2 benchmark.
By silencing the most frequent activation region with a dead zone strategy, the research further reduces spike movement energy and computational overhead, achieving an overall spike rate of 1.65% and a 2.46× improvement in inter-core spike movement energy compared to traditional TTFS methods. This work opens new avenues for developing sustainable and scalable LLM inference solutions.
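As a sanity check on these figures: if inter-core spike movement energy scales roughly linearly with spike rate (an assumption, not a claim from the paper), dropping from a 4.07% to a 1.65% spike rate implies a reduction of about 4.07 / 1.65 ≈ 2.47, closely matching the reported 2.46× improvement.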
Masked time-to-first-spike encoding and memristive synapse unit design for energy-efficient spiking neural networks offer promising results
Scientists developed Matterhorn, a spiking transformer designed to address the limitations of current energy evaluations for spiking neural networks (SNNs). The research team recognised that existing evaluations focus on accumulate operations, neglecting substantial hardware costs like data movement, which can account for up to 80% of total energy consumption.
To overcome this, they engineered a novel masked time-to-first-spike (M-TTFS) encoding method and a memristive synapse unit (MSU) to minimise energy expenditure. The M-TTFS encoding method employs a masking strategy that reassigns the zero-energy silent state, a spike train consisting entirely of zeros, to the most frequent membrane potential, rather than the lowest.
This innovative approach aligns the coding scheme with the underlying data distribution, reducing spike movement energy without compromising information integrity. Furthermore, the study pioneered a ‘dead zone’ strategy, maximising sparsity by mapping all values within a defined range to the silent state.
At the hardware level, researchers implemented the MSU, utilising compute-in-memory (CIM) technology to perform analog integration directly within the memory itself. This configuration effectively eliminates weight access costs, a significant contributor to energy consumption in traditional SNNs. Experiments employed the GLUE benchmark to assess Matterhorn’s performance, demonstrating a 1.42% improvement in average accuracy compared to existing SNNs.
Crucially, Matterhorn delivered a 2.31× improvement in energy efficiency, validating the effectiveness of the proposed encoding and hardware innovations. The team measured the energy breakdown of state-of-the-art spiking transformers on a commercial 22 nm process, revealing that accumulate (ACC) computing energy accounts for only 12–20% of the total energy.
Conversely, spike transfers constituted 42–55% and weight access 27–32%, highlighting the critical need to address data movement energy. This work demonstrates that by temporally expanding data into sparse binary spike trains, and optimising the encoding scheme, SNNs can achieve substantial energy savings.
Matterhorn achieves efficient spiking neural network inference via masked encoding and compute-in-memory architectures
Scientists achieved a new state-of-the-art performance in spiking neural network (SNN) inference with Matterhorn, a spiking transformer designed for energy efficiency. The research addresses a critical gap in SNN energy evaluations, which traditionally focus solely on accumulate operations and neglect the substantial costs associated with data movement, often comprising up to 80% of total energy consumption.
Matterhorn integrates a masked time-to-first-spike (M-TTFS) encoding method and a memristive synapse unit (MSU) to mitigate these issues. M-TTFS employs a masking strategy that assigns the zero-energy silent state to the most frequent membrane potential, aligning the coding scheme with data distribution and minimising spike movement energy without compromising information.
The team further implemented a ‘dead zone’ strategy, maximising sparsity by mapping values within a defined range to the silent state. At the hardware level, the MSU utilises compute-in-memory (CIM) to perform analog integration directly within memory, eliminating weight access overhead. Experiments revealed that Matterhorn surpasses existing SNNs by 1.42% in average accuracy on the GLUE benchmark, while simultaneously delivering a 2.31× improvement in energy efficiency.
The MSU processes M-TTFS encoded spikes through vector-matrix multiplication (VMM) in three stages, decoding input spikes into log T-bit integers to maximise throughput. The raw synaptic result R is computed as R = γ · 2·R_CIM − Σᵢ aᵢ, where R_CIM is the raw output of the crossbar accumulation, Σᵢ aᵢ is the sum of the input activations, and γ is a trainable scaling parameter.
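The sketch below illustrates one plausible reading of this three-stage flow, under the assumption that signed binary weights in {−1, +1} are stored as binary conductances G = (W + 1)/2, which yields exactly the quoted correction formula; the function names and the bipolar-mapping interpretation are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def decode_spikes(spike_trains: np.ndarray) -> np.ndarray:
    """Stage 1 (illustrative): turn T-step spike trains back into integer
    activations a_i. A spike at step t encodes level t; the all-zero
    silent train decodes to 0 here for simplicity."""
    T = spike_trains.shape[-1]
    return (spike_trains * np.arange(T)).sum(axis=-1)

def msu_vmm(spike_trains: np.ndarray, W_signed: np.ndarray, gamma: float = 1.0):
    """Stages 2-3 (illustrative): crossbar accumulation plus digital correction.

    Assumes weights W in {-1, +1} are mapped to binary conductances
    G = (W + 1) / 2 in {0, 1}, so that
        W @ a = 2 * (G @ a) - sum(a)
    which matches the quoted R = gamma * 2 * R_CIM - sum(a_i) when gamma = 1.
    """
    a = decode_spikes(spike_trains)      # integer activations, log T bits each
    G = (W_signed + 1) // 2              # what the memristive crossbar would store
    R_CIM = G @ a                        # analog in-memory accumulation
    return gamma * 2 * R_CIM - a.sum()   # digital correction stage

# toy usage: 4 inputs, 2 outputs, T = 8 time steps
rng = np.random.default_rng(0)
trains = np.eye(8, dtype=np.uint8)[rng.integers(0, 8, size=4)]  # one-hot spike trains
W = rng.choice([-1, 1], size=(2, 4))
print(msu_vmm(trains, W))                # equals W @ decode_spikes(trains) for gamma = 1
```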
Data shows that the digital energy component, E_digital, accounts for spike data movement, spike generation, and static leakage, calculated as E_digital = B·S·C_o·[C_i·T·(s_r·E_spike_move + E_leak) + T·(E_CMP + E_Read_th)]. The analog component, E_analog, encompasses the mixed-signal multiply-accumulate operations and digital mapping overhead, quantified as E_analog = B·S·[T·C_i·E_sum + C_o·log T·(C_i·E_avg_MAC + E_ACC) + E_map].
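To show how these expressions evaluate, here is a small plug-in calculator that follows the formulas as reconstructed above; the symbol interpretation (B batch size, S sequence length, C_i/C_o input/output channels, T time steps, s_r spike rate) and every per-event energy value below are placeholders chosen for illustration, not parameters or measurements reported in the paper.

```python
import math

def matterhorn_energy(B, S, C_i, C_o, T, s_r,
                      E_spike_move, E_leak, E_cmp, E_read_th,
                      E_sum, E_avg_mac, E_acc, E_map):
    """Evaluate the digital and analog energy terms as reconstructed from
    the text; all arguments are placeholders, not reported values."""
    # E_digital = B*S*C_o * [ C_i*T*(s_r*E_spike_move + E_leak) + T*(E_CMP + E_Read_th) ]
    e_digital = B * S * C_o * (C_i * T * (s_r * E_spike_move + E_leak)
                               + T * (E_cmp + E_read_th))
    # E_analog = B*S * [ T*C_i*E_sum + C_o*log2(T)*(C_i*E_avg_MAC + E_ACC) + E_map ]
    # (log T is read here as log base 2, matching the log T-bit decoded integers)
    e_analog = B * S * (T * C_i * E_sum
                        + C_o * math.log2(T) * (C_i * E_avg_mac + E_acc)
                        + E_map)
    return e_digital, e_analog

# toy single-layer example; per-event energies in picojoules, purely illustrative
e_dig, e_ana = matterhorn_energy(B=1, S=128, C_i=768, C_o=768, T=8, s_r=0.0165,
                                 E_spike_move=1.0, E_leak=0.01, E_cmp=0.05,
                                 E_read_th=0.05, E_sum=0.1, E_avg_mac=0.2,
                                 E_acc=0.5, E_map=100.0)
print(f"digital ≈ {e_dig / 1e6:.2f} µJ, analog ≈ {e_ana / 1e6:.2f} µJ")
```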
On the GLUE benchmark, Matterhorn achieved an average accuracy of 85.87% with a dead zone radius of k = 0, a 2.65% improvement over previous 1-bit baselines. Specifically, on the RTE task, Matterhorn (k = 0) reached 72.56%, exceeding Spiking Otters by over 3.6% and approaching the performance of full-precision BERT.
Masked time-to-first-spike encoding and compute-in-memory integration for efficient language processing offer promising results
Scientists have developed Matterhorn, a novel spiking transformer designed to improve the energy efficiency of large language model inference. The research addresses a key limitation of current spiking neural network evaluations, which often overlook the substantial energy costs associated with data movement.
Matterhorn integrates a masked time-to-first-spike (M-TTFS) encoding method and a memristive synapse unit (MSU) to minimise spike transmission and eliminate weight access overhead, respectively. The M-TTFS method strategically remaps the most frequent membrane potential to the zero-energy silent state, aligning the coding scheme with data distribution and reducing energy consumption.
Furthermore, a ‘dead zone’ strategy enhances sparsity by mapping values within a specific range to the silent state. At the hardware level, the MSU employs compute-in-memory (CIM) to perform analog integration directly within the memory, thereby removing the energy demands of weight access. On the GLUE benchmark, Matterhorn achieved an average accuracy of 84.64%, exceeding existing spiking transformers by 1.42% and delivering a 2.31-fold improvement in energy efficiency, consuming only 6.14 mJ with the MSU.
The significance of these findings lies in demonstrating a pathway towards more energy-efficient artificial intelligence systems. By reducing both spike movement and weight access costs, Matterhorn offers a promising solution for deploying large language models on resource-constrained devices. The model’s compact footprint, fitting within a 120 mm² area, suggests physical viability for chip implementation.
The authors acknowledge that their energy evaluations are based on specific hardware assumptions and may vary with different implementations. Future research could explore the application of Matterhorn to other natural language processing tasks and investigate its performance on diverse hardware platforms.
👉 More information
🗞 Matterhorn: Efficient Analog Sparse Spiking Transformer Architecture with Masked Time-To-First-Spike Encoding
🧠 ArXiv: https://arxiv.org/abs/2601.22876
