Revolutionising Language Models: MatMul-Free Method Achieves High Performance with 61% Less Memory Usage

Researchers from the University of California, Santa Cruz, Soochow University, University of California, Davis, and LuxiTech have developed a scalable language model that eliminates the need for matrix multiplication (MatMul), a computationally expensive operation. The team’s MatMul-free models maintain strong performance at billion-parameter scales, reducing memory usage by up to 61% during training and by more than a factor of 10 during inference. The researchers also built a custom hardware solution on a Field-Programmable Gate Array (FPGA) that processes billion-parameter-scale models at 13 W, moving language models closer to brain-like efficiency.

Introduction to MatMul-free Language Modeling

Matrix multiplication (MatMul) is a fundamental operation in most neural networks, including large language models (LLMs). However, it is also a significant contributor to the computational cost of these models, particularly as they scale to larger embedding dimensions and context lengths. In a recent study, researchers from the University of California, Santa Cruz, Soochow University, University of California, Davis, and LuxiTech have demonstrated that it is possible to eliminate MatMul operations from LLMs while maintaining strong performance at billion-parameter scales.
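To make that cost concrete, here is a rough, illustrative calculation (not taken from the paper; the dimensions are hypothetical) of the multiply-accumulate operations needed by a single Transformer-scale dense layer:

    # Rough illustration: the cost of one dense layer y = x @ W grows with the
    # embedding dimension and sequence length, which is why MatMul dominates
    # LLM compute. All sizes below are hypothetical.
    import numpy as np

    seq_len, d_in, d_out = 2048, 4096, 4096   # hypothetical Transformer-scale sizes
    x = np.random.randn(seq_len, d_in).astype(np.float32)
    W = np.random.randn(d_in, d_out).astype(np.float32)

    y = x @ W                          # one MatMul
    macs = seq_len * d_in * d_out      # multiply-accumulate operations
    print(f"{macs / 1e9:.1f} GMACs for a single dense layer")

At these sizes a single layer already costs tens of billions of multiply-accumulates for one 2048-token sequence, and a full model stacks dozens of such layers.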

The MatMul-free Approach

The researchers developed a scalable MatMul-free language model (MatMul-free LM) by using additive operations in dense layers and element-wise Hadamard products for self-attention-like functions. Specifically, they used ternary weights to eliminate MatMul in dense layers, similar to binary and ternary neural networks (BNNs and TNNs). To remove MatMul from self-attention, they optimized a Gated Recurrent Unit (GRU) to rely solely on element-wise products. The resulting model competes with state-of-the-art Transformers while eliminating all MatMul operations.
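The PyTorch sketch below, written for this article rather than taken from the authors' released code, illustrates the two ingredients: a ternary-weight layer whose "MatMul" reduces to signed additions, and a GRU-like state update built only from element-wise (Hadamard) products. All names and shapes are illustrative.

    import torch

    def ternary_linear(x, W_ternary):
        # W_ternary holds only values in {-1, 0, +1}, so x @ W_ternary needs no
        # true multiplications: each output is a signed sum of selected inputs.
        # Written as a matmul here for brevity; a fused kernel would implement
        # it with additions and subtractions only.
        return x @ W_ternary

    def elementwise_gated_step(h_prev, x_t, W_f, W_c):
        # GRU-like update that mixes state with Hadamard products only,
        # avoiding any hidden-to-hidden MatMul.
        f_t = torch.sigmoid(ternary_linear(x_t, W_f))   # forget/update gate
        c_t = torch.tanh(ternary_linear(x_t, W_c))      # candidate state
        return f_t * h_prev + (1.0 - f_t) * c_t

    # Toy usage with random ternary weights
    d = 8
    W_f = torch.randint(-1, 2, (d, d)).float()
    W_c = torch.randint(-1, 2, (d, d)).float()
    h = torch.zeros(1, d)
    for _ in range(4):
        h = elementwise_gated_step(h, torch.randn(1, d), W_f, W_c)
    print(h)

In the real model the ternary weights are learned during training; the random weights here only show why the forward pass no longer requires MatMul.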

Hardware Efficiency and Performance

To quantify the hardware benefits of lightweight models, the researchers provided an optimized GPU implementation in addition to a custom FPGA accelerator. By using fused kernels in the GPU implementation of the ternary dense layers, training was accelerated by 25.6% and memory consumption was reduced by up to 61.0% relative to an unoptimized baseline on GPU. Furthermore, by employing lower-bit optimized CUDA kernels, inference speed was increased by 4.57 times and memory usage was reduced by a factor of 10 when the model was scaled up to 13B parameters.
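A weights-only back-of-envelope estimate (an assumption for illustration, not a figure from the paper) shows where much of the saving comes from: a ternary weight can be packed into 2 bits, versus 16 bits for an FP16 weight.

    # Weights-only storage estimate; activations and kernel details, which the
    # measured ~10x figure also reflects, are ignored here.
    n_params = 13e9                      # 13B-parameter scale, as above
    fp16_gb = n_params * 2 / 1e9         # 2 bytes per FP16 weight
    ternary_gb = n_params * 0.25 / 1e9   # 2 bits per ternary weight, packed
    print(f"FP16 weights:   {fp16_gb:.1f} GB")
    print(f"Ternary packed: {ternary_gb:.2f} GB (~{fp16_gb / ternary_gb:.0f}x smaller)")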

FPGA Implementation and Results

The researchers deployed the RTL implementation of the MatMul-free token generation core on a D5005 Stratix 10 programmable acceleration card (PAC) in the Intel FPGA Devcloud. The core completed a forward pass of a block in 43 ms at d = 512 and achieved a clock rate of 60 MHz. The single-core implementation exhibited extremely low dynamic power, hardly distinguishable from the power measured while the core was inactive.
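As a rough sanity check on throughput (illustrative arithmetic only, assuming the 43 ms figure refers to one complete forward pass of a block):

    # 43 ms per forward pass at d = 512 implies roughly 23 passes per second
    # on the single-core implementation.
    forward_pass_s = 0.043
    print(f"~{1 / forward_pass_s:.0f} forward passes per second")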

Conclusion and Future Directions

The study demonstrated the feasibility and effectiveness of the first scalable MatMul-free language model. This work challenges the paradigm that MatMul operations are indispensable for building high-performing language models and paves the way for more efficient, hardware-friendly architectures. One limitation is that the MatMul-free LM has not been tested on extremely large-scale models (e.g., 100B+ parameters) due to computational constraints. The authors therefore frame the work as a call to action for institutions and organizations with the resources to build the largest language models to also invest in accelerating lightweight models.
