Researchers from the University of California, Santa Cruz, Soochow University, University of California, Davis, and LuxiTech have developed a scalable language model that eliminates the need for matrix multiplication (MatMul), a computationally expensive operation. The team's MatMul-free models maintain strong performance at billion-parameter scales, reducing memory usage by up to 61% during training and by more than 10x during inference. The researchers also built a custom hardware solution on a field-programmable gate array (FPGA) that processes billion-parameter-scale models at 13 W, moving language models closer to brain-like efficiency.
Introduction to MatMul-free Language Modeling
Matrix multiplication (MatMul) is a fundamental operation in most neural networks, including large language models (LLMs). However, it is also a significant contributor to the computational cost of these models, particularly as they scale to larger embedding dimensions and context lengths. In a recent study, researchers from the University of California, Santa Cruz, Soochow University, University of California, Davis, and LuxiTech have demonstrated that it is possible to eliminate MatMul operations from LLMs while maintaining strong performance at billion-parameter scales.
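To make the scaling argument concrete, the rough, assumption-level sketch below shows where the multiply-accumulate cost comes from: a d x d dense layer costs on the order of d^2 operations per token, and standard self-attention grows roughly with the square of the context length. The dimensions used here are hypothetical and are not figures from the study.

```python
# Rough cost illustration (assumption-level sketch, not figures from the paper):
# a d x d dense layer costs ~d^2 multiply-accumulates (MACs) per token,
# and standard self-attention costs ~2 * n^2 * d MACs per layer for context length n.

def dense_macs_per_token(d: int) -> int:
    return d * d

def attention_macs_per_layer(n: int, d: int) -> int:
    return 2 * n * n * d  # QK^T plus attention-weighted V

if __name__ == "__main__":
    d, n = 4096, 8192  # hypothetical embedding dimension and context length
    print(f"dense layer: {dense_macs_per_token(d):,} MACs per token")
    print(f"attention:   {attention_macs_per_layer(n, d):,} MACs per layer")
```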
The MatMul-free Approach
The researchers developed a scalable MatMul-free language model (MatMul-free LM) by using additive operations in dense layers and element-wise Hadamard products for self-attention-like functions. Specifically, they used ternary weights to eliminate MatMul in dense layers, similar to binary and ternary neural networks (BNNs and TNNs). To remove MatMul from self-attention, they optimized the Gated Recurrent Unit (GRU) to rely solely on element-wise products, as sketched below. The resulting model competes with state-of-the-art Transformers while eliminating all MatMul operations.
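The following minimal NumPy sketch illustrates the two ideas in spirit: a ternary-weight dense layer whose "matrix product" reduces to signed accumulation, and a gated recurrent update that uses only element-wise products. The quantization threshold, gating form, and shapes are assumptions chosen for illustration; the paper's actual layers and hardware-oriented kernels differ in detail.

```python
import numpy as np

def ternarize(w: np.ndarray, threshold_scale: float = 0.7) -> np.ndarray:
    """Quantize weights to {-1, 0, +1} using a mean-magnitude threshold (an assumed scheme)."""
    thr = threshold_scale * np.abs(w).mean()
    return np.sign(w) * (np.abs(w) > thr)

def ternary_linear(x: np.ndarray, w_ternary: np.ndarray) -> np.ndarray:
    """'MatMul' against ternary weights: each output element is a signed sum of inputs,
    so the operation needs only additions and subtractions, not true multiplications."""
    pos = x @ (w_ternary == 1).astype(x.dtype)   # accumulate inputs where the weight is +1
    neg = x @ (w_ternary == -1).astype(x.dtype)  # accumulate inputs where the weight is -1
    return pos - neg

def elementwise_gated_step(h_prev: np.ndarray, f_gate: np.ndarray, cand: np.ndarray) -> np.ndarray:
    """One recurrent update built entirely from Hadamard products, standing in for the
    self-attention-like mechanism; the paper's exact gated-recurrent formulation differs."""
    return f_gate * h_prev + (1.0 - f_gate) * cand

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 512))          # one token's hidden state (hypothetical size)
    w = rng.normal(size=(512, 512))        # full-precision weights before quantization
    y = ternary_linear(x, ternarize(w))    # dense layer output via signed accumulation
    h = elementwise_gated_step(np.zeros((1, 512)), rng.uniform(size=(1, 512)), y)
    print(y.shape, h.shape)
```

In this sketch the only place a weight matrix appears is inside the ternary layer, where every entry is -1, 0, or +1, so the usual multiply-accumulate collapses to add/subtract; the recurrence itself never forms a matrix product.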
Hardware Efficiency and Performance
To quantify the hardware benefits of lightweight models, the researchers provided an optimized GPU implementation in addition to a custom FPGA accelerator. By using fused kernels in the GPU implementation of the ternary dense layers, training was accelerated by 25.6%, and memory consumption was reduced by up to 61.0% over an unoptimized baseline on GPU. Furthermore, by employing lower-bit optimized CUDA kernels, inference speed was increased by 4.57 times, and memory usage was reduced by a factor of 10 when the model was scaled up to 13B parameters.
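As a rough, illustrative check on the inference-memory claim, ternary weights packed at 2 bits each occupy a small fraction of the space of conventional 16-bit weights. The packing scheme below is an assumption, and the reported factor also reflects activations, state, and kernel details, so this is only back-of-the-envelope arithmetic.

```python
# Back-of-the-envelope weight-memory arithmetic (illustrative assumptions only).

PARAMS = 13e9           # 13B parameters, as in the scaled-up model
FP16_BITS = 16          # conventional half-precision weights
TERNARY_BITS = 2        # ternary values packed into 2 bits each (assumed packing)

fp16_gb = PARAMS * FP16_BITS / 8 / 1e9
ternary_gb = PARAMS * TERNARY_BITS / 8 / 1e9
print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"2-bit ternary: ~{ternary_gb:.1f} GB  (~{fp16_gb / ternary_gb:.0f}x smaller)")
```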
FPGA Implementation and Results
The researchers deployed the RTL implementation of the MatMul-free token generation core on a D5005 Stratix 10 programmable acceleration card (PAC) in the Intel FPGA Devcloud. The core completed a forward pass of a block in 43 ms at d = 512 and achieved a clock rate of 60 MHz. The single-core implementation exhibited extremely low dynamic power that was hardly distinguishable from the power measured while the core was inactive.
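For context, the reported latency and clock rate together imply roughly 2.6 million clock cycles per block forward pass. The short calculation below simply restates those two published numbers and adds no new measurement.

```python
# Relating the reported latency and clock rate (no new measurements).

latency_s = 43e-3   # forward pass of one block at d = 512 (reported)
clock_hz = 60e6     # achieved clock rate (reported)
cycles = latency_s * clock_hz
print(f"~{cycles / 1e6:.2f} million cycles per block forward pass")
```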
Conclusion and Future Directions
The study demonstrated the feasibility and effectiveness of the first scalable MatMul-free language model. This work challenges the paradigm that MatMul operations are indispensable for building high-performing language models and paves the way for the development of more efficient and hardware-friendly architectures. However, one limitation of the work is that the MatMul-free LM has not been tested on extremely large-scale models (e.g., 100B+ parameters) due to computational constraints. This work serves as a call to action for institutions and organizations that have the resources to build the largest language models to invest in accelerating lightweight models.
