Matrix multiplication, a fundamental operation in numerous scientific and machine learning applications, receives a significant boost from new research led by Songqiao Su, Xiaofei Sun, and Xiaoya Li, alongside Albert Wang, Jiwei Li, and Chris Shum. The team, working with deep-reinforce.com, presents CUDA-L2, a system that leverages the power of large language models and reinforcement learning to automatically optimise the performance of matrix multiplication kernels. This innovative approach systematically surpasses existing state-of-the-art libraries, including cuBLAS and cuBLASLt, achieving speed improvements of up to 28.7% in realistic server conditions. By intelligently exploring a vast configuration space, CUDA-L2 demonstrates that even heavily optimised kernels can benefit from LLM-guided automation, paving the way for further performance gains in critical computational tasks.
The performance of matrix multiplication kernels relies heavily on transformations that vary across different GPU architectures, making comprehensive manual tuning difficult at scale. This work introduces CUDA-L2, a system that combines large language models and reinforcement learning to automatically optimise Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the reinforcement learning reward, CUDA-L2 automatically optimises HGEMM kernels across 1,000 configurations, covering all combinations of matrix dimensions commonly used in the attention and feed-forward network (FFN) layers of widely used models such as Qwen, Llama, and DeepSeek.
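The 1,000 configurations form a grid over matrix dimensions. Purely as an illustration (the dimension values below are placeholders, not the paper's actual shapes), such a grid can be enumerated as a Cartesian product over (M, N, K):

```cpp
// Illustrative sketch only: enumerating an HGEMM configuration grid over
// (M, N, K). The dimension values are hypothetical; the paper's 1,000
// configurations are derived from attention and FFN shapes of models such
// as Qwen, Llama, and DeepSeek.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> Ms = {1, 16, 256, 1024, 4096};     // batch/sequence-side sizes (hypothetical)
    std::vector<int> Ns = {1024, 4096, 11008, 14336};   // output feature sizes (hypothetical)
    std::vector<int> Ks = {1024, 4096, 11008, 14336};   // input feature sizes (hypothetical)

    int count = 0;
    for (int m : Ms)
        for (int n : Ns)
            for (int k : Ks)
                std::printf("config %d: M=%d N=%d K=%d\n", ++count, m, n, k);
    return 0;
}
```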
cuBLAS and cuBLASLt Kernel Performance Comparison
Scientists are continually striving to improve the efficiency of matrix multiplication on NVIDIA GPUs, and this work compares three baseline approaches. The first uses cuBLAS, the standard, general-purpose library for dense linear algebra. The second, cuBLASLt-heuristic, uses cuBLASLt, a lightweight library dedicated to general matrix multiply with a flexible API, and lets its built-in heuristic select an algorithm for each problem size, while cuBLASLt-benchmark (reported below as cuBLASLt-AutoTuning) systematically tests multiple candidate algorithms and empirically picks the fastest one for a given problem size. The approaches differ in how the underlying kernel is selected, with cuBLAS providing the baseline for comparison.
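The article does not reproduce the benchmark code, but a minimal C++ sketch (with placeholder function names and a plain column-major layout) illustrates how the two call paths differ: cuBLAS picks its kernel internally, whereas cuBLASLt exposes algorithm selection that can be driven by its heuristic or, for the benchmark variant, by timing several candidates.

```cpp
// Sketch only: FP16 GEMM C = A * B via cuBLAS versus cuBLASLt-heuristic.
#include <cublas_v2.h>
#include <cublasLt.h>
#include <cuda_fp16.h>

// Baseline 1: cuBLAS chooses its own kernel internally.
void hgemm_cublas(cublasHandle_t handle, int m, int n, int k,
                  const __half* A, const __half* B, __half* C) {
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    // Column-major layout: leading dimensions are m, k, m.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, m, B, k, &beta, C, m);
}

// Baseline 2: cuBLASLt-heuristic asks the library's heuristic for an algorithm,
// then runs it. cuBLASLt-benchmark/AutoTuning would instead request several
// candidate algorithms and keep the empirically fastest one.
void hgemm_cublaslt_heuristic(cublasLtHandle_t lt, int m, int n, int k,
                              const __half* A, const __half* B, __half* C,
                              void* workspace, size_t workspaceSize) {
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    cublasLtMatrixLayout_t aL, bL, cL;
    cublasLtMatrixLayoutCreate(&aL, CUDA_R_16F, m, k, m);
    cublasLtMatrixLayoutCreate(&bL, CUDA_R_16F, k, n, k);
    cublasLtMatrixLayoutCreate(&cL, CUDA_R_16F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                         &workspaceSize, sizeof(workspaceSize));

    // Ask for the single best algorithm according to the heuristic.
    cublasLtMatmulHeuristicResult_t algo;
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, op, aL, bL, cL, cL, pref, 1, &algo, &found);

    if (found > 0) {
        cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta, C, cL, C, cL,
                       &algo.algo, workspace, workspaceSize, /*stream=*/0);
    }

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cL);
    cublasLtMatrixLayoutDestroy(bL);
    cublasLtMatrixLayoutDestroy(aL);
    cublasLtMatmulDescDestroy(op);
}
```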
CUDA-L2 Achieves 22% HGEMM Speedup
Scientists have achieved a significant breakthrough in optimising Half-precision General Matrix Multiply (HGEMM) CUDA kernels using CUDA-L2, a novel system that combines large language models and reinforcement learning. This work systematically improves performance across a vast configuration space, demonstrating substantial gains over existing, highly optimised libraries. The team evaluated CUDA-L2 across 1,000 different matrix dimension configurations, and experiments reveal an average speedup of +22.0% over PyTorch's widely used torch.matmul in offline execution.
When compared against NVIDIA’s cuBLAS library, using its optimal layout configuration, CUDA-L2 achieves a +19.2% performance increase, exceeding cuBLASLt-heuristic by +16.8% and cuBLASLt-AutoTuning by +11.4%. In a server scenario simulating real-time inference, the speedups increase further, reaching +28.7% over torch.matmul, +26.0% over cuBLAS, +22.4% over cuBLASLt-heuristic, and +15.9% over cuBLASLt-AutoTuning. The research demonstrates that LLM-guided reinforcement learning can systematically explore and optimise even the most performance-critical kernels, unlocking improvements beyond manual tuning capabilities.
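The summary does not spell out the measurement protocol, but the offline/server distinction matters because back-to-back launches hide latencies that a real-time serving path cannot. The sketch below (assumed setup, not the paper's harness) shows how the two timing modes typically differ, using CUDA events and a placeholder launcher callback:

```cpp
// Sketch: "offline" timing amortises launch overheads across queued calls,
// while "server"-style timing synchronises after every call, as an inference
// server responding to individual requests would.
#include <cuda_runtime.h>
#include <functional>

// Average milliseconds per call when launches are queued back-to-back.
float time_offline(const std::function<void()>& launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < 10; ++i) launch();        // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();     // launches overlap on the stream
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Average milliseconds per call when every call is synchronised before the
// next one is issued, so launch latency is not hidden.
float time_server_style(const std::function<void()>& launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float total_ms = 0.0f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start);
        launch();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}
```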
Automated CUDA Kernel Optimisation via Reinforcement Learning
The research team developed CUDA-L2, a novel system that combines large language models and reinforcement learning to automatically optimise CUDA kernels for Half-precision General Matrix Multiply, a computationally intensive operation. By systematically exploring a vast configuration space, CUDA-L2 achieves significant performance improvements across diverse settings, consistently outperforming established libraries, including torch.matmul, cuBLAS, and cuBLASLt. These gains were achieved through a multi-stage reinforcement learning process, beginning with general kernel optimisation and progressing to a focus on matrix multiplication, augmented by retrieval-augmented context from diverse CUDA code. The results demonstrate a clear advantage of LLM-guided reinforcement learning in discovering superior implementations that surpass manually optimised kernels, particularly when dealing with complex configuration spaces.
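The precise reward shaping is not described in this summary. As a loose illustration only, a speedup-based reward over a cuBLAS baseline, with invalid kernels penalised, could look like the following (the function name and the penalty constant are hypothetical):

```cpp
// Hypothetical sketch (not the paper's exact formula): turning measured CUDA
// execution times into a scalar reinforcement-learning reward. A candidate that
// fails to compile or produces wrong results gets a fixed negative reward;
// otherwise the reward grows with its speedup over the baseline.
float kernel_reward(bool compiled, bool correct,
                    float baseline_ms, float candidate_ms) {
    if (!compiled || !correct) return -1.0f;   // penalise invalid kernels
    float speedup = baseline_ms / candidate_ms; // > 1.0 means faster than the baseline
    return speedup - 1.0f;                      // e.g. +0.22 for a 22% speedup
}
```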
👉 More information
🗞 CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.02551
