CUDA-L2 Surpasses cuBLAS Performance with Reinforcement Learning, Achieving +22.0% Speedup in Matrix Multiplication

Matrix multiplication, a fundamental operation in numerous scientific and machine learning applications, receives a significant boost from new research led by Songqiao Su, Xiaofei Sun, and Xiaoya Li, alongside Albert Wang, Jiwei Li, and Chris Shum. The team, working with deep-reinforce.com, presents CUDA-L2, a system that leverages the power of large language models and reinforcement learning to automatically optimise the performance of matrix multiplication kernels. This approach systematically surpasses existing state-of-the-art libraries, including cuBLAS and cuBLASLt, achieving speed improvements of up to 28.7% in realistic server conditions. By intelligently exploring a vast configuration space, CUDA-L2 demonstrates that even heavily optimised kernels can benefit from LLM-guided automation, paving the way for further performance gains in critical computational tasks.

The performance of matrix multiplication kernels relies heavily on transformations that vary across different GPU architectures, making comprehensive manual tuning difficult at scale. This work introduces CUDA-L2, a system that combines large language models and reinforcement learning to automatically optimise Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the reinforcement learning reward, CUDA-L2 optimises HGEMM kernels across 1,000 configurations, covering the combinations of matrix dimensions commonly used in the attention and feed-forward network (FFN) layers of widely used models such as Qwen, Llama, and DeepSeek.
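Since the reward is simply measured execution speed, the heart of such a loop is careful kernel timing. Below is a minimal sketch, in PyTorch rather than the authors' actual code, of how a speedup-based reward for a candidate HGEMM kernel might be computed with CUDA events; `candidate_matmul` is a hypothetical callable standing in for a compiled candidate kernel.

```python
import torch

def kernel_reward(candidate_matmul, m: int, n: int, k: int,
                  warmup: int = 10, iters: int = 100) -> float:
    """Illustrative RL reward: speedup of a candidate HGEMM kernel
    over the torch.matmul baseline, timed with CUDA events."""
    a = torch.randn(m, k, device="cuda", dtype=torch.half)
    b = torch.randn(k, n, device="cuda", dtype=torch.half)

    def time_fn(fn) -> float:
        for _ in range(warmup):  # warm up caches and any lazy JIT
            fn(a, b)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(a, b)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # milliseconds per call

    baseline = time_fn(torch.matmul)
    candidate = time_fn(candidate_matmul)
    return baseline / candidate  # >1.0 means the candidate is faster
```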

cuBLAS and cuBLASLt Kernel Performance Comparison

Scientists are continually striving to improve the efficiency of matrix multiplication on NVIDIA GPUs, and this work details a comparison of three baseline approaches. The first uses cuBLAS, a standard, general-purpose library for linear algebra. The second, cuBLASLt-heuristic, uses NVIDIA's cuBLASLt, a lightweight library with a flexible API dedicated to matrix multiplication, and selects an algorithm via its built-in heuristics, while cuBLASLt-benchmark systematically times multiple candidate algorithms to empirically determine the fastest one for a given problem size. Each approach implements the core matrix multiplication routines differently, with cuBLAS providing a baseline for comparison.
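The difference between the heuristic and benchmark modes is essentially select-by-rule versus select-by-measurement. Here is a hedged sketch of the benchmark strategy in PyTorch terms, where the candidates are generic Python callables rather than cuBLASLt algorithm descriptors; it illustrates the selection logic, not the cuBLASLt C API.

```python
import torch

def pick_fastest(candidates, a, b, iters: int = 50):
    """Benchmark-style selection: empirically time each candidate
    matmul implementation and keep the fastest, mirroring the idea
    behind cuBLASLt-benchmark (candidates here are plain callables)."""
    best_fn, best_ms = None, float("inf")
    for fn in candidates:
        fn(a, b)  # single warm-up call
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn(a, b)
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end) / iters
        if ms < best_ms:
            best_fn, best_ms = fn, ms
    return best_fn, best_ms
```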

CUDA-L2 Achieves 22% HGEMM Speedup

Scientists have achieved a significant breakthrough in optimising Half-precision General Matrix Multiply (HGEMM) CUDA kernels using CUDA-L2, a novel system that combines large language models and reinforcement learning. This work systematically improves performance across a vast configuration space, demonstrating substantial gains over existing, highly optimised libraries. The team evaluated CUDA-L2 across 1,000 different matrix dimension configurations, and experiments reveal an average speedup of +22.0% over the widely used torch.matmul in offline execution.

When compared against NVIDIA’s cuBLAS library, using its optimal layout configuration, CUDA-L2 achieves a +19.2% performance increase, and it exceeds cuBLASLt-heuristic by +16.8% and cuBLASLt-AutoTuning by +11.4%. In a server scenario simulating real-time inference, the speedups increase further: +28.7% over torch.matmul, +26.0% over cuBLAS, +22.4% over cuBLASLt-heuristic, and +15.9% over cuBLASLt-AutoTuning. The research demonstrates that LLM-guided reinforcement learning can systematically explore and optimise even the most performance-critical kernels, unlocking improvements beyond manual tuning capabilities.
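Headline numbers such as +22.0% are averages over the whole evaluation grid. Purely as an illustration (the shape list below is hypothetical, not the paper's 1,000 configurations), per-shape speedups can be aggregated with the `kernel_reward` helper sketched earlier:

```python
# Hypothetical evaluation grid of (M, N, K) shapes of the kind found
# in attention and FFN layers; illustrative only.
SHAPES = [(m, n, k)
          for m in (1, 16, 128, 1024)
          for n in (4096, 11008)
          for k in (4096, 11008)]

def average_speedup(candidate_matmul, shapes=SHAPES) -> float:
    """Mean per-shape speedup over torch.matmul, reusing the
    kernel_reward helper defined in the earlier sketch."""
    speedups = [kernel_reward(candidate_matmul, m, n, k)
                for m, n, k in shapes]
    return sum(speedups) / len(speedups)

# A 22% average speedup would appear here as a return value of ~1.22.
```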

Automated CUDA Kernel Optimisation via Reinforcement Learning

The research team developed CUDA-L2, a novel system that combines large language models and reinforcement learning to automatically optimise CUDA kernels for Half-precision General Matrix Multiply, a computationally intensive operation. By systematically exploring a vast configuration space, CUDA-L2 achieves significant performance improvements across diverse settings, consistently outperforming established libraries, including torch.matmul, cuBLAS, and cuBLASLt. These gains were achieved through a multi-stage reinforcement learning process, beginning with general kernel optimisation and progressing to a focus on matrix multiplication, augmented by retrieval-augmented context from diverse CUDA code. The results demonstrate a clear advantage of LLM-guided reinforcement learning in discovering superior implementations that surpass manually optimised kernels, particularly when dealing with complex configuration spaces.
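The article does not reproduce the authors' training code, but the general propose-measure-reward structure of LLM-guided kernel search can be sketched as follows; `llm_propose` and `compile_kernel` are hypothetical stand-ins, and `kernel_reward` is the timing helper from the earlier sketch.

```python
def optimise_kernel(llm_propose, compile_kernel, shapes, rounds: int = 100):
    """Schematic LLM + RL search loop (all names hypothetical).
    The LLM proposes a CUDA kernel variant; we compile and benchmark
    it; the measured speedup becomes the reward guiding the next
    proposal."""
    best_src, best_reward = None, 0.0
    history = []  # (source, reward) pairs fed back to the LLM as context
    for _ in range(rounds):
        src = llm_propose(history)    # candidate CUDA source code
        kernel = compile_kernel(src)  # e.g. an nvcc / torch JIT wrapper
        if kernel is None:            # failed compile: zero reward
            history.append((src, 0.0))
            continue
        rewards = [kernel_reward(kernel, m, n, k) for m, n, k in shapes]
        reward = sum(rewards) / len(rewards)
        history.append((src, reward))
        if reward > best_reward:
            best_src, best_reward = src, reward
    return best_src, best_reward
```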

👉 More information
🗞 CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.02551

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
