Matrix multiplication, a fundamental operation in numerous scientific and machine learning applications, receives a significant boost from new research led by Songqiao Su, Xiaofei Sun, and Xiaoya Li, alongside Albert Wang, Jiwei Li, and Chris Shum. The team, working with deep-reinforce.com, presents CUDA-L2, a system that leverages the power of large language models and reinforcement learning to automatically optimise the performance of matrix multiplication kernels. This innovative approach systematically surpasses existing state-of-the-art libraries, including cuBLAS and cuBLASLt, achieving speed improvements of up to 28.7% in realistic server conditions. By intelligently exploring a vast configuration space, CUDA-L2 demonstrates that even heavily optimised kernels can benefit from LLM-guided automation, paving the way for further performance gains in critical computational tasks.
The performance of matrix multiplication kernels relies heavily on transformations that vary across different GPU architectures, making comprehensive manual tuning difficult at scale. This work introduces CUDA-L2, a system that combines large language models and reinforcement learning to automatically optimise Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the reinforcement learning reward, CUDA-L2 automatically optimises HGEMM kernels across 1,000 configurations, covering all combinations of matrix dimensions commonly used in the attention and feed-forward network (FFN) layers of widely used models such as Qwen, Llama, and DeepSeek.
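The 1,000 configurations form a grid over matrix dimensions. Purely as an illustration (the dimension values below are placeholders, not the paper's actual shapes), such a grid can be enumerated as a Cartesian product over (M, N, K):

```cpp
// Illustrative sketch only: enumerating an HGEMM configuration grid over
// (M, N, K). The dimension values are hypothetical; the paper's 1,000
// configurations are derived from attention and FFN shapes of models such
// as Qwen, Llama, and DeepSeek.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> Ms = {1, 16, 256, 1024, 4096};     // batch/sequence-side sizes (hypothetical)
    std::vector<int> Ns = {1024, 4096, 11008, 14336};   // output feature sizes (hypothetical)
    std::vector<int> Ks = {1024, 4096, 11008, 14336};   // input feature sizes (hypothetical)

    int count = 0;
    for (int m : Ms)
        for (int n : Ns)
            for (int k : Ks)
                std::printf("config %d: M=%d N=%d K=%d\n", ++count, m, n, k);
    return 0;
}
```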
cuBLAS and cuBLASLt Kernel Performance Comparison
Scientists are continually striving to improve the efficiency of matrix multiplication on NVIDIA GPUs, and this work compares three baseline approaches. The first uses cuBLAS, the standard, general-purpose library for dense linear algebra. The second, cuBLASLt-heuristic, uses cuBLASLt, a lightweight library dedicated to general matrix multiply with a flexible API, and lets its built-in heuristic select an algorithm for each problem size, while cuBLASLt-benchmark (reported below as cuBLASLt-AutoTuning) systematically tests multiple candidate algorithms and empirically picks the fastest one for a given problem size. The approaches differ in how the underlying kernel is selected, with cuBLAS providing the baseline for comparison.
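The article does not reproduce the benchmark code, but a minimal C++ sketch (with placeholder function names and a plain column-major layout) illustrates how the two call paths differ: cuBLAS picks its kernel internally, whereas cuBLASLt exposes algorithm selection that can be driven by its heuristic or, for the benchmark variant, by timing several candidates.

```cpp
// Sketch only: FP16 GEMM C = A * B via cuBLAS versus cuBLASLt-heuristic.
#include <cublas_v2.h>
#include <cublasLt.h>
#include <cuda_fp16.h>

// Baseline 1: cuBLAS chooses its own kernel internally.
void hgemm_cublas(cublasHandle_t handle, int m, int n, int k,
                  const __half* A, const __half* B, __half* C) {
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    // Column-major layout: leading dimensions are m, k, m.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, m, B, k, &beta, C, m);
}

// Baseline 2: cuBLASLt-heuristic asks the library's heuristic for an algorithm,
// then runs it. cuBLASLt-benchmark/AutoTuning would instead request several
// candidate algorithms and keep the empirically fastest one.
void hgemm_cublaslt_heuristic(cublasLtHandle_t lt, int m, int n, int k,
                              const __half* A, const __half* B, __half* C,
                              void* workspace, size_t workspaceSize) {
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    cublasLtMatrixLayout_t aL, bL, cL;
    cublasLtMatrixLayoutCreate(&aL, CUDA_R_16F, m, k, m);
    cublasLtMatrixLayoutCreate(&bL, CUDA_R_16F, k, n, k);
    cublasLtMatrixLayoutCreate(&cL, CUDA_R_16F, m, n, m);

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                         &workspaceSize, sizeof(workspaceSize));

    // Ask for the single best algorithm according to the heuristic.
    cublasLtMatmulHeuristicResult_t algo;
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, op, aL, bL, cL, cL, pref, 1, &algo, &found);

    if (found > 0) {
        cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta, C, cL, C, cL,
                       &algo.algo, workspace, workspaceSize, /*stream=*/0);
    }

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cL);
    cublasLtMatrixLayoutDestroy(bL);
    cublasLtMatrixLayoutDestroy(aL);
    cublasLtMatmulDescDestroy(op);
}
```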
CUDA-L2 Achieves 22% HGEMM Speedup
Scientists have achieved a significant breakthrough in optimising Half-precision General Matrix Multiply (HGEMM) CUDA kernels using CUDA-L2, a novel system that combines large language models and reinforcement learning. This work systematically improves performance across a vast configuration space, demonstrating substantial gains over existing, highly optimised libraries. The team evaluated CUDA-L2 across 1,000 different matrix dimension configurations, and experiments reveal an average speedup of +22.0% over PyTorch's widely used torch.matmul in offline execution.
When compared against NVIDIA’s cuBLAS library, using its optimal layout configuration, CUDA-L2 achieves a +19.2% performance increase, exceeding cuBLASLt-heuristic by +16.8% and cuBLASLt-AutoTuning by +11.4%. In a server scenario simulating real-time inference, the speedups increase further, reaching +28.7% over torch.matmul, +26.0% over cuBLAS, +22.4% over cuBLASLt-heuristic, and +15.9% over cuBLASLt-AutoTuning. The research demonstrates that LLM-guided reinforcement learning can systematically explore and optimise even the most performance-critical kernels, unlocking improvements beyond manual tuning capabilities.
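The summary does not spell out the measurement protocol, but the offline/server distinction matters because back-to-back launches hide latencies that a real-time serving path cannot. The sketch below (assumed setup, not the paper's harness) shows how the two timing modes typically differ, using CUDA events and a placeholder launcher callback:

```cpp
// Sketch: "offline" timing amortises launch overheads across queued calls,
// while "server"-style timing synchronises after every call, as an inference
// server responding to individual requests would.
#include <cuda_runtime.h>
#include <functional>

// Average milliseconds per call when launches are queued back-to-back.
float time_offline(const std::function<void()>& launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < 10; ++i) launch();        // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();     // launches overlap on the stream
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Average milliseconds per call when every call is synchronised before the
// next one is issued, so launch latency is not hidden.
float time_server_style(const std::function<void()>& launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float total_ms = 0.0f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start);
        launch();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}
```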
Automated CUDA Kernel Optimisation via Reinforcement Learning
The research team developed CUDA-L2, a novel system that combines large language models and reinforcement learning to automatically optimise CUDA kernels for Half-precision General Matrix Multiply, a computationally intensive operation. By systematically exploring a vast configuration space, CUDA-L2 achieves significant performance improvements across diverse settings, consistently outperforming established libraries, including torch.matmul, cuBLAS, and cuBLASLt. These gains were achieved through a multi-stage reinforcement learning process, beginning with general kernel optimisation and progressing to a focus on matrix multiplication, augmented by retrieval-augmented context from diverse CUDA code. The results demonstrate a clear advantage of LLM-guided reinforcement learning in discovering superior implementations that surpass manually optimised kernels, particularly when dealing with complex configuration spaces.
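The precise reward shaping is not described in this summary. As a loose illustration only, a speedup-based reward over a cuBLAS baseline, with invalid kernels penalised, could look like the following (the function name and the penalty constant are hypothetical):

```cpp
// Hypothetical sketch (not the paper's exact formula): turning measured CUDA
// execution times into a scalar reinforcement-learning reward. A candidate that
// fails to compile or produces wrong results gets a fixed negative reward;
// otherwise the reward grows with its speedup over the baseline.
float kernel_reward(bool compiled, bool correct,
                    float baseline_ms, float candidate_ms) {
    if (!compiled || !correct) return -1.0f;   // penalise invalid kernels
    float speedup = baseline_ms / candidate_ms; // > 1.0 means faster than the baseline
    return speedup - 1.0f;                      // e.g. +0.22 for a 22% speedup
}
```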
👉 More information
🗞 CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
🧠 ArXiv: https://arxiv.org/abs/2512.02551
