The increasing demand for GPU computing necessitates automated strategies for optimising CUDA code, but current approaches often fall short of delivering substantial improvements. Xiaoya Li, Xiaofei Sun, and Albert Wang, from the DeepReinforce Team, alongside Jiwei Li and Chris Shum, present CUDA-L1, a new framework that tackles this challenge using contrastive reinforcement learning. The system achieves remarkable performance gains, delivering an average speedup of 17.7x across a comprehensive suite of CUDA kernels. Importantly, CUDA-L1 not only accelerates existing code but also discovers novel optimisation techniques, and it transfers well across NVIDIA GPU architectures, including the H100, RTX 3090, and H800. Together, these results suggest a pathway towards significantly improved GPU efficiency and reduced strain on computing resources.
Existing automated approaches built on models such as DeepSeek-R1 and OpenAI-o1 often struggle to significantly improve the speed of CUDA programs. These systems typically rely on heuristic search or supervised learning, which prove inadequate for the complex optimisation landscape of modern GPU kernels. This research introduces CUDA-L1, an automated reinforcement learning framework designed to optimise CUDA code. At the heart of CUDA-L1 lies contrastive reinforcement learning, a novel training scheme that drives optimisation through comparative analysis. Unlike traditional reinforcement learning setups that evaluate each code change in isolation, contrastive reinforcement learning presents the model with multiple CUDA variants alongside their measured performance, so it learns by distinguishing effective optimisation strategies from ineffective ones. This comparative approach, inspired by how humans learn, lets the agent converge on strong solutions more rapidly by focusing on relative improvements rather than absolute performance metrics. The results demonstrate that CUDA-L1 achieves substantial gains on challenging CUDA benchmarks, consistently outperforming existing automated optimisation tools and approaching the performance of hand-tuned kernels. This improvement stems from the model's ability to navigate the vast search space of possible CUDA optimisations, identifying subtle yet impactful changes that would be difficult even for a human expert to discover.
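To make the relative-improvement signal concrete, here is a minimal sketch of how a speedup-based reward might be measured for competing kernel variants. The harness, function names, and the use of PyTorch for timing are illustrative assumptions, not the paper's actual evaluation code:

```python
import time

import torch

def measure_runtime(kernel_fn, *args, warmup=10, iters=100):
    """Mean wall-clock time per call for a GPU workload (illustrative harness)."""
    for _ in range(warmup):      # warm up caches and JIT before timing
        kernel_fn(*args)
    torch.cuda.synchronize()     # ensure all queued GPU work has finished
    start = time.perf_counter()
    for _ in range(iters):
        kernel_fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def relative_speedup_reward(t_reference, t_candidate):
    """Reward from relative improvement: positive iff the candidate is faster."""
    return t_reference / t_candidate - 1.0

# Toy usage: two variants of the same reduction, rewarded by relative speedup.
x = torch.randn(4096, 4096, device="cuda")
t_ref = measure_runtime(lambda a: a.sum(dim=0), x)
t_new = measure_runtime(lambda a: a.t().sum(dim=1), x)
print(f"reward = {relative_speedup_reward(t_ref, t_new):+.4f}")
```

Because the reward is a ratio against a reference variant rather than an absolute runtime, the same signal remains meaningful across kernels of very different sizes.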
Contrastive Reinforcement Learning Methodology
The core of CUDA-L1's optimisation process is its contrastive reinforcement learning framework. Traditional reinforcement learning algorithms, such as Q-learning or policy gradients, assign rewards based on the absolute performance of a given CUDA kernel variant. This can be inefficient in high-dimensional search spaces, because the learner struggles to differentiate minor improvements from significant gains. Contrastive reinforcement learning instead focuses on relative performance. The agent is presented with pairs of CUDA variants: one generated by the current policy and another sampled randomly or from a previous iteration. The agent learns to predict which variant performs better, effectively learning a preference function. This preference function then guides the policy update, encouraging the agent to generate variants that are consistently preferred over others. The reward signal is therefore derived from the difference in performance between the two variants, rather than from the absolute performance of either. This significantly accelerates learning and improves the stability of the optimisation process.
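As a rough sketch of the preference-learning step described above, the snippet below trains a scorer with a Bradley-Terry-style logistic loss on variant pairs ordered by measured runtime. The embedding inputs, network shape, and loss formulation are illustrative assumptions; the paper's actual preference function may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceModel(nn.Module):
    """Scores an encoded CUDA variant; a higher score predicts a faster kernel."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, variant_embedding):
        return self.scorer(variant_embedding).squeeze(-1)

def pairwise_contrastive_loss(model, emb_faster, emb_slower):
    """Bradley-Terry-style logistic loss: the variant measured to be faster
    should receive the higher score, i.e. the margin should be positive."""
    margin = model(emb_faster) - model(emb_slower)
    return -F.logsigmoid(margin).mean()

# Toy usage: random embeddings stand in for encoded CUDA kernel variants.
model = PreferenceModel()
emb_faster = torch.randn(8, 256)  # batch of variants measured as faster
emb_slower = torch.randn(8, 256)  # their slower counterparts
loss = pairwise_contrastive_loss(model, emb_faster, emb_slower)
loss.backward()  # gradients would drive the preference/policy update
```

The policy reward can then be tied to the score margin (or directly to the measured runtime difference), which keeps the learning signal relative rather than absolute, exactly as the framework intends.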
Implications and Future Directions
The development of CUDA-L1 has significant implications for high-performance computing. Automated optimisation of CUDA code is crucial for maximising the performance of applications running on GPUs, particularly in deep learning, scientific simulation, and data analytics. By automating this process, CUDA-L1 reduces the need for manual tuning by expert programmers, saving time and resources. Furthermore, the contrastive reinforcement learning methodology is applicable to a wide range of optimisation problems beyond CUDA code. Future research will focus on extending CUDA-L1 to a broader range of CUDA features and optimisations, and on exploring more sophisticated reinforcement learning algorithms. Another promising direction is integration with existing compiler frameworks, so that CUDA code is optimised automatically as part of the compilation step, fitting seamlessly into the software development workflow and improving the performance of GPU-accelerated applications with no extra effort from developers.
More information
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
DOI: https://doi.org/10.48550/arXiv.2507.14111
