Researchers present a novel framework that uses large language models to translate CUDA code to other platforms, addressing compatibility challenges created by the dominance of Nvidia's GPU computing ecosystem. The approach, supported by a purpose-built training dataset and a new evaluation benchmark, demonstrably improves performance on CUDA-to-CPU transpilation tasks.
The demand for computational power in deep learning continues to escalate, placing significant strain on hardware and necessitating efficient methods for porting software across diverse architectures. The CUDA ecosystem, built around Nvidia GPUs, currently dominates parallel computing, but achieving performance portability to alternative platforms remains a considerable challenge. Researchers are now exploring the application of large language models (LLMs) to automate the translation of CUDA code, a process known as transpilation, but the efficacy of these models is limited by the availability of suitable training data. A team comprising Xufeng He, Yanchen Liu, and Xu Dai from the Shanghai Artificial Intelligence Laboratory, alongside Jiaqi Lv of Tongji University and Yang Hu and Shouyi Yin from Tsinghua University, present a novel approach to address this issue in their paper, “HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration”. They detail the creation of a dataset generated using AI compiler technology and automatic optimisation, alongside a benchmark, HPCTransEval, designed to assess LLM performance in CUDA transpilation tasks.
Large language models are actively reshaping software engineering, notably through the automation of code generation and performance enhancement across diverse computing platforms. Researchers increasingly employ Transformer-based models such as DeepSeek-Coder and CodeGeeX for tasks including code completion, comprehension and, crucially, translation between different computing environments. Established benchmarks, such as HumanEval and CodeEval, quantitatively assess the capabilities of these models, driving innovation and establishing performance baselines within the field.
The CUDA ecosystem, Nvidia's parallel computing platform and programming model, presents a significant challenge: its dominance necessitates solutions for performance portability to alternative hardware architectures. Existing approaches to code translation, which often rely on language extensions or domain-specific languages, tend to be limited in scope and incur substantial development costs, hindering broader adoption and innovation. The presented framework addresses this limitation by training large language models on paired CUDA and platform-specific code, effectively bridging the compatibility gap and opening new possibilities for heterogeneous computing.
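To make the idea of paired training data concrete, consider a toy pair of the kind such a dataset contains. The kernel below is purely illustrative, not drawn from the paper's dataset: a CUDA vector addition alongside a functionally equivalent CPU translation.

```cuda
#include <cstddef>

// GPU side: one thread per element (illustrative kernel, not from the paper's dataset).
__global__ void vector_add(const float* a, const float* b, float* c, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// CPU side: the thread grid collapses into a sequential loop.
// A training pair maps the kernel above to a translation like this one.
void vector_add_cpu(const float* a, const float* b, float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}
```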
This framework distinguishes itself through the tight integration of AI compiler technology and automatic optimisation, enabling the generation of high-performance code tailored to specific target platforms. Its key innovation combines graph-based data augmentation with AI compiler technology: the augmentation step expands the training dataset by generating synthetic code examples, improving the model's ability to generalise to unseen code patterns, while the AI compiler translates each example and optimises it for the target architecture.
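The paper's augmentation operates on computation graphs of operators; the sketch below, in which every name and the tiny operator set are hypothetical, conveys the general idea by randomly substituting elementwise operators and emitting a new CUDA kernel variant for each choice.

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Hypothetical sketch of graph-based augmentation: treat a kernel body as a
// one-node "graph" of elementwise ops, randomly substitute the op, and emit
// source for each variant. The real pipeline works on richer operator graphs
// and uses an AI compiler to lower and optimise each generated example.
int main() {
    std::vector<std::string> ops = {"a[i] + b[i]", "a[i] * b[i]",
                                    "fmaxf(a[i], b[i])"};
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, ops.size() - 1);

    for (int variant = 0; variant < 3; ++variant) {
        std::string body = ops[pick(rng)];  // mutate the operator node
        std::cout << "__global__ void k" << variant
                  << "(const float* a, const float* b, float* c, size_t n) {\n"
                  << "    size_t i = blockIdx.x * blockDim.x + threadIdx.x;\n"
                  << "    if (i < n) c[i] = " << body << ";\n}\n\n";
    }
}
```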
Experimental results demonstrate a significant performance improvement when using the proposed framework, suggesting that large language models, properly trained and augmented with AI compiler techniques, have considerable potential for addressing compatibility challenges within the CUDA ecosystem and beyond. The framework improves the quality of translated code, moving beyond mere syntactic correctness towards genuine performance optimisation: the generated code not only functions correctly but also executes efficiently on the target platform.
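To see what "beyond syntactic correctness" means in practice, compare a literal, serial port of a GPU loop with a performance-oriented one. The OpenMP version below is a hypothetical example of the kind of transformation an optimising translation can apply; it is not taken from the paper.

```cpp
#include <cstddef>

// Syntactically correct but serial: a literal port of a GPU elementwise loop.
void scale_naive(const float* in, float* out, size_t n, float s) {
    for (size_t i = 0; i < n; ++i) out[i] = in[i] * s;
}

// Performance-oriented port (hypothetical optimisation): the same loop
// parallelised across CPU cores with OpenMP; compile with -fopenmp.
void scale_optimised(const float* in, float* out, size_t n, float s) {
    #pragma omp parallel for
    for (long long i = 0; i < (long long)n; ++i) out[i] = in[i] * s;
}
```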
Future work should concentrate on expanding the framework's scope to a wider range of target platforms and hardware architectures. Applying reinforcement learning to refine the code generation process and optimise for specific performance metrics is a promising avenue, potentially yielding even more efficient, tailored code. Research into more sophisticated data augmentation strategies, potentially incorporating formal verification techniques, could further enhance the robustness and reliability of the generated code.
Expanding HPCTransEval to cover a more diverse set of CUDA kernels and performance characteristics will be crucial for establishing a comprehensive, nuanced evaluation tool for LLM-based code transpilation. Applying the framework to automate the porting of legacy CUDA code to emerging hardware platforms, such as those with novel memory architectures or processing paradigms, could unlock significant opportunities, enabling the reuse of existing codebases and reducing development costs.
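Any such benchmark has to gate on correctness before measuring speed. The harness below is a hypothetical sketch of that two-stage evaluation; it does not reflect HPCTransEval's actual interface.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical harness (not HPCTransEval's API): check a transpiled CPU
// kernel against a reference for correctness, then time a single run.
using Kernel = void (*)(const float*, float*, size_t);

bool evaluate(Kernel reference, Kernel candidate, size_t n) {
    std::vector<float> in(n), ref(n), out(n);
    for (size_t i = 0; i < n; ++i) in[i] = float(i % 97) * 0.5f;

    reference(in.data(), ref.data(), n);
    candidate(in.data(), out.data(), n);
    for (size_t i = 0; i < n; ++i)                      // correctness gate
        if (std::fabs(ref[i] - out[i]) > 1e-5f) return false;

    auto t0 = std::chrono::steady_clock::now();         // then performance
    candidate(in.data(), out.data(), n);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("candidate time: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return true;
}

// Toy reference kernel used to exercise the harness.
void square(const float* a, float* b, size_t n) {
    for (size_t i = 0; i < n; ++i) b[i] = a[i] * a[i];
}

int main() {
    // A real benchmark would plug the LLM-transpiled kernel in as candidate.
    return evaluate(square, square, 1 << 20) ? 0 : 1;
}
```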
👉 More information
🗞 HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
🧠 DOI: https://doi.org/10.48550/arXiv.2506.10401
