On April 18, 2025, a team led by Vicki Carrica and Maxwell Onyango published Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM, detailing an efficient Julia-based approach to triangular matrix operations on NVIDIA, AMD, and Apple Silicon GPUs.
The paper presents a recursive Julia implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) for GPUs, restructured so that the bulk of the work is expressed as general matrix-matrix multiplication (GEMM) and thus makes better use of the GPU memory hierarchy.
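To make the restructuring concrete, here is a minimal sketch of the recursive idea in plain Julia, written for CPU arrays and not taken from the paper's code: a lower-triangular solve L·X = B is split into 2×2 blocks so that the off-diagonal update becomes a GEMM, and only small diagonal blocks fall back to a standard triangular kernel. The function name, the cutoff parameter, and the blocking choices are illustrative assumptions.

```julia
# Minimal sketch (not the authors' code): solve L * X = B in place for a
# lower-triangular L by splitting into 2x2 blocks, so the bulk of the work
# becomes a GEMM update; small diagonal blocks use the standard solve.
using LinearAlgebra

function rec_trsm!(L::AbstractMatrix, B::AbstractMatrix; cutoff::Int = 64)
    n = size(L, 1)
    if n <= cutoff
        ldiv!(LowerTriangular(L), B)          # base case: plain triangular solve
        return B
    end
    k = n ÷ 2
    L11 = view(L, 1:k,   1:k)
    L21 = view(L, k+1:n, 1:k)
    L22 = view(L, k+1:n, k+1:n)
    B1  = view(B, 1:k,   :)
    B2  = view(B, k+1:n, :)
    rec_trsm!(L11, B1; cutoff)                            # X1 = L11 \ B1
    mul!(B2, L21, B1, -one(eltype(B)), one(eltype(B)))    # B2 -= L21 * X1  (the GEMM step)
    rec_trsm!(L22, B2; cutoff)                            # X2 = L22 \ B2
    return B
end
```

Because the recursion bottoms out quickly and the off-diagonal updates dominate, most of the arithmetic lands in GEMM, which is the operation GPUs and their vendor BLAS libraries optimize most aggressively.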
Using Julia’s multiple dispatch, metaprogramming, and frameworks like GPUArrays and KernelAbstractions, the authors developed a hardware-agnostic API supporting NVIDIA, AMD, and Apple Silicon GPUs. For large matrices, the implementation achieves throughput comparable to vendor libraries like cuBLAS and rocBLAS while providing TRMM/TRSM routines for Apple Silicon for the first time. The concise codebase demonstrates Julia’s ability to deliver near-vendor performance across heterogeneous architectures.
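The portability claim rests on Julia's dispatch: generic code written against AbstractMatrix picks up backend-specific kernels when handed GPU array types. The snippet below is a hedged illustration of that mechanism, not the paper's API; it assumes CUDA.jl and an NVIDIA GPU are available, and AMDGPU.jl's ROCArray or Metal.jl's MtlArray would be substituted the same way on other hardware.

```julia
# Hedged illustration (assumes CUDA.jl and an NVIDIA GPU): the same generic
# calls used in the recursive sketch above dispatch to vendor kernels when
# the arguments are GPU arrays, so the high-level algorithm needs no changes.
using CUDA, LinearAlgebra

n = 1024
L = CuArray(Matrix(LowerTriangular(rand(Float32, n, n)) + n * I))
B = CUDA.rand(Float32, n, 32)
X = copy(B)

ldiv!(LowerTriangular(L), X)    # dispatches to a cuBLAS triangular solve
C = CUDA.zeros(Float32, n, 32)
mul!(C, L, X)                   # dispatches to a cuBLAS GEMM
@assert Array(C) ≈ Array(B)     # L * (L \ B) ≈ B, checked on the host
```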
NVIDIA remains at the forefront of GPU technology, with broad impact on artificial intelligence, scientific research, and high-performance computing (HPC). Its innovations are aimed at improving efficiency, scalability, and adaptability across diverse applications.
👉 More information
🗞 Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
🧠 DOI: https://doi.org/10.48550/arXiv.2504.13821
