High-Performance Recursive TRMM/TRSM Implementation in Julia for GPUs Across Architectures

On April 18, 2025, a team led by Vicki Carrica and Maxwell Onyango published "Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM," which details an efficient Julia-based approach to triangular matrix operations on NVIDIA, AMD, and Apple Silicon GPUs.

The paper presents a recursive Julia implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) for GPUs. Both operations are restructured to rely on general matrix-matrix multiplication (GEMM), which improves utilization of the GPU memory hierarchy.
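To make the recursive idea concrete, here is a minimal sketch of a recursive lower-triangular solve (A X = B) in plain Python. It is not the authors' Julia code and says nothing about their GPU kernels; the function names (`matmul`, `forward_substitution`, `recursive_trsm`) and the `base` cutoff are hypothetical, chosen only for illustration. The sketch shows how splitting the triangular matrix into blocks turns the off-diagonal work into an ordinary matrix product, the GEMM-shaped step that the paper routes through optimized GEMM.

```python
def matmul(A, B):
    """Plain GEMM stand-in: returns A @ B for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def forward_substitution(A, B):
    """Base case: solve A X = B directly for a small lower-triangular A."""
    n, m = len(A), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            partial = sum(A[i][k] * X[k][j] for k in range(i))
            X[i][j] = (B[i][j] - partial) / A[i][i]
    return X


def recursive_trsm(A, B, base=2):
    """Solve A X = B for lower-triangular A by recursive block splitting.

    With A = [[A11, 0], [A21, A22]] and B = [B1; B2]:
      1. solve A11 X1 = B1          (smaller triangular solve)
      2. B2 <- B2 - A21 @ X1        (GEMM-shaped update)
      3. solve A22 X2 = B2          (smaller triangular solve)
    `base` is an illustrative cutoff below which we stop recursing.
    """
    n = len(A)
    if n <= base:
        return forward_substitution(A, B)
    h = n // 2
    A11 = [row[:h] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]
    A22 = [row[h:] for row in A[h:]]
    X1 = recursive_trsm(A11, B[:h], base)
    update = matmul(A21, X1)
    B2 = [[b - u for b, u in zip(b_row, u_row)]
          for b_row, u_row in zip(B[h:], update)]
    X2 = recursive_trsm(A22, B2, base)
    return X1 + X2  # stack the two row blocks of the solution
```

Calling `recursive_trsm(A, B)` with a lower-triangular `A` and a right-hand side `B` of matching height returns the solution as a list of rows. TRMM can be split along the same block boundaries, with the off-diagonal block again contributing an ordinary matrix product.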

Using Julia’s multiple dispatch, metaprogramming, and frameworks like GPUArrays and KernelAbstractions, the authors developed a hardware-agnostic API supporting NVIDIA, AMD, and Apple Silicon GPUs. For large matrices, the implementation achieves throughput comparable to vendor libraries like cuBLAS and rocBLAS while providing TRMM/TRSM routines for Apple Silicon for the first time. The concise codebase demonstrates Julia’s ability to deliver near-vendor performance across heterogeneous architectures.

NVIDIA is at the forefront of advancing GPU technology, significantly impacting fields such as artificial intelligence, scientific research, and high-performance computing (HPC). Their innovations are strategically aimed at enhancing efficiency, scalability, and adaptability across diverse applications.

👉 More information
🗞 Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
🧠 DOI: https://doi.org/10.48550/arXiv.2504.13821

Quantum News
