NVIDIA has extended its CUDA Tile abstraction to the Julia programming language with cuTile.jl, bringing the same tile-based programming model, and near-parity performance with cuTile Python, to Julia GPU computing. Building on the earlier release of cuTile for Python, this expansion allows developers to write high-performance kernels focused on data tiles rather than intricate thread and memory management. With CUDA Tile, developers describe operations on tiles of data, and the compiler handles the mapping to hardware, simplifying the development process. The design of cuTile.jl deliberately maintains consistency with its Python counterpart, ensuring code can be easily ported and lessons learned readily applied, while also embracing Julia’s native syntax for an intuitive experience.
cuTile.jl Simplifies GPU Kernel Development with Tile-Based Programming
Traditional CUDA programming demands meticulous attention to threads, warps, and memory hierarchies, requiring programmers to map algorithms efficiently onto the underlying hardware. The Julia version instead uses primitives such as ct.bid(1) and ct.load that hide index calculations and out-of-bounds checks, presenting a cleaner, more intuitive interface. The design of cuTile.jl prioritizes consistency with its Python counterpart, allowing for easy code portability and knowledge transfer. According to the developers, “cuTile.jl keeps the abstraction level of kernels identical to those written in cuTile Python, making it easy to port code over or learn from the cuTile Python documentation.” The tool also leverages Julia’s inherent strengths, incorporating 1-based indexing and broadcast expressions to create kernels that feel natural to Julia programmers.
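For comparison, a vector-addition kernel written in the traditional CUDA.jl style must compute its own global index and guard against out-of-bounds access by hand. The following is a minimal sketch (array sizes and launch configuration are illustrative):

```julia
using CUDA

# Traditional CUDA.jl kernel: the programmer derives the global
# thread index from block/thread coordinates and bounds-checks it.
function vadd_threads!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

n = 4096
a, b = CUDA.rand(n), CUDA.rand(n)
c = CUDA.zeros(n)
# The launch geometry must also be chosen manually.
@cuda threads=256 blocks=cld(n, 256) vadd_threads!(c, a, b)
```

This is exactly the bookkeeping that cuTile.jl's tile-level operations absorb on the programmer's behalf.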
Performance testing on an NVIDIA GeForce RTX 5080 demonstrates impressive results. Vector addition reached 838 GB/s (99% of the Python implementation), matrix transpose reached 797 GB/s (98% parity), and matrix multiplication hit 50.9 TFLOPS, matching cuTile Python exactly. Batch matrix multiply achieved 43.0 TFLOPS, or 91% parity. Some kernels with more complex control flow, such as layer normalization, do not yet reach full performance parity, as the cuTile.jl compiler is still maturing; these gaps are tracked and actively being addressed. The project is currently experimental and open source, available through JuliaGPU/cuTile.jl, and requires an NVIDIA Ada, Ampere, or Blackwell GPU with CUDA 13.1 or higher.
Julia and Python Performance Parity on NVIDIA GPUs
This extension is not merely a port, but a deliberate effort to achieve performance parity between the two languages while leveraging the unique strengths of Julia. A key aspect of this advancement is a shift in how developers approach GPU kernels. The vector addition example demonstrates this design philosophy, where the tile-based approach dramatically reduces code complexity compared to traditional CUDA.jl programming. The resulting kernels, as demonstrated with the row-normalization example, read remarkably like standard Julia array code, utilizing familiar functions like sum, size, and sqrt alongside Julia’s broadcasting syntax. Performance benchmarks on an NVIDIA GeForce RTX 5080 reveal a compelling story: both memory-bound kernels, such as vector addition, and compute-intensive kernels, such as matrix multiplication, achieve near-identical performance between cuTile.jl and cuTile Python, with matrix multiplication yielding 50.9 TFLOPS in both languages. Batch matrix multiply achieves 43.0 TFLOPS with cuTile.jl. Some kernels with more complex control flow, such as layer normalization or FFT, do not yet reach full performance parity, as the cuTile.jl compiler is still maturing, but these issues are tracked and actively being addressed.
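A row-normalization kernel in this style might look like the following sketch. This is an illustrative reconstruction, not the exact kernel from the announcement: the 2D load shape, the division by an L2 norm, and the argument names are assumptions, modeled on the 1D ct.load/ct.store signatures shown in the vector-addition example.

```julia
import cuTile as ct

# Illustrative sketch: each block loads one row tile, scales it by
# its L2 norm, and stores the result -- using ordinary Julia
# sum, sqrt, and broadcasting syntax inside the kernel.
function row_normalize(x, y, cols)
    row = ct.bid(1)                    # 1-based block (tile) index
    tile = ct.load(x, row, (1, cols))  # assumed 2D tile shape
    norm = sqrt(sum(tile .^ 2))
    ct.store(y, row, tile ./ norm)
    return
end
```

The point is less the specific kernel than the register: aside from the ct.load/ct.store boundary, the body is indistinguishable from CPU-side Julia array code.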
The closer cuTile.jl kernels are to ordinary Julia, the easier it is to share and reuse code between the CPU and GPU.
cuTile.jl Compiler Architecture and Tile IR Integration
Researchers at JuliaGPU are extending high-performance GPU programming paradigms with the release of cuTile.jl, a compiler designed to integrate seamlessly with the established CUDA Tile abstraction. This advancement builds upon the earlier cuTile implementation for Python, offering Julia developers a new pathway to harness the power of specialized hardware like tensor cores without the complexities of traditional CUDA programming. Unlike conventional CUDA workflows that demand meticulous management of threads, warps, and memory hierarchies, cuTile.jl abstracts these details, enabling programmers to focus on data-level operations.
A simple vector addition, for example, moves from explicit thread management in CUDA.jl to a tile-centric approach:

```julia
import cuTile as ct

function vadd(a, b, c, tile_size)
    pid = ct.bid(1)
    tile_a = ct.load(a, pid, (tile_size,))
    tile_b = ct.load(b, pid, (tile_size,))
    ct.store(c, pid, tile_a + tile_b)
    return
end
```

This shift is deliberate; cuTile.jl aims for consistency with its Python counterpart, making code porting and knowledge transfer easier. The design also leverages Julia’s strengths, including 1-based indexing and broadcasting for element-wise operations, resulting in kernels that closely resemble standard Julia array code. The compiler operates by intercepting standard Julia library calls and routing them to Tile IR operations, ultimately generating Tile IR bytecode, the same format used by cuTile Python, before final compilation to GPU machine code.
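Launching such a kernel then amounts to choosing a tile size and a grid with one block per tile. The sketch below is hedged: the launch call and ct.Constant wrapper follow the inspection example quoted in this article, while the array setup and the integer grid computation are illustrative assumptions.

```julia
using CUDA
import cuTile as ct

n, tile_size = 1024, 16
a, b = CUDA.rand(n), CUDA.rand(n)
c = CUDA.zeros(n)

# One block per tile of the output; tile_size is passed as a
# compile-time constant so the compiler can specialize the kernel.
grid = cld(n, tile_size)  # assumed grid convention
ct.launch(vadd, grid, a, b, c, ct.Constant(tile_size))
```

Note that the kernel itself never sees n: bounds handling for a final partial tile is the compiler's job, not the programmer's.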
This transparency makes it possible to debug and to understand how high-level Julia code translates to tile operations, by inspecting the generated Tile IR (output truncated in the original):

```julia
julia> ct.@device_code_tiled ct.launch(vadd, grid, a, b, c, ct.Constant(16))
cuda_tile.module @kernels {
  entry @vadd(%arg0: tile…
```
