Kforge: Program Synthesis with LLM Agents Enables Single-Shot Targeting of Diverse AI Hardware Accelerators

Optimizing machine learning workloads demands high-performance code, yet achieving this across a growing range of specialized hardware remains a significant challenge. To address this, Taras Sereda, Tom St. John, and Burak Bartan, alongside Natalie Serrino, Sachin Katti, and Zain Asgar, present KForge, a framework that automates the creation of optimized code for diverse AI accelerators. KForge employs a collaborative system of two LLM agents: one generates and refines programs based on compilation and correctness feedback, while the other analyzes performance data to guide improvements. The approach requires only a single example to adapt to a new hardware platform, and the team demonstrates successful code generation on both CUDA and Apple Metal, a substantial step toward simplifying the deployment of machine learning models across a wide variety of computing devices.

The framework's central contribution is a performance analysis agent that interprets profiling data to guide optimization. Together with the generation agent, it drives an iterative refinement loop of functional and optimization passes, interpreting diverse profiling data from both programmatic APIs and GUI-based tools and converting it into actionable recommendations for program synthesis on arbitrary accelerators. Because this agent-based architecture needs only a single example to target a new platform, it marks a significant advance in automated kernel development. The researchers further show that the generation agent benefits from cross-platform knowledge transfer: a reference implementation from one architecture substantially improves the performance of code generated for another.

Optimized Kernels and Precomputed Constant Buffers

The generated programs focus on achieving maximum performance on Apple Silicon and other hardware accelerators through a handful of key optimization strategies: reducing kernel launches, which carry significant overhead; precomputing constant values to avoid redundant calculations during execution; and ensuring that data is contiguous in memory and resides on the correct device. The generated code also consistently leverages highly optimized vendor libraries, such as those available for NVIDIA GPUs and Apple Silicon, and employs constant folding to simplify expressions ahead of execution. A minimal sketch of these strategies appears below.
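The following sketch is our own construction, not code emitted by KForge: it precomputes a transposed weight once, registers constants as buffers so they track the module's device, enforces contiguity, and collapses the layer into a single fused `addmm` call instead of separate matmul and add launches.

```python
import torch
import torch.nn as nn

class FusedLinearSketch(nn.Module):
    """Illustrative module (names are ours) showing the strategies above:
    constants precomputed at init and registered as buffers, inputs made
    contiguous and placed on the right device, and the whole layer
    expressed as one fused library call."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        weight = torch.randn(out_features, in_features)
        bias = torch.randn(out_features)
        # Precompute the transposed weight once so forward() pays no
        # per-call transpose; buffers move with the module across devices.
        self.register_buffer("weight_t", weight.t().contiguous())
        self.register_buffer("bias", bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One fused GEMM-plus-bias launch (addmm) instead of separate
        # matmul and add kernels; .contiguous() guards against strided input.
        x = x.contiguous().to(self.weight_t.device)
        return torch.addmm(self.bias, x, self.weight_t)
```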

One prominent approach collapses sequences of operations into a single, highly optimized operation, such as a matrix-vector or matrix-matrix multiplication. For example, several generated programs optimize a sequence consisting of a linear layer, a maximum-value reduction, a subtraction, and a GELU activation. The key insight is that the entire sequence simplifies to a constant zero tensor, which can be precomputed and reused, eliminating all kernel launches and reducing the computation to returning a cached buffer (see the sketch below). Other programs reduce complex sequences to a single dot product, precomputing the necessary sums and calling optimized linear algebra libraries.
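The zero-tensor simplification can be made concrete. The reference model below is our plausible reconstruction of the benchmark problem described above, not the paper's exact code; the optimized variant shows how the whole forward pass reduces to an expanded view of a cached zero buffer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceModel(nn.Module):
    """Plausible reconstruction (ours) of the linear -> max -> subtract
    -> GELU sequence described above."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear(x)                             # (B, out)
        x = torch.max(x, dim=1, keepdim=True).values   # (B, 1)
        x = x - x.mean(dim=1, keepdim=True)            # (B, 1) minus itself = 0
        return F.gelu(x)                               # GELU(0) = 0

class OptimizedModel(nn.Module):
    """After the max reduction each row holds one element, so subtracting
    the row mean yields exactly zero and GELU(0) is zero: the network's
    output never depends on the input values."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()  # arguments kept only to preserve the interface
        self.register_buffer("zero", torch.zeros(1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No kernel launches: expand() returns a zero-copy view of the buffer.
        return self.zero.expand(x.shape[0], 1)
```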

The generated programs rely on a deep understanding of the mathematical properties of the operations involved and consistently prioritize minimizing computational complexity; the repetition observed across them reflects the systematic application of the same techniques to different problems. The key takeaway is that carefully analyzing the mathematics of an operator sequence and delegating to optimized libraries can yield significant performance gains.

KForge Automates AI Kernel Code Generation

The researchers built KForge as a platform-agnostic framework for automatically generating high-performance code for diverse AI hardware accelerators, centered on two collaborating agents: a generation agent that creates and refines programs, and a performance analysis agent that interprets profiling data to guide optimization. The system achieves effective program synthesis with only a single example when targeting a new platform. Experiments further show that KForge exploits cross-platform knowledge transfer: providing a reference implementation from one architecture substantially improves the quality of code generated for a different hardware target.
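The paper does not publish its loop verbatim, but the collaboration it describes can be sketched as follows. Every interface here (`generator`, `analyzer`, `harness`, and their methods and result fields) is a hypothetical placeholder of ours, not KForge's API.

```python
# Minimal sketch of the two-agent refinement loop: the generation agent
# proposes code, the harness compiles and tests it (functional pass), and
# the analysis agent turns profiling data into recommendations that seed
# the next round (optimization pass).

def kforge_loop(problem, generator, analyzer, harness, max_iters=10):
    feedback, best = None, None
    for _ in range(max_iters):
        candidate = generator.generate(problem, feedback)  # LLM call
        result = harness.compile_and_test(candidate)       # functional pass
        if not result.correct:
            feedback = result.error_log                    # compiler/test errors
            continue
        if best is None or result.latency < best.latency:
            best = result                                  # keep fastest correct program
        profile = harness.profile(candidate)               # optimization pass
        feedback = analyzer.recommend(profile)             # profiling -> advice
    return best
```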

Specifically, the team demonstrated effective program synthesis across two fundamentally different parallel computing platforms, CUDA and Apple Metal, confirming the framework's versatility. The generation agent iteratively refines programs using compilation and correctness feedback, guided by the performance analysis agent's interpretation of profiling data gathered from both programmatic APIs and GUI-based tools. The program synthesis agent behaves as a function from a text prompt to generated code comprising four parts: a kernel program, kernel scheduling code, JIT library-compilation code, and a PyTorch model class. The team used the Jinja2 template engine to parameterize prompts, enabling dynamic configuration based on the available supervision and computational resources.
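PyTorch's inline extension mechanism illustrates how those four components fit together on a CUDA target. The kernel and all names below are our own minimal example, not KForge output.

```python
import torch
from torch.utils.cpp_extension import load_inline

# 1. Kernel program and 2. kernel scheduling (launch) code, as the
#    generation agent would emit them for a CUDA target.
cuda_source = r"""
__global__ void scale_kernel(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * s;
}

torch::Tensor scale(torch::Tensor x, double s) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // launch schedule
    scale_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), (float)s, n);
    return y;
}
"""
cpp_source = "torch::Tensor scale(torch::Tensor x, double s);"

# 3. JIT compilation of the generated sources into a loadable library.
ext = load_inline(name="scale_ext", cpp_sources=cpp_source,
                  cuda_sources=cuda_source, functions=["scale"])

# 4. A PyTorch model class wrapping the compiled kernel.
class ScaleModel(torch.nn.Module):
    def __init__(self, s: float):
        super().__init__()
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes a float32 input; moved to GPU and made contiguous first.
        return ext.scale(x.contiguous().cuda(), self.s)
```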

Cross-Platform Program Synthesis with Collaborative Agents

KForge represents a significant advance in automated program synthesis for diverse hardware accelerators. The platform-agnostic framework pairs two collaborative agents, one for program generation and refinement and one for performance analysis, and iteratively improves programs through compilation, correctness feedback, and interpretation of profiling data, enabling optimization for arbitrary accelerator architectures. As the experiments above show, knowledge transferred from one platform substantially enhances program quality on different hardware.
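On the programmatic side, profiling data of the kind the analysis agent consumes can be captured with torch.profiler. The snippet below reflects our choice of tool and workload, not the paper's exact setup; it produces the sort of text report an analysis agent could be prompted with.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy workload standing in for a synthesized program under test.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Collect CPU- and GPU-side events over a few iterations.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Text summary of the hottest kernels, suitable for an LLM prompt.
report = prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)
print(report)
```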

Validation across CUDA and Apple Metal platforms confirms the framework's adaptability and broad applicability. Experiments reveal that incorporating the analysis agent's performance recommendations improves results, particularly on more complex problems, with some synthesized programs running up to 30% faster. The synthesized programs are also robust across varying input shapes and batch sizes, consistently outperforming standard PyTorch implementations in certain low-latency regimes. The team acknowledges that profiling information alone is not always sufficient for improvement and can in some cases degrade performance, an area they intend to investigate further. Future work may combine KForge's kernel-level optimizations with higher-level graph optimizations, potentially yielding even greater performance gains.

👉 More information
🗞 KForge: Program Synthesis for Diverse AI Hardware Accelerators
🧠 ArXiv: https://arxiv.org/abs/2511.13274

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
