Efficient training and inference of deep learning recommendation models present significant challenges due to the diversity of model architectures, computational kernels, and increasingly heterogeneous hardware. To address these issues, Gang Liao, Hongsen Qin, and Ying Wang at Meta, alongside Alicia Golden, Michael Kuchnik, and Yavuz Yetim, introduce KernelEvolve, an agentic kernel coding framework that automates the generation and optimisation of kernels for recommendation models across diverse hardware. The system operates across multiple programming levels, from high-level domain-specific languages to low-level code, and uses a dynamic, graph-based search process to adapt to runtime conditions. By optimising a wide range of production models on both conventional GPUs and Meta's own AI accelerators, and achieving a perfect pass rate on the KernelBench benchmark suite, KernelEvolve reduces development time from weeks to hours and delivers substantial performance gains, while lowering the barrier to programming new AI hardware.
The framework accepts kernel specifications as input and automates kernel generation and optimisation for recommendation models across diverse hardware architectures. This automation encompasses multiple programming abstractions, spanning the complete hardware-software optimisation stack, and dynamically adapts to the runtime execution context through retrieval.
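The paper's exact specification format is not reproduced in this summary; purely as an illustration, the sketch below shows a KernelBench-style specification, in which a PyTorch reference module plus representative inputs define the semantics, shapes, and dtypes a generated kernel must match (the class and function names here are hypothetical, and RMSNorm is used only because it appears among the operators discussed later).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a kernel specification in the KernelBench style:
# a PyTorch reference module plus representative inputs. An agentic system
# would take something like this as the behaviour its generated kernel must match.

class ReferenceModel(nn.Module):
    """Reference semantics the generated kernel must reproduce (RMSNorm here)."""
    def __init__(self, dim: int = 4096, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight.to(x.dtype)

def get_inputs(batch: int = 8, seq: int = 2048, dim: int = 4096):
    """Representative inputs that pin down shapes and dtypes for the search."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return [torch.randn(batch, seq, dim, dtype=torch.bfloat16, device=device)]
```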
Triton 1D Convolution Benchmark Versus PyTorch
This code implements a benchmark for a 1D convolutional layer using Triton and compares its performance against a PyTorch reference implementation. The Triton kernel is an optimised implementation of the convolution operation, and the benchmark uses the TritonBench framework to measure its latency and speedup relative to PyTorch, with accuracy checks confirming that the Triton kernel matches the PyTorch reference. The file is organised into functions that generate input data, define the PyTorch and Triton models, and encapsulate the benchmarking logic; a decorator registers the implementations as benchmarks within TritonBench, and the run() function orchestrates the benchmark execution. A condensed sketch of this structure appears below.
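The original benchmark file is not shown in this summary; the following is a minimal, self-contained sketch of the same idea, assuming a single-channel float32 signal with stride 1 and no padding, and using triton.testing.do_bench for timing in place of the TritonBench registration decorator and run() plumbing (kernel and function names are illustrative).

```python
import torch
import torch.nn.functional as F
import triton
import triton.language as tl
from triton.testing import do_bench

@triton.jit
def conv1d_kernel(x_ptr, w_ptr, out_ptr, L_OUT, K: tl.constexpr, BLOCK: tl.constexpr):
    # Each program instance computes BLOCK contiguous output positions.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < L_OUT
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for k in range(K):  # K is constexpr, so this loop is unrolled at compile time
        x = tl.load(x_ptr + offs + k, mask=mask, other=0.0)
        w = tl.load(w_ptr + k)
        acc += x * w
    tl.store(out_ptr + offs, acc, mask=mask)

def conv1d_triton(x: torch.Tensor, w: torch.Tensor, BLOCK: int = 256) -> torch.Tensor:
    L, K = x.numel(), w.numel()
    L_out = L - K + 1
    out = torch.empty(L_out, device=x.device, dtype=torch.float32)
    grid = (triton.cdiv(L_out, BLOCK),)
    conv1d_kernel[grid](x, w, out, L_out, K=K, BLOCK=BLOCK)
    return out

def conv1d_torch(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # PyTorch reference: conv1d computes cross-correlation, matching the kernel above.
    return F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).view(-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(1 << 20, device="cuda", dtype=torch.float32)
    w = torch.randn(9, device="cuda", dtype=torch.float32)

    # Accuracy check against the PyTorch reference.
    assert torch.allclose(conv1d_triton(x, w), conv1d_torch(x, w), atol=1e-4, rtol=1e-4)

    # Latency in milliseconds and speedup, analogous to TritonBench's metrics.
    ms_triton = do_bench(lambda: conv1d_triton(x, w))
    ms_torch = do_bench(lambda: conv1d_torch(x, w))
    print(f"triton: {ms_triton:.3f} ms  torch: {ms_torch:.3f} ms  "
          f"speedup: {ms_torch / ms_triton:.2f}x")
```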
Key components include PyTorch's compiler, which optimises the model, and a context manager that disables gradient calculation so timings are not skewed by autograd overhead. Potential improvements include more flexible input data generation, further optimisation of the Triton kernel through tiling strategies and memory access patterns, and more robust error handling. Profiling can identify performance bottlenecks, and experimenting with batch sizes and data types can refine performance further. Including a 2D convolution baseline requires careful configuration to keep the comparison fair (a minimal sketch of such a configuration follows below), and code clarity benefits from comments and descriptive variable names. To run the code, users install the dependencies, save the file, and execute the benchmark with command-line arguments specifying the device, precision, and metrics. The code provides a solid foundation for benchmarking a 1D convolution kernel with Triton, and addressing these improvements yields more meaningful results.
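To illustrate the point about a 2D convolution baseline, here is a small, self-contained sketch (the function name is illustrative) showing how F.conv2d can be configured with a height-1 kernel so that it computes exactly the same result as the 1D reference, keeping the comparison fair:

```python
import torch
import torch.nn.functional as F

def conv1d_via_conv2d(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Treat the 1D signal as a 1xL "image" and the filter as a 1xK kernel,
    # so conv2d reproduces the conv1d output of length L - K + 1 exactly.
    x4 = x.view(1, 1, 1, -1)   # (N=1, C_in=1, H=1, W=L)
    w4 = w.view(1, 1, 1, -1)   # (C_out=1, C_in=1, kH=1, kW=K)
    return F.conv2d(x4, w4).view(-1)

x = torch.randn(1 << 16)
w = torch.randn(9)
ref = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).view(-1)
assert torch.allclose(conv1d_via_conv2d(x, w), ref, atol=1e-5)
```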
KernelEvolve Automates Deep Learning Kernel Optimisation
Researchers at Meta have developed KernelEvolve, an agentic kernel coding framework designed to optimise deep learning recommendation model (DLRM) training and inference across diverse hardware. The work addresses challenges stemming from model architecture diversity, kernel variations, and hardware heterogeneity. KernelEvolve automates kernel generation and optimisation, operating across multiple programming abstractions that span the entire hardware-software optimisation stack. Experiments demonstrate 100% pass rates on the KernelBench suite and successful optimisation of 160 PyTorch ATen operators on three hardware platforms, confirming correctness across all tested operators and platforms.
The team measured substantial performance improvements over PyTorch baselines while reducing kernel development time from weeks to hours. For Llama-3.1-8B inference workloads, Vanilla Attention saw a 4.6x speedup, and SDPA-MLP achieved a 3.3x speedup.
Tests revealed significant gains across convolutional transformers, with the conv1d and conv2d operators achieving 6.5x and 4.7x speedups, respectively. Memory-bound data preprocessing operators also benefited, with improvements ranging from 4.1x to 9x.
Compute-intensive fusion kernels in ranking models, such as WuKong Optimized FM and InterFormer PFFN, demonstrated speedups of 4.0x and 2.5x, while the RMSNorm 2D backward kernel reached a 17x acceleration. Even retrieval operations, including the Sparse Inverted Index, saw a 1.25x performance boost, underscoring KernelEvolve's broad applicability.
Automated Kernel Optimisation for Recommendation Models
KernelEvolve represents a significant advance in automated kernel generation and optimisation for deep learning recommendation models, addressing a critical bottleneck in large-scale machine learning infrastructure. Researchers have developed an agentic framework that automates the process of creating and refining computational kernels across diverse hardware platforms. This system operates across multiple programming abstractions, enabling it to adapt to a wide range of hardware architectures. The team validated KernelEvolve on a public benchmark suite and production recommendation models, achieving a perfect pass rate and demonstrating substantial performance gains compared to standard PyTorch implementations.
This automated approach reduces kernel development time from weeks to hours, accelerating the deployment of new models and features. By mitigating the challenges of programming new AI hardware, KernelEvolve lowers the barrier to entry for using custom accelerators and unlocks greater potential for innovation. The authors note that the current implementation focuses on deep learning recommendation models, and that future work will explore extending the framework to a broader range of machine learning tasks and hardware platforms.
👉 More information
🗞 KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
🧠 ArXiv: https://arxiv.org/abs/2512.23236
