Developing efficient artificial intelligence systems for specialised hardware remains a significant challenge, and researchers are now addressing it with a new end-to-end framework called AIE4ML. Dimitrios Danopoulos, Enrico Lupi, and Chang Sun, all from the European Organization for Nuclear Research (CERN), together with Sebastian Dittmeier from Heidelberg University and Vladimir Loncar from the Institute of Physics Belgrade, present a system that automatically converts AI models into optimised firmware for the next generation of AMD AI Engines. The framework achieves near-peak performance at the individual kernel level and, importantly, scales across the entire processing fabric while keeping all data on-chip, a crucial step towards ultra-low-latency applications. By accepting existing, compressed AI models and delivering GPU-class throughput at microsecond latency, AIE4ML offers a practical solution for demanding environments such as the trigger systems used in particle physics experiments.
The framework targets AIE-ML-generation devices while maintaining forward compatibility with the newer AIE-MLv2 architecture, and at the single-kernel level the team achieves performance approaching the architectural peak.
Versal ACAPs Accelerate Deep Learning Workloads
Modern deep neural networks, particularly transformers and MLP-Mixers, demand significant computational resources, while scientific applications such as identifying particle jets in high-energy physics require real-time or near-real-time inference. Traditional CPUs and GPUs often prove insufficient in performance and energy efficiency, leading researchers to explore reconfigurable hardware such as FPGAs and ACAPs. These devices can be customised for specific workloads, making them a promising platform for accelerating deep learning tasks. The AMD Versal ACAP, with its dedicated AI Engine array, is a key platform for this research.
The AIE-ML generation of the AI Engine is optimised for deep neural network inference and introduces dedicated memory tiles, an architectural feature crucial for performance. Researchers employ a range of optimisation techniques and frameworks to maximise efficiency, including quantization, pruning, and custom dataflow designs, while MLIR-based compilation provides a flexible and extensible compilation flow. Several frameworks in this space both compete with and complement one another, aiming for significant improvements in performance, energy efficiency, and resource utilisation.
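As a concrete illustration of the quantization step, the sketch below builds a small quantization-aware model with QKeras. QKeras itself, the bit widths, and the 16-input/5-class shape are assumptions chosen for illustration; the article does not name a specific quantization toolflow.

```python
# Illustrative only: one common way to produce a quantized model that a
# framework like AIE4ML could ingest. QKeras is an assumption here; the
# article does not specify which quantization toolflow is used.
from tensorflow.keras.models import Sequential
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

model = Sequential([
    # 8-bit weights and activations; all widths and shapes are hypothetical
    QDense(64, input_shape=(16,),
           kernel_quantizer=quantized_bits(8, 0, alpha=1),
           bias_quantizer=quantized_bits(8, 0)),
    QActivation(quantized_relu(8)),
    QDense(5,
           kernel_quantizer=quantized_bits(8, 0, alpha=1),
           bias_quantizer=quantized_bits(8, 0)),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# After training, the fixed-point weights can be exported so that a
# downstream compiler can reproduce the arithmetic bit-exactly in firmware.
```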
On-Chip Scaling for AI Engine Inference
AIE4ML represents a significant advancement in artificial intelligence inference, delivering a comprehensive solution for converting AI models into optimised firmware for Versal AI Engine (AIE) devices. The framework addresses challenges in efficiently deploying AI inference, particularly regarding tightly coupled execution, explicit data pathways, and local memory management. AIE4ML achieves high performance by exploiting key architectural features, sustaining high utilisation across both single and multiple processing tiles. The framework systematically derives deterministic and compact placements for multi-layer implementations, tailored to the physical grid of the device through a novel graph placement and search algorithm. This ensures efficient use of the AIE-ML’s resources and minimises data movement. AIE4ML seamlessly accepts quantized models imported from high-level tools, preserving bit-exactness and allowing for integration with existing AI development workflows.
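The article describes the placement step only at a high level. The following is a minimal hypothetical sketch of what a deterministic, compact placement onto a 2-D tile grid can look like, using a simple greedy column-major fill rather than the authors' actual graph placement and search algorithm; the function name and grid dimensions are illustrative.

```python
# Hypothetical sketch: mapping a chain of layer kernels onto a fixed
# 2-D grid of AIE tiles. This is NOT the authors' placement algorithm,
# only an illustration of the deterministic, compact placements the
# text refers to.
from typing import Dict, List, Tuple

def place_layers(tiles_per_layer: List[int],
                 rows: int, cols: int) -> Dict[Tuple[int, int], int]:
    """Assign each layer's kernels to consecutive tiles, column-major,
    so that neighbouring layers land on physically adjacent tiles and
    intermediate data can stay on-chip. Raises if the model does not fit."""
    if sum(tiles_per_layer) > rows * cols:
        raise ValueError("model does not fit on the tile grid")
    placement = {}
    slot = 0  # linear index walked column by column
    for layer, n_tiles in enumerate(tiles_per_layer):
        for _ in range(n_tiles):
            col, row = divmod(slot, rows)
            placement[(col, row)] = layer
            slot += 1
    return placement

# Example: a 3-layer model needing 4, 8 and 2 tiles on a 4x8 grid
# (grid dimensions are illustrative, not the AIE-ML's actual geometry).
print(place_layers([4, 8, 2], rows=4, cols=8))
```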
In layer scaling benchmarks, the framework achieves up to 98.6% efficiency relative to the single-kernel baseline, utilising 97.4% of AIE tiles with entirely on-chip data movement. This demonstrates the effectiveness of the parallelisation strategy and the efficient use of the AIE-ML fabric. Evaluations across real-world model topologies reveal that AIE4ML delivers GPU-class throughput under microsecond latency constraints, making it a practical solution for ultra-low-latency environments, such as trigger systems in particle physics experiments.
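To make the efficiency metric concrete, the snippet below checks what 98.6% scaling efficiency means relative to ideal linear scaling of the single-kernel baseline; the per-kernel throughput and kernel count are made-up placeholders, not figures from the paper.

```python
# Worked check of the scaling-efficiency metric. Only the 98.6% figure
# and the idea of linear scaling come from the article; the baseline
# numbers below are placeholders.
def scaling_efficiency(measured_tops: float,
                       per_kernel_tops: float,
                       n_kernels: int) -> float:
    """Measured throughput divided by ideal linear scaling."""
    return measured_tops / (per_kernel_tops * n_kernels)

per_kernel = 0.5          # placeholder single-kernel throughput (TOPS)
n = 256                   # placeholder kernel count
ideal = per_kernel * n    # 128 TOPS at perfect linear scaling
measured = 0.986 * ideal  # what 98.6% efficiency implies
print(f"{scaling_efficiency(measured, per_kernel, n):.1%}")  # -> 98.6%
```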
Automated AI Firmware for AIE-ML Architecture
Taken together, AIE4ML delivers scalable, high-throughput inference while maintaining bit-exact accuracy when importing quantized models from standard machine learning toolflows. In evaluations on real-world model topologies, the system attained a peak throughput of 113.4 TOPS, significantly outperforming existing GPU, FPGA, and ANE implementations.
👉 More information
🗞 AIE4ML: An End-to-End Framework for Compiling Neural Networks for the Next Generation of AMD AI Engines
🧠 ArXiv: https://arxiv.org/abs/2512.15946
