Optimizing GEMM on Versal ACAP with Machine Learning Achieves 1.23x Throughput and 1.25x Energy Efficiency Gains

General Matrix Multiplication (GEMM), a core operation in fields from scientific computing to deep learning, frequently limits both performance and energy efficiency, particularly in resource-constrained environments. Ilias Papalamprou, Dimosthenis Masouros, and Ioannis Loudaros, all from the National Technical University of Athens, along with colleagues, address this challenge by developing an automated framework specifically for Versal ACAP architectures. Their approach moves beyond traditional analytical methods by employing a Machine Learning model, trained on approximately 6000 on-board experiments, to explore the design space and optimize GEMM mappings. Tested on a Versal VCK190 platform, the framework delivers geomean gains of 1.23x in throughput and 1.25x in energy efficiency compared to existing state-of-the-art frameworks.

Recognizing that GEMM often limits performance, particularly in power-constrained edge devices, the team addressed the challenge of mapping this operation across the device’s diverse components. The study pioneered a method that leverages Machine Learning to guide Design Space Exploration, enabling the identification of GEMM mappings optimized for either performance or energy efficiency. The core of this work involved training a Machine Learning model on approximately 6000 on-board experiments, each representing a different GEMM mapping configuration.

These experiments systematically varied parameters to explore the vast design space created by the heterogeneous architecture. This approach contrasts with prior analytical methods, which often struggle with accuracy and overlook energy-performance trade-offs. The team discovered that maximizing throughput does not automatically guarantee optimal energy efficiency, demonstrating a 22.4% difference between the highest-throughput and most energy-efficient designs for a typical GEMM workload. To further refine the process, the team employed tiling strategies, splitting computations into smaller blocks to improve data reuse and alleviate bandwidth limitations.
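The tiling idea can be illustrated with a minimal sketch in plain Python. This is not the framework's code: the function names and the tile sizes `tile_m`, `tile_n`, and `tile_k` are hypothetical, and the loops only show how a GEMM is split into blocks so that each block of the inputs is reused while it is resident in fast memory.

```python
def gemm_reference(a, b):
    """Plain triple-loop GEMM, used only to check the tiled version."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_tiled(a, b, tile_m, tile_n, tile_k):
    """Compute C = A x B block by block.

    tile_m, tile_n and tile_k are hypothetical tiling factors; a real
    mapping would choose them to fit on-chip buffers.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    c = [[0] * n for _ in range(m)]
    # Outer loops walk over tiles; inner loops work inside one tile.
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                for i in range(i0, min(i0 + tile_m, m)):
                    for p in range(k0, min(k0 + tile_k, k)):
                        a_ip = a[i][p]  # reused across the whole row of the tile
                        for j in range(j0, min(j0 + tile_n, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


# Small integer matrices so the comparison is exact.
A = [[i + j for j in range(5)] for i in range(4)]      # 4 x 5
B = [[i * j + 1 for j in range(3)] for i in range(5)]  # 5 x 3
C_tiled = gemm_tiled(A, B, tile_m=2, tile_n=2, tile_k=3)
```

Different tile sizes give the same result but change how often each block is fetched, which is the kind of trade-off the design space exploration has to weigh.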

The framework analyzes the impact of tiling factors on both performance and power consumption, considering the interplay between data transfers to and from memory and the effective utilization of the AI Engines. Accurate power modeling proved crucial, as vendor-provided tools often diverge from real hardware measurements, hindering energy-aware optimization. Evaluation on the Versal VCK190 demonstrated geomean improvements of 1.23x in throughput and 1.25x in energy efficiency.

The work addresses a critical bottleneck in many modern workloads, particularly deep learning, by optimizing how these calculations are mapped onto the device’s heterogeneous architecture. Researchers developed an automated system driven by a Machine Learning model, trained on approximately 6000 on-board experiments, to explore design options and identify optimal configurations. This approach overcomes limitations of prior analytical methods, which often overlook crucial energy-performance trade-offs and can be inaccurate in predicting real-world results.
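To make the two metrics concrete, the sketch below derives throughput and energy efficiency from a measured latency and power figure, and aggregates per-workload gains with a geometric mean. It assumes the common convention of counting a multiply-accumulate as two operations; the latency, power, and speedup values are hypothetical and are not taken from the paper.

```python
from statistics import geometric_mean


def gemm_ops(m, n, k):
    """Operation count of an M x K by K x N GEMM (multiply + add = 2 ops)."""
    return 2 * m * n * k


def throughput_gops(m, n, k, latency_s):
    """Throughput in giga-operations per second."""
    return gemm_ops(m, n, k) / latency_s / 1e9


def energy_efficiency(m, n, k, latency_s, power_w):
    """Energy efficiency in giga-operations per second per watt."""
    return throughput_gops(m, n, k, latency_s) / power_w


M = N = K = 1024

# Two hypothetical mappings of the same workload (illustrative numbers only).
fast = {"latency_s": 1.0e-3, "power_w": 40.0}  # lower latency, higher power
lean = {"latency_s": 1.1e-3, "power_w": 30.0}  # slightly slower, lower power

fast_tp = throughput_gops(M, N, K, fast["latency_s"])
lean_tp = throughput_gops(M, N, K, lean["latency_s"])
fast_eff = energy_efficiency(M, N, K, **fast)
lean_eff = energy_efficiency(M, N, K, **lean)

# Geomean of per-workload speedups over a baseline (hypothetical ratios).
speedups = [1.10, 1.30, 1.25]
geomean_speedup = geometric_mean(speedups)
```

With these made-up numbers the faster mapping wins on throughput while the slower one wins on operations per watt, which is the kind of divergence the article describes.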

Experiments reveal that the highest-throughput design is, in some cases, 22.4% less energy-efficient than the most energy-efficient configuration, demonstrating the importance of considering both metrics simultaneously. The team’s Machine Learning model predicts latency, power consumption, and resource utilization with significantly improved accuracy, 51% higher than analytical modeling approaches. This enhanced prediction capability allows the framework to efficiently identify Pareto-optimal mappings, delivering configurations tailored for either high throughput or energy-efficient operation. Evaluations on the Versal VCK190 demonstrate geomean improvements of 1.23x, with peaks reaching 2.52x, in throughput. Simultaneously, the framework achieves geomean improvements of 1.25x, up to 2.69x, in energy efficiency compared to state-of-the-art approaches.

The team created an open-source dataset of approximately 6000 GEMM mappings with experiments on Versal VCK190, further contributing to the field and enabling future research. This work addresses the challenge of optimizing both performance and energy efficiency, areas often overlooked in prior research. The team’s approach utilizes a Machine Learning model, trained through extensive on-board experimentation, to accurately predict the performance and power consumption of different hardware configurations. Evaluation on the Versal VCK190 demonstrates significant improvements over existing methods, with the framework achieving, on average, a 1.23-fold increase in throughput and a 1.25-fold improvement in energy efficiency.

The Machine Learning model closely matches the true performance limits achievable on the hardware, offering a robust and efficient design space exploration tool. Future research directions include exploring the application of this Machine Learning-driven approach to other computational kernels and investigating methods to further refine the model’s predictive capabilities. This work represents a substantial advance in optimizing heterogeneous computing platforms for demanding applications, particularly in resource-constrained environments.
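The selection of Pareto-optimal mappings mentioned above can be sketched as a simple dominance filter over predicted throughput and energy efficiency. This is an illustrative sketch, not the authors' implementation: the `Mapping` type, the candidate names, and all numbers are hypothetical, and in the real framework the two metrics would come from the trained model's predictions.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    name: str
    throughput: float          # higher is better
    energy_efficiency: float   # higher is better


def dominates(a, b):
    """True if a is at least as good as b on both metrics and better on one."""
    return (a.throughput >= b.throughput
            and a.energy_efficiency >= b.energy_efficiency
            and (a.throughput > b.throughput
                 or a.energy_efficiency > b.energy_efficiency))


def pareto_front(mappings):
    """Keep only the mappings that no other mapping dominates."""
    return [m for m in mappings
            if not any(dominates(other, m) for other in mappings)]


# Hypothetical predicted metrics for four candidate mappings.
candidates = [
    Mapping("A", 2100.0, 52.0),
    Mapping("B", 1950.0, 65.0),
    Mapping("C", 1900.0, 60.0),  # worse than B on both metrics
    Mapping("D", 2000.0, 58.0),
]

front = pareto_front(candidates)
fastest = max(front, key=lambda m: m.throughput)
most_efficient = max(front, key=lambda m: m.energy_efficiency)
```

From the resulting front, a user would pick the throughput-oriented or the energy-oriented end depending on the deployment target.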

👉 More information
🗞 Optimizing GEMM for Energy and Performance on Versal ACAP Architectures
🧠 ArXiv: https://arxiv.org/abs/2511.06907

Rohail T.
