Machine Learning Achieves Runtime Optimisation for GEMM with Dynamic Thread Selection

General Matrix Multiplication (GEMM) is a cornerstone of modern scientific computing, yet achieving peak performance on multi-core systems remains a significant challenge. Yufan Xia, Marco De La Pierre, and Amanda S. Barnard, from the Australian National University, alongside Giuseppe Maria Junior Barca, tackled this problem with a machine learning approach to runtime optimisation. Their research introduces a proof-of-concept Architecture and Data-Structure Aware Linear Algebra (ADSALA) library, which dynamically selects the optimal number of threads for each GEMM task. Tested on both Intel Cascade Lake and AMD Zen 3 high-performance computing nodes, the method delivers a 25 to 40 per cent speedup over traditional BLAS implementations for GEMM operations using up to 100 MB of memory, a promising step towards more efficient scientific simulations.

Single-thread GEMM implementations have benefited from extensive optimisation through techniques such as blocking and autotuning, yielding considerable performance gains. However, determining the number of threads that minimises execution time for multi-thread GEMM on modern multi-core shared-memory systems remains difficult. This research investigates strategies for parallelising GEMM efficiently to maximise throughput on these complex architectures. The work aims to identify and implement techniques that make effective use of the available cores while minimising the overhead associated with thread management and data contention, ultimately improving the scalability and performance of GEMM.

Automated BLAS Tuning via Machine Learning

This paper presents an extensive, technically detailed study on automating the tuning of linear algebra algorithms, particularly BLAS routines, using machine learning, with the aim of overcoming the limitations of traditional manual optimisation. Conventional tuning is time-consuming, hardware-specific, and demands significant expert knowledge, making it poorly suited to modern computing environments where architectures and workloads change rapidly. The authors propose a data-driven system that learns to predict optimal algorithm configurations, such as thread count and blocking parameters, for a given problem size and hardware platform, thereby reducing manual effort while improving performance.

At the core of the approach is performance modelling, where execution time is predicted as a function of problem characteristics, hardware features, and algorithmic parameters. The study explores a broad range of machine learning techniques, including lasso regression for feature selection, Bayesian regression for uncertainty-aware predictions, random forests and XGBoost for high-accuracy ensemble learning, support vector regression for high-dimensional feature spaces, nearest-neighbour methods for capturing local performance patterns, and density-based outlier detection to improve data quality. A strong emphasis is placed on feature engineering, with the use of data normalisation techniques, interaction terms, and carefully designed sampling strategies to generate informative training data.
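To make the modelling step concrete, here is a minimal sketch of such a performance model in Python, assuming scikit-learn. The synthetic timing data, the simple (m, n, k, threads) feature set, and the random-forest choice are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a GEMM performance model: predict wall time from
# (m, n, k, threads). The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "measurements": matrix dimensions (m, n, k) and BLAS thread counts.
dims = rng.integers(256, 4096, size=(500, 3)).astype(float)
threads = rng.integers(1, 33, size=500).astype(float)

# Toy timing law: flops / (threads * per-core rate) plus a per-thread overhead.
flops = 2.0 * dims[:, 0] * dims[:, 1] * dims[:, 2]
times = flops / (threads * 5e10) + 1e-4 * threads

X = np.column_stack([dims, threads])
X_train, X_test, y_train, y_test = train_test_split(X, times, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out R^2: {model.score(X_test, y_test):.3f}")
```

In a real system the feature set would also carry hardware descriptors and the interaction terms mentioned above; the point of the sketch is only the shape of the regression problem.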

The workflow begins with systematic data collection: BLAS routines are run under many different configurations and their execution times measured. Relevant features are then extracted from problem sizes, hardware specifications, and algorithm parameters. Machine learning models are trained on this data, evaluated for predictive accuracy, and finally used to select optimal configurations for new problem instances at runtime. Experiments are conducted across multiple hardware platforms and linear algebra operations, such as matrix multiplication and LU decomposition, using established libraries.
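The data-collection step might look like the following sketch, assuming NumPy linked against a multi-threaded BLAS and using threadpoolctl to cap the thread count; the size and thread grids are illustrative, not the paper's sampling design.

```python
# Sketch of the benchmarking step: time NumPy's BLAS-backed GEMM under
# different thread limits to build a training set.
import time
import numpy as np
from threadpoolctl import threadpool_limits

def time_gemm(n: int, threads: int, repeats: int = 3) -> float:
    """Best-of-`repeats` wall time (s) for an n x n GEMM with `threads` BLAS threads."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    with threadpool_limits(limits=threads, user_api="blas"):
        for _ in range(repeats):
            t0 = time.perf_counter()
            _ = a @ b
            best = min(best, time.perf_counter() - t0)
    return best

# Sweep an illustrative grid of sizes and thread counts; each record is a
# training row: (m, n, k, threads, measured_time).
records = [
    (n, n, n, t, time_gemm(n, t))
    for n in (256, 512, 1024, 2048)
    for t in (1, 2, 4, 8, 16)
]
```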

One of the key strengths of the approach is that it automates performance tuning while adapting to different systems and workloads. The authors demonstrate that their models can accurately predict optimal thread counts for GEMM operations, often selecting fewer than the maximum available threads. This counterintuitive result yields substantial performance gains by reducing synchronisation overhead and improving cache efficiency, with measured wall-time reductions of up to 70 per cent in some cases. Importantly, these improvements are achieved even against highly optimised libraries such as Intel MKL and AMD BLIS, highlighting the value of intelligent runtime configuration.
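A minimal sketch of how such runtime selection could work, reusing the regressor and feature layout from the earlier sketches; `best_thread_count` and `adaptive_gemm` are hypothetical names for illustration, not the ADSALA API.

```python
# Sketch of runtime selection: query the trained performance model for each
# candidate thread count and run GEMM with the predicted-fastest one.
import numpy as np
from threadpoolctl import threadpool_limits

def best_thread_count(model, m, n, k, max_threads=32):
    """Predict wall time for each candidate thread count; return the fastest."""
    candidates = np.arange(1, max_threads + 1)
    feats = np.column_stack([
        np.full_like(candidates, m),
        np.full_like(candidates, n),
        np.full_like(candidates, k),
        candidates,
    ])
    return int(candidates[np.argmin(model.predict(feats))])

def adaptive_gemm(model, a, b):
    """Run a @ b with the model's predicted-fastest BLAS thread count."""
    m, k = a.shape
    n = b.shape[1]
    t = best_thread_count(model, m, n, k)  # often below the machine maximum
    with threadpool_limits(limits=t, user_api="blas"):
        return a @ b
```

The per-call prediction is a handful of microseconds-scale model evaluations, which is why the runtime overhead the paper discusses stays small relative to a GEMM of meaningful size.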

The paper also discusses potential limitations, including the cost of data collection, challenges in generalising to unseen hardware, the complexity of feature selection, and the limited interpretability of complex machine learning models. There is also consideration of the runtime overhead introduced by prediction and the need for periodic retraining as systems evolve. Despite these challenges, the authors show that installation-time benchmarking can effectively tailor the model to a specific system, ensuring accurate and efficient predictions during execution.

Overall, this work demonstrates the significant potential of machine learning for automating performance optimisation in high-performance computing. By dynamically selecting optimal configurations based on learned performance models, the proposed system improves the efficiency of linear algebra operations on modern multi-core architectures. The results suggest particular benefits for systems with high core counts and large matrix sizes. Future directions include extending the approach to other BLAS routines, exploring transfer and online learning strategies, and adapting the framework to heterogeneous architectures combining CPUs and accelerators.

👉 More information
🗞 A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
🧠 ArXiv: https://arxiv.org/abs/2601.09114

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
