Asymmetric Tile Buffering Achieves Higher Arithmetic Intensity, Benefiting General Matrix Multiplication Performance

Matrix multiplication underpins modern artificial intelligence, and performing it efficiently remains a significant computational challenge. Chengyue Wang, Wesley Pang, and Xinrui Wu, alongside colleagues, investigate an approach called asymmetric tile buffering, which breaks from conventional methods that use equal buffered tile sizes for input and output data. Their work demonstrates, for the first time, that decoupling these tile sizes offers substantial practical benefits, achieving up to a 4.54x speedup on the latest XDNA2 AI Engine. By developing a performance model that weighs higher arithmetic intensity against the overhead of increased kernel switching, the team establishes a new performance record and provides insight into optimising matrix multiplication for AI workloads.

Traditional approaches employ symmetric tile buffering, where the buffered tile size of the input data matches that of the output. This research introduces asymmetric tile buffering (ATB), a technique that decouples the buffered tile dimensions of input and output data. The core idea is to reduce buffer requirements and increase arithmetic intensity beyond what is achievable with symmetric buffering. To explain the effect, the team developed a performance model that incorporates both the benefit of ATB, namely higher arithmetic intensity, and its overhead, namely increased kernel switching costs, providing insight into how to select effective tiling factors.

Key findings show that ATB improves GEMM throughput by up to 40% compared to optimised symmetric kernels and achieves a 4.54x speedup over the state-of-the-art GEMM implementation. The research also demonstrates that ATB is a general tiling strategy applicable to various NPU architectures, and presents kernel design strategies that leverage ATB for performance gains. Performance scales effectively with matrix size, maintaining high throughput even for smaller problems.
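The trade-off between arithmetic intensity and kernel switching can be illustrated with a toy model. The sketch below is not the authors' performance model: the 64 KiB memory budget, the byte widths, the double-buffering assumption, the search grid, and every name in it (`Tiling`, `buffer_bytes`, `best_tiling`, and so on) are hypothetical choices made for illustration. It searches for the tiling with the highest arithmetic intensity that fits the budget, first with the input tile forced to match the output tile, then with the input tile allowed to be smaller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tiling:
    m_out: int  # rows of the buffered output (C) tile
    m_in: int   # rows of the buffered input (A) tile; equals m_out when symmetric
    n: int      # columns of the buffered B and C tiles
    k: int      # reduction depth of the buffered A and B tiles


def buffer_bytes(t: Tiling, in_bytes: int = 2, acc_bytes: int = 4) -> int:
    """Local-memory footprint, assuming A and B are double-buffered and C is not."""
    a = 2 * t.m_in * t.k * in_bytes
    b = 2 * t.k * t.n * in_bytes
    c = t.m_out * t.n * acc_bytes
    return a + b + c


def arithmetic_intensity(t: Tiling, in_bytes: int = 2) -> float:
    """FLOPs per input byte for one k-deep update of the whole C tile."""
    flops = 2 * t.m_out * t.n * t.k
    bytes_in = (t.m_out * t.k + t.k * t.n) * in_bytes
    return flops / bytes_in


def kernel_calls_per_update(t: Tiling) -> int:
    """Assumed overhead proxy: one kernel call per buffered A tile."""
    return t.m_out // t.m_in


def best_tiling(budget: int, asymmetric: bool, k: int = 64,
                step: int = 16, limit: int = 256) -> Optional[Tiling]:
    """Exhaustive search for the highest-intensity tiling that fits the budget."""
    best = None
    for m_out in range(step, limit + 1, step):
        for n in range(step, limit + 1, step):
            m_in_options = range(step, m_out + 1, step) if asymmetric else [m_out]
            for m_in in m_in_options:
                if m_out % m_in:
                    continue  # the A tile must evenly divide the C tile
                t = Tiling(m_out, m_in, n, k)
                if buffer_bytes(t) > budget:
                    continue
                if best is None or arithmetic_intensity(t) > arithmetic_intensity(best):
                    best = t
    return best


budget = 64 * 1024  # hypothetical 64 KiB of local memory
sym = best_tiling(budget, asymmetric=False)
asym = best_tiling(budget, asymmetric=True)
for name, t in (("symmetric", sym), ("asymmetric", asym)):
    print(name, t,
          f"AI={arithmetic_intensity(t):.1f}",
          f"calls={kernel_calls_per_update(t)}")
```

Under these assumptions, shrinking the buffered input tile frees memory for a larger output tile, which raises arithmetic intensity, while the number of kernel calls per output tile grows. A fuller model would charge a cost per call and pick the tiling that minimises total time rather than the one that maximises intensity alone.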

Asymmetric Tile Buffering Accelerates Matrix Multiplication

Researchers have achieved a significant advance in matrix multiplication, a core operation in artificial intelligence, by implementing a technique called asymmetric tile buffering (ATB). Experiments reveal a speedup of up to 4.54x. The team developed a detailed performance model that explains the benefit of ATB, specifically its ability to increase arithmetic intensity, while also accounting for the overhead of increased kernel switching costs. This model provides insight into selecting tiling factors that maximise performance. Further optimisation involved double buffering of input registers, a technique in which two disjoint register sets are alternated so that operands can be preloaded while compute operations are in progress, effectively hiding load latency.
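The alternation between two register sets can be sketched in ordinary Python. This is only a schematic of the scheduling pattern, not the team's kernel: `load_operands` and `dot_double_buffered` are hypothetical names, the dot product stands in for the real compute, and Python runs the load and the compute one after the other, whereas on hardware the two would overlap.

```python
def load_operands(a, b, step, width):
    """Stand-in for a register load: copy one chunk of each operand."""
    lo = step * width
    return a[lo:lo + width], b[lo:lo + width]


def dot_double_buffered(a, b, width=4):
    """Dot product where the load for step i+1 is issued before computing step i."""
    assert len(a) == len(b) and len(a) % width == 0
    steps = len(a) // width
    if steps == 0:
        return 0
    regs = [None, None]                      # two disjoint "register sets"
    regs[0] = load_operands(a, b, 0, width)  # prologue: fill the first set
    acc = 0
    for i in range(steps):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < steps:
            # Preload the next chunk into the set that is currently idle.
            regs[nxt] = load_operands(a, b, i + 1, width)
        xa, xb = regs[cur]
        # Compute only on the set that was filled one step earlier.
        acc += sum(x * y for x, y in zip(xa, xb))
    return acc


print(dot_double_buffered(list(range(8)), [1] * 8))
```

Because each compute step reads a set that was filled during the previous step, it never has to wait on the load issued in the same step, which is the property that lets hardware hide load latency.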

Measurements confirm that this approach reduces read-after-write (RAW) dependencies between loads and compute, improving efficiency. Researchers also explored sharing inputs across multiple parallel accumulation chains, which reduces the number of loads per vector multiply-accumulate (VMAC) operation and alleviates pressure on bandwidth. By overlapping successive chain clusters in a software-pipelined schedule, the team minimised gaps between clusters, further increasing utilisation.
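A simple count shows why sharing inputs lowers the load-to-VMAC ratio. The model below is an assumption made for illustration and is not taken from the paper: it supposes that each combination of one loaded A operand and one loaded B operand feeds its own accumulation chain, so a group of loads serves several VMACs at once. The function name `loads_per_vmac` is hypothetical.

```python
from fractions import Fraction


def loads_per_vmac(a_operands: int, b_operands: int) -> Fraction:
    """Loads issued per VMAC when every (A, B) operand pair feeds its own chain."""
    if a_operands < 1 or b_operands < 1:
        raise ValueError("need at least one operand of each kind")
    loads = a_operands + b_operands   # each operand is loaded once
    vmacs = a_operands * b_operands   # and reused by every chain that needs it
    return Fraction(loads, vmacs)


for shape in [(1, 1), (2, 2), (4, 2), (4, 4)]:
    print(shape, loads_per_vmac(*shape))
```

With a single chain, every VMAC needs two fresh loads. As the number of chains sharing operands grows, the ratio falls, which is consistent with the reduced bandwidth pressure described above.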

Asymmetric Buffering Accelerates Matrix Multiplication Performance

This work introduces asymmetric tile buffering, a novel tiling strategy that improves the efficiency of general matrix multiplication, a core operation in modern artificial intelligence workloads. Researchers demonstrate that decoupling the buffered tile dimensions of input and output operands reduces buffer pressure and raises arithmetic intensity beyond what conventional symmetric buffering achieves. The team reports speedups of up to 4.54x over the state-of-the-art GEMM implementation and throughput increases of up to 40% compared to highly optimised symmetric kernels. While the current implementation demonstrates substantial gains, future work will focus on automating kernel generation through automated search and scheduling algorithms.

👉 More information
🗞 Can Asymmetric Tile Buffering Be Beneficial?
🧠 ArXiv: https://arxiv.org/abs/2511.16041

Rohail T.
