Transformer-based deep learning models now power many applications, but deploying them on devices with limited resources, such as laptops and games consoles, poses a significant challenge to speed and efficiency. Aadesh Deshmukh, Venkata Yaswanth Raparti, and Samuel Hsu from AMD address this problem with a new framework called Zen-Attention, designed to optimise performance on neural processing units. The team’s research tackles the complex task of efficiently mapping dynamic attention layers onto these processors, systematically exploring different configurations for data handling and memory usage. Results demonstrate that Zen-Attention achieves up to a fourfold speed-up in attention processing and up to a 32% reduction in overall network latency compared to conventional methods, paving the way for more powerful and efficient artificial intelligence on a wider range of devices.
Current deep learning models demand increasing computational resources, and the industry is turning to neural processing units (NPUs) for superior performance-per-watt. However, efficiently mapping dynamic attention layers to these NPUs remains a challenging task. This paper introduces Zen-Attention, a framework that optimises DRAM bandwidth utilisation in the attention layers of transformer models by systematically exploring the complex design space of layer folding, tiling, and data movement.
Zen-Attention Optimizes NPU Transformer Performance
Researchers are developing methods to improve the performance of attention mechanisms, a core component of modern transformer architectures, on specialized hardware called Neural Processing Units (NPUs). Attention mechanisms are often limited by the speed of data transfer rather than computational power, creating a memory bottleneck. This work introduces Zen-Attention, a framework that employs a two-step process of graph optimization and tiling to overcome these limitations. First, the framework analyzes the computational graph and combines multiple operations into a single, folded operation, reducing the number of data transfers and maximizing the utilization of the NPU’s on-chip memory.
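To make the folding idea concrete, here is a minimal NumPy sketch, written under assumptions rather than taken from AMD's actual kernels or API: the unfused baseline writes every intermediate tensor out in full, while the folded variant applies the matrix multiply, softmax, and value multiply per tile of query rows, so intermediates only ever exist at tile size and can stay resident in on-chip memory.

```python
import numpy as np

def attention_unfused(Q, K, V):
    """Baseline: each step produces a full intermediate tensor.
    On an NPU this implies spilling scores/probabilities to DRAM between kernels."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # full (seq, seq) written out
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # second pass over DRAM-resident data
    return probs @ V                                   # third pass

def attention_folded(Q, K, V, tile=64):
    """Folded sketch: matmul, softmax, and the value matmul are applied per tile
    of query rows, so intermediates never exceed (tile, seq) and can stay on chip.
    The tile size is an illustrative assumption."""
    out = np.empty_like(Q)
    scale = 1.0 / np.sqrt(Q.shape[-1])
    for r in range(0, Q.shape[0], tile):
        q = Q[r:r + tile]                              # query tile loaded once from DRAM
        s = q @ K.T * scale                            # stays "on chip"
        p = np.exp(s - s.max(-1, keepdims=True))
        p /= p.sum(-1, keepdims=True)
        out[r:r + tile] = p @ V                        # only the result is written back
    return out
```

In the folded variant only the query tiles, K, V, and the output cross the DRAM boundary; the full sequence-by-sequence score matrix never does, which is the kind of data-transfer saving described above.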
The tiling process then determines how to divide the input data into sub-volumes that fit within the NPU’s memory hierarchy. A key aspect of Zen-Attention is its handling of data transformations, such as transposing matrix dimensions and padding input tensors. The framework incorporates a “Folding-Preserving Transpose” mechanism that efficiently handles transpositions without requiring additional memory buffers and leverages the NPU’s padding capabilities to minimize dedicated padding operations. Experiments conducted on AMD Ryzen AI processors with a 32-core NPU demonstrate significant performance improvements.
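The transpose and padding handling can be sketched in the same spirit. The snippet below is a hypothetical illustration, not the paper's implementation: the transpose of K is absorbed into the tiled loop through the index order of the contraction, so no buffer holding a transposed copy of K is ever allocated, and inputs are padded up to the tile boundary (on the real hardware this padding is assumed to happen during the data transfer rather than in a dedicated operation).

```python
import numpy as np

def pad_to_multiple(x, tile, axis=0):
    """Pad a tensor along one axis up to the next tile multiple.
    Stands in for padding applied during the DMA transfer on the NPU."""
    pad = (-x.shape[axis]) % tile
    if pad == 0:
        return x
    widths = [(0, 0)] * x.ndim
    widths[axis] = (0, pad)
    return np.pad(x, widths)

def scores_without_materialized_transpose(Q, K, tile=64):
    """Compute Q @ K^T tile by tile while reading K in its stored layout.
    The transpose is expressed through the einsum index order, so no extra
    buffer holding K^T is allocated -- a stand-in for the idea behind the
    'Folding-Preserving Transpose' described above."""
    Qp, Kp = pad_to_multiple(Q, tile), pad_to_multiple(K, tile)
    S = np.zeros((Qp.shape[0], Kp.shape[0]), dtype=Qp.dtype)
    for r in range(0, Qp.shape[0], tile):
        for c in range(0, Kp.shape[0], tile):
            q, k = Qp[r:r + tile], Kp[c:c + tile]
            S[r:r + tile, c:c + tile] = np.einsum("id,jd->ij", q, k)
    return S[:Q.shape[0], :K.shape[0]]                 # drop the padded rows and columns
```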
The results show up to a four-fold reduction in attention block latency and up to a 32% improvement in end-to-end network latency for certain models. These gains are particularly pronounced for models with larger sequence and context lengths, where data transfer bottlenecks are more significant. Even for smaller models, the framework demonstrates lower latency and reduced DRAM bandwidth utilization, potentially benefiting concurrent applications running on the system. The framework was tested on a variety of models, including ViT-base-patch-16, CLIP, and BERT, showcasing its adaptability and effectiveness across different architectures.
Zen-Attention Optimizes Transformers for Limited Resources
Researchers have developed Zen-Attention, a framework that significantly optimizes the performance of transformer-based deep learning models on devices with limited resources, such as laptops and gaming consoles. The team addresses the challenge of efficiently mapping dynamic attention layers to neural processing units (NPUs), which are increasingly used for their superior performance-per-watt. Zen-Attention systematically explores the interplay of layer folding, tiling, and data movement to maximize efficiency on systems that share memory between the NPU and the host processor. This approach unlocks substantial gains by carefully managing how data is accessed and processed within the NPU’s architecture.
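The "systematic exploration" can be pictured as a cost-model search over candidate configurations. The following sketch is purely illustrative; the candidate tile sizes, memory budget, and traffic formula are assumptions for illustration, not values from the paper. It enumerates tilings, discards those whose working set exceeds the on-chip budget, and keeps the one with the lowest estimated DRAM traffic.

```python
from itertools import product

def dram_traffic_bytes(seq, ctx, dim, tile_q, tile_k, dtype_bytes=2):
    """Rough traffic estimate for one folded attention layer: Q is streamed once,
    K and V are re-read once per Q tile, and the output is written once.
    tile_k affects only the on-chip footprint in this simplified model."""
    q_tiles = -(-seq // tile_q)                        # ceiling division
    q_io = seq * dim
    kv_io = q_tiles * 2 * ctx * dim                    # K and V per Q tile
    out_io = seq * dim
    return (q_io + kv_io + out_io) * dtype_bytes

def working_set_bytes(dim, tile_q, tile_k, dtype_bytes=2):
    """On-chip footprint of one folded step: a Q tile, a K/V tile pair,
    and the partial score block held between the fused operations."""
    return (tile_q * dim + 2 * tile_k * dim + tile_q * tile_k) * dtype_bytes

def best_tiling(seq, ctx, dim, sram_budget=2 * 1024 * 1024):
    """Enumerate candidate tilings and keep the cheapest one that fits."""
    candidates = [32, 64, 128, 256]
    feasible = [
        (dram_traffic_bytes(seq, ctx, dim, tq, tk), tq, tk)
        for tq, tk in product(candidates, candidates)
        if working_set_bytes(dim, tq, tk) <= sram_budget
    ]
    return min(feasible) if feasible else None

print(best_tiling(seq=1024, ctx=1024, dim=768))        # (estimated bytes, tile_q, tile_k)
```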
Evaluation of representative transformer models demonstrates that Zen-Attention can achieve up to a fourfold improvement in the latency of the attention block and up to a 32% improvement in overall network latency compared to standard approaches. The benefits are most pronounced in models where memory bandwidth is a significant bottleneck, although even in scenarios where computation is the limiting factor, the framework still delivers around an 8% latency reduction and lowers memory bandwidth usage. The authors acknowledge that the improvements vary with the model and input dimensions. The reduced DRAM bandwidth demand is particularly beneficial for systems running multiple applications simultaneously, as it frees up shared memory resources. Evaluations across diverse models, including ViT-base-patch-16, CLIP, and BERT, consistently demonstrate the framework’s effectiveness, with even the BERT model showing improvements in resource utilization. Zen-Attention represents a significant step toward deploying powerful deep learning models on resource-constrained devices, paving the way for more efficient and responsive applications.
👉 More information
🗞 Zen-Attention: A Compiler Framework for Dynamic Attention Folding on AMD NPUs
🧠 ArXiv: https://arxiv.org/abs/2508.17593
