The pursuit of increasingly powerful artificial intelligence models demands both innovative algorithms and the hardware to support them, and a new study by Quentin Anthony, Yury Tokpanov, Skyler Szot, and colleagues demonstrates a fully optimized system for training these models. The team meticulously characterizes a complete training stack built around AMD’s MI300X GPUs and Pollara interconnect, delivering crucial insights into system performance and model design. This work goes beyond theoretical possibilities, showcasing a practical pathway to competitive large-scale pretraining and introducing ZAYA1, an 8.3-billion-parameter mixture-of-experts model that matches, and in some cases exceeds, leading models such as Qwen3 and Gemma3 at a similar scale. By detailing the hardware and software components alongside a novel model architecture, the researchers establish a mature and optimized foundation for future advances in artificial intelligence.
ZAYA1 Training Infrastructure and Optimizations
This section details the infrastructure, optimization techniques, and performance measurements used to train the ZAYA1 large language model, covering the specifications of the compute, storage, and login nodes within the cluster alongside an analysis of storage requirements and input/output performance. That analysis relies on a metric called the Scatter Factor to model real-world input/output behavior and to confirm that the storage system can sustain the demands of large-scale training. The section also presents communication performance data for inter-node and intra-node operations.

On the model side, the team developed ZAYA1, a mixture-of-experts (MoE) model with 760 million active and 8.3 billion total parameters, built around Compressed Convolutional Attention (CCA), a novel attention mechanism that performs sequence-mixing in a compressed latent space. CCA significantly reduces the memory and computational cost of attention during training and prefill while matching the performance of state-of-the-art attention methods, enabling the training of larger models; its variant CCGQA, which combines CCA with Grouped-Query Attention, achieves an eight-fold compression of the key-value cache. To enhance routing expressivity, the researchers replaced the standard linear router with a multilayer perceptron and integrated Exponential Depth Averaging (EDA), which averages each layer’s router representation with that of the previous layer using a learned coefficient.
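To make the idea concrete, the following is a minimal, hypothetical sketch of a CCA/CCGQA-style block under the description above: the residual stream is down-projected into a compressed latent space, a short causal depthwise convolution mixes the sequence there, and attention runs with far fewer key/value heads than query heads so the cached K/V tensors are a small fraction of the hidden size. The dimensions, the placement of the convolution, and all layer choices here are illustrative assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedConvGQA(nn.Module):
    """Illustrative sketch of compressed-latent attention with grouped K/V heads."""

    def __init__(self, d_model=2048, n_q_heads=16, n_kv_heads=2, d_head=16, kernel=4):
        super().__init__()
        self.n_q, self.n_kv, self.d_head = n_q_heads, n_kv_heads, d_head
        d_lat = (n_q_heads + 2 * n_kv_heads) * d_head          # compressed q,k,v width
        self.in_proj = nn.Linear(d_model, d_lat, bias=False)    # down-projection
        # Short causal depthwise convolution over the sequence, in latent space.
        self.conv = nn.Conv1d(d_lat, d_lat, kernel, padding=kernel - 1, groups=d_lat)
        self.out_proj = nn.Linear(n_q_heads * d_head, d_model, bias=False)

    def forward(self, x):                                        # x: (B, T, d_model)
        B, T, _ = x.shape
        lat = self.in_proj(x)                                    # compress to latent space
        lat = self.conv(lat.transpose(1, 2))[..., :T].transpose(1, 2)  # causal conv
        q, k, v = lat.split([self.n_q * self.d_head,
                             self.n_kv * self.d_head,
                             self.n_kv * self.d_head], dim=-1)
        q = q.view(B, T, self.n_q, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        # Grouped-query attention: each K/V head is shared by a group of query heads.
        rep = self.n_q // self.n_kv
        y = F.scaled_dot_product_attention(
            q, k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1),
            is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(B, T, -1))
```

Because only the small K/V projections are cached per token, the memory footprint of the key-value cache shrinks by the ratio of the full hidden size to the compressed K/V width, which is the mechanism behind the reported eight-fold compression.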
Researchers implemented an advanced bias-balancing scheme inspired by proportional-integral-derivative (PID) controllers from classical control theory, and used AdamW optimization to improve convergence speed and stability. The ZAYA1 router down-projects the residual stream to a smaller dimension before applying EDA and a three-layer MLP to produce router scores, then selects experts via a top-k operation guided by learned balancing biases. Detailed microbenchmarks were performed on the core collectives (all-reduce, reduce-scatter, all-gather, and broadcast), representing the first such analysis at this scale on the Pollara network. The results demonstrate that careful consideration of compute-memory balance and network interface characteristics is essential for maximizing performance.
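A minimal sketch of a router in this style is shown below, assuming the structure described above: down-projection of the residual stream, EDA mixing with the previous layer’s router state via a learned coefficient, a three-layer MLP for scores, and top-k expert selection on bias-adjusted scores. The dimensions, the sigmoid parameterization of the EDA coefficient, and the handling of the balancing bias are illustrative assumptions; in particular, the PID-style bias update itself would live in the training loop, outside this module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zaya1StyleRouter(nn.Module):
    """Illustrative MoE router: down-projection, EDA, 3-layer MLP, biased top-k."""

    def __init__(self, d_model=2048, d_router=128, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.down = nn.Linear(d_model, d_router, bias=False)     # down-projection
        self.alpha = nn.Parameter(torch.tensor(0.0))             # learned EDA coefficient
        self.mlp = nn.Sequential(                                 # three-layer router MLP
            nn.Linear(d_router, d_router), nn.SiLU(),
            nn.Linear(d_router, d_router), nn.SiLU(),
            nn.Linear(d_router, n_experts),
        )
        # Per-expert balancing bias, updated outside the gradient path
        # (e.g. by a PID-style controller tracking expert load).
        self.register_buffer("bias", torch.zeros(n_experts))

    def forward(self, x, prev_state=None):                        # x: (tokens, d_model)
        h = self.down(x)
        if prev_state is not None:                                # EDA: blend with previous layer
            a = torch.sigmoid(self.alpha)
            h = a * h + (1 - a) * prev_state
        scores = self.mlp(h)                                      # (tokens, n_experts)
        # The bias only shifts which experts are selected; gate weights
        # come from the unbiased scores.
        _, top_idx = (scores + self.bias).topk(self.top_k, dim=-1)
        gates = F.softmax(scores.gather(-1, top_idx), dim=-1)
        return top_idx, gates, h        # h is passed to the next layer's router as prev_state
```

Passing `h` forward as `prev_state` is what gives the exponential-depth-averaging behavior: each layer’s routing decision sees a running, learnably weighted average of the router representations from earlier layers.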
The team introduced ZAYA1-base, a 760 million active, 8.3 billion total parameter mixture-of-experts (MoE) transformer, trained on this AMD platform. Performance evaluations reveal that ZAYA1-base achieves results comparable to leading base models such as Qwen3-4B and Gemma3-12B at and above its scale, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks.
AMD MI300X Achieves Leading Language Model Performance
This work presents the first comprehensive study of large-scale language model pretraining using AMD infrastructure, demonstrating the production readiness of the MI300X GPU and Pollara networking stack. Researchers established detailed networking benchmarks for Pollara across essential communication operations, and developed transformer sizing guidelines specifically for the MI300X architecture, alongside characterization of its memory bandwidth capabilities. The team also documented a complete cluster architecture, detailing crucial elements such as fault-tolerance systems and checkpoint reshaping utilities. Validation of these architectural innovations comes through the ZAYA1-base model, which achieves performance comparable to leading base models like Qwen3-4B and Gemma3-12B at a similar scale, and surpasses models including Llama-3-8B and OLMoE in reasoning, mathematics, and coding tasks. The study highlights the effectiveness of context-parallelism, a specialized router design, and lightweight residual scaling in optimizing model performance. The authors note that further improvements to the ZAYA1 model are planned, and future work will likely build upon the foundations established in this study.
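For readers who want to reproduce the flavor of such collective measurements on their own cluster, the sketch below times all-reduce over increasing message sizes with torch.distributed and reports an approximate bus bandwidth. This is not the authors’ benchmark harness, only a minimal example under common assumptions; on ROCm systems the "nccl" backend in PyTorch maps to RCCL, and the ring all-reduce bandwidth formula is the standard 2(n-1)/n approximation.

```python
import os
import time
import torch
import torch.distributed as dist

def bench_all_reduce(sizes_mib=(1, 16, 256, 1024), iters=20, warmup=5):
    """Time fp16 all-reduce at several message sizes and print rough bus bandwidth."""
    dist.init_process_group("nccl")                       # RCCL on AMD GPUs
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)
    for mib in sizes_mib:
        n = mib * 1024 * 1024 // 2                        # number of fp16 elements
        buf = torch.randn(n, dtype=torch.float16, device=device)
        for _ in range(warmup):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        dt = (time.perf_counter() - t0) / iters
        # Ring all-reduce moves roughly 2*(world-1)/world of the buffer per rank.
        busbw = (2 * (world - 1) / world) * (mib / 1024) / dt   # GiB/s
        if rank == 0:
            print(f"all_reduce {mib:>5} MiB: {dt*1e3:7.2f} ms, bus bw ~{busbw:6.1f} GiB/s")
    dist.destroy_process_group()

if __name__ == "__main__":
    bench_all_reduce()
```

Launched with, for example, `torchrun --nproc_per_node=8 bench_all_reduce.py` on each node, this kind of sweep makes it easy to compare intra-node and inter-node collective behavior across message sizes, in the spirit of the Pollara measurements described above.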
👉 More information
🗞 Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
🧠 ArXiv: https://arxiv.org/abs/2511.17127
