Researchers are addressing a critical bottleneck in training large language models: the efficient scaling of matrix-based optimizers such as Shampoo, Muon, and SOAP. Liangyu Wang from KAUST, Siqi Zhang from Alibaba Group, Junjie Wang from Peking University, and colleagues present Canzona, a novel framework designed to reconcile the need for holistic optimizer updates with the fragmented tensor distribution inherent in distributed training systems like Megatron. This work is significant because it overcomes the limitations of both synchronous and layer-wise partitioning strategies by decoupling logical assignment from physical parameter distribution. Through alpha-Balanced Static Partitioning for Data Parallelism and an asynchronous pipeline with Micro-Group Scheduling for Tensor Parallelism, Canzona achieves a 1.57x speedup in end-to-end iteration time and a 5.8x reduction in optimizer step latency on the Qwen3 model family, demonstrating substantial performance gains in large-scale distributed training.
The increasing size of LLMs necessitates efficient training algorithms, with matrix-based optimizers demonstrating superior convergence compared to conventional methods. However, these optimizers require holistic updates of parameters, conflicting with the fragmented tensor distribution inherent in distributed training frameworks like Megatron.
This work introduces a solution that decouples the logical assignment of optimizer tasks from the physical distribution of parameters, enabling efficient and scalable training: parameter updates remain atomic, and communication overhead during the optimizer step is minimized. The result is a 1.57x speedup in end-to-end iteration time and a 5.8x reduction in optimizer step latency compared to baseline methods, accelerating the crucial optimization phase of LLM training.
The core innovation lies in Canzona’s ability to maintain the efficiency of established parallel architectures while accommodating the demands of advanced matrix-based optimizers. By enforcing parameter atomicity through strategic partitioning and employing asynchronous computation, the framework unlocks significant performance gains. This advancement paves the way for training even larger and more complex language models, accelerating progress in artificial intelligence and natural language processing.
Implementation of asynchronous data and tensor parallelism with parameter decoupling
Canzona, a unified, asynchronous, and load-balanced framework, addresses the conflict between matrix-based optimizers and distributed training paradigms. On the data-parallel side, its alpha-Balanced Static Partitioning strategy optimizes the static layout by redistributing whole parameters to equalize the workload across processing ranks.
Parameters remain replicated across ranks during the forward and backward passes, adhering to the ZeRO-1 protocol, while the intra-node asynchronous tensor-parallel pipeline performs communication-efficient optimizer updates via fused All-to-All operations.
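As a rough sketch of what whole-parameter balancing across data-parallel ranks could look like in practice (the `Param` class, `estimate_cost`, and the greedy heuristic below are illustrative assumptions, not the paper's implementation):

```python
from dataclasses import dataclass

@dataclass
class Param:
    name: str
    shape: tuple

def estimate_cost(p: Param) -> float:
    # Rough proxy for the optimizer-step cost of one whole parameter,
    # here just its element count (an assumption for this sketch).
    n = 1
    for d in p.shape:
        n *= d
    return float(n)

def balanced_static_partition(params: list[Param], num_ranks: int) -> list[list[Param]]:
    """Greedy longest-processing-time assignment: each whole (unsharded)
    parameter is owned by exactly one rank, so its matrix-based update stays
    atomic and needs no cross-rank communication during the optimizer step."""
    buckets: list[list[Param]] = [[] for _ in range(num_ranks)]
    loads = [0.0] * num_ranks
    for p in sorted(params, key=estimate_cost, reverse=True):
        r = loads.index(min(loads))  # rank with the lightest load so far
        buckets[r].append(p)
        loads[r] += estimate_cost(p)
    return buckets
```

Because each parameter's optimizer state lives entirely on one rank, the update itself needs no cross-rank traffic; only the choice of owner is being balanced.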
The system fundamentally decouples the logical assignment of optimizer tasks from the physical distribution of parameters, enabling zero communication during optimizer updates and preserving ZeRO-style geometric alignment. Each optimizer task is assigned a designated host rank and executed asynchronously in parallel, allowing computation on different ranks to overlap and reducing the optimizer-step makespan.
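A minimal sketch of that decoupling, assuming a simple task table mapping parameter names to host ranks and using a thread pool as a stand-in for the framework's asynchronous execution engine:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_local_optimizer_tasks(task_table: dict[str, int],
                              my_rank: int,
                              update_fn: Callable[[str], None]) -> None:
    """task_table maps a parameter name to its host rank; update_fn performs the
    matrix-based update for one parameter. Both are assumptions for this sketch."""
    local = [name for name, host in task_table.items() if host == my_rank]
    # Launch all locally hosted updates asynchronously so their compute overlaps,
    # then wait for completion before the next training step.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(update_fn, name) for name in local]
        for f in futures:
            f.result()  # re-raises any exception from a failed update
```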
Addressing the challenge of load imbalance, the work formulates partitioning as a load-balancing optimization problem. The resulting partitioning strategies aim to eliminate computational stragglers and pipeline bubbles, equalizing execution times across heterogeneous workloads.
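One natural reading of this formulation, written here with assumed notation rather than the paper's own, is a makespan-minimization problem: pick the assignment of whole-parameter update tasks to ranks that minimizes the cost of the most heavily loaded rank.

```latex
% Hedged sketch of the load-balancing objective (notation assumed):
% c(p) is the estimated update cost of parameter p, A_r is the set of
% parameters assigned to rank r, and R is the number of ranks.
\min_{A}\; \max_{r \in \{1, \dots, R\}} \; \sum_{p \in A_r} c(p)
```

Driving down the heaviest rank's total cost is what removes stragglers and pipeline bubbles from the optimizer step.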
Evaluations conducted on 256 GPUs training the Qwen3 model family, ranging from 1.7B to 32B parameters, demonstrated a 1.57x speedup in end-to-end iteration time and a 5.8x reduction in optimizer step latency compared to baseline methods, indicating a significant acceleration of the optimization phase. This performance improvement stems from a unified, asynchronous, and load-balanced approach to distributed matrix-based optimizers, designed specifically for large language models.
The data-parallel strategy enforces a zero-communication layout during optimizer steps, keeping the step off the global network and enabling efficient scaling. Intra-node tensor parallelism uses an asynchronous pipeline with Micro-Group Scheduling, batching fragmented updates to hide reconstruction overhead.
This technique effectively eliminates redundant computation, as shown by contrasting standard synchronous tensor-parallel computation with the proposed asynchronous strategy. Specifically, the asynchronous compute unit abstracts parameter updates as atomic “Compute Tasks” assigned to Host Ranks, ensuring strict locality for optimizer states and avoiding unnecessary data transmission.
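A hypothetical representation of such an atomic task (field names are illustrative, not taken from the paper) could be as simple as:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComputeTask:
    param_name: str  # parameter whose full, unsharded update this task performs
    host_rank: int   # rank holding the optimizer state and running the matrix update
    cost: float      # estimated compute cost, used later for micro-group balancing
```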
The system constructs Micro Groups by sorting parameters based on computational cost and employing a greedy rollback algorithm to balance the load across ranks. This algorithm, detailed as Algorithm 2, prioritizes minimizing computational imbalance and maximizing group saturation. The lifecycle of a task group involves four key stages: All-to-All gradient gathering, asynchronous computation on Host Ranks, All-to-All update scattering, and local parameter updates.
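The sketch below shows one plausible greedy-with-rollback grouping under these constraints; it is not Algorithm 2 itself, and the per-rank budget heuristic is an assumption:

```python
def build_micro_groups(costs: dict[str, float],
                       num_ranks: int,
                       budget_per_rank: float) -> list[list[str]]:
    """Pack parameters (sorted by estimated cost) into micro groups so that no
    rank in a group is loaded far beyond the others; a hypothetical stand-in
    for the paper's greedy rollback construction."""
    names = sorted(costs, key=costs.get, reverse=True)
    groups: list[list[str]] = []
    current: list[str] = []
    load = [0.0] * num_ranks
    for name in names:
        r = load.index(min(load))  # least-loaded rank would host this task
        if current and load[r] + costs[name] > budget_per_rank:
            # "Rollback": close the saturated group instead of letting one rank
            # become the straggler, then start a fresh group for this parameter.
            groups.append(current)
            current, load = [], [0.0] * num_ranks
            r = 0
        current.append(name)
        load[r] += costs[name]
    if current:
        groups.append(current)
    return groups
```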
By fusing asynchronous gather operations into single All-to-All collectives, the framework saturates communication bandwidth and avoids overhead from small messages. This approach efficiently routes gradient shards to Host Ranks where locally resident optimizer states await, enabling matrix-based operations like the Muon step. The hierarchical partitioning strategy further optimizes the process by minimizing computational imbalance and maximizing group saturation across the entire model.
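To make this update path concrete, the hedged sketch below fuses the gradient routing into a single `torch.distributed.all_to_all_single` call and uses a Newton-Schulz iteration as a stand-in for the Muon-style matrix step; buffer layouts, split sizes, and coefficients are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.distributed as dist

def fused_gather_gradients(send_buf: torch.Tensor,
                           in_splits: list[int],
                           out_splits: list[int]) -> torch.Tensor:
    """send_buf holds this rank's flattened gradient shards concatenated in
    destination-rank order; a single all_to_all_single replaces many small
    point-to-point gathers (process group assumed to be initialized)."""
    recv_buf = torch.empty(sum(out_splits), dtype=send_buf.dtype,
                           device=send_buf.device)
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits)
    return recv_buf

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Muon-style whitening of a reassembled gradient matrix on its host rank.
    Coefficients are a commonly published choice, used here only as an example;
    the tall-matrix (transpose) case is omitted for brevity."""
    x = g / (g.norm() + 1e-7)
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x
```

On the scatter side, the same fused collective runs in reverse to return the computed updates to the ranks that physically hold each shard.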
Decoupled optimization and parallel scheduling accelerate large language model training
Canzona, a unified, asynchronous, and load-balanced framework, addresses the conflict between matrix-based optimizers and distributed training strategies for large language models. Evaluations using the Qwen3 model family, scaled up to 32 billion parameters across 256 GPUs, demonstrate substantial performance gains.
The framework achieves a 1.57-fold speedup in end-to-end iteration time and reduces optimizer step latency by a factor of 5.8 compared to baseline methods. This improvement stems from the elimination of communication overhead and the effective neutralization of the load imbalance inherent in second-order optimization techniques.
The authors acknowledge that existing solutions often fall short, with synchronous approaches introducing redundancy and layer-wise partitioning failing to fully reconcile geometric constraints. Canzona successfully addresses these limitations by maintaining efficiency while ensuring strict atomicity requirements are met. Future research could explore the application of this framework to even larger models and diverse hardware configurations, potentially further optimising performance and scalability in the rapidly evolving landscape of large language model training.
👉 More information
🗞 Canzona: A Unified, Asynchronous, and Load-Balanced Framework for Distributed Matrix-based Optimizers
🧠 ArXiv: https://arxiv.org/abs/2602.06079
