Training increasingly large neural networks presents significant challenges, especially concerning algorithmic stability and precision. Researchers Wei He, Kai Han, and Hang Zhou of Huawei Noah’s Ark Lab, together with colleagues, address these issues with a novel approach to optimization. The team introduces ROOT, a Robust Orthogonalized Optimizer, which enhances training stability through two key mechanisms designed to overcome the limitations of existing methods. ROOT employs a dimension-robust orthogonalization scheme, using adaptive Newton iterations to maintain consistent precision across different network architectures, and an optimization-robust framework that suppresses disruptive gradient noise while preserving essential gradient information. Extensive testing demonstrates that ROOT converges faster and performs better than current optimizers, particularly in challenging, noisy training environments, establishing a new standard for robust and precise large-scale neural network training.
Orthogonal Updates for Faster Neural Network Training
Muon is a new optimization algorithm designed to improve the training of large neural networks, with a particular focus on hidden layers. The core idea is to orthogonalize each weight-matrix update, preventing updates from collapsing into a narrow subspace and leading to faster convergence and better generalization. Muon specifically targets hidden layers and is designed to be computationally efficient for large-scale training. The research demonstrates improvements in both training speed and performance, highlighting Muon’s scalability for very large models. The study builds upon techniques such as Kronecker-Factored Approximate Curvature and the Shampoo preconditioned optimizer, and relates to methods such as Natural Gradient Descent and Decoupled Weight Decay. Evaluation drew on datasets and benchmarks including HellaSwag, Winogrande, ARC, Fineweb, and GPT-4. Researchers identified that existing orthogonalization-based optimizers suffer from precision gaps across varying matrix dimensions and heightened vulnerability to outlier-induced gradient noise. To overcome these challenges, the team engineered a novel approach centered on dual robustness mechanisms. ROOT incorporates an adaptive Newton iteration scheme with fine-grained, dimension-specific coefficients to achieve algorithmic robustness, replacing fixed-coefficient approximations with a dynamic system that adjusts to the spectral properties of weight matrices.
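To make the orthogonalization step concrete, the sketch below shows a fixed-coefficient Newton-Schulz iteration of the kind used by Muon-style optimizers. The quintic coefficients are the publicly documented Muon values; ROOT's dimension-specific adaptive coefficients are not reproduced here, so treat this as an illustration of the fixed-coefficient baseline that ROOT refines, not the authors' implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix via Newton-Schulz iterations.

    Fixed quintic coefficients as used in Muon-style optimizers; ROOT is described
    as replacing these with coefficients adapted to the matrix dimensions.
    """
    a, b, c = 3.4445, -4.7750, 2.0315        # fixed-coefficient baseline
    X = G / (G.norm() + 1e-7)                # scale the spectrum into the convergence region
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation for efficiency
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # quintic Newton-Schulz update
    return X.T if transposed else X
```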
The team tailored iteration coefficients to specific matrix sizes, ensuring consistent precision across diverse configurations. An optimization-robust framework based on proximal optimization with soft-thresholding suppresses outlier-induced gradient noise while preserving meaningful gradient directions, stabilizing training without compromising convergence speed. Extensive experiments on large language model pre-training and fine-tuning demonstrate ROOT’s superior performance and faster convergence compared with state-of-the-art optimizers, particularly in noisy and non-convex scenarios.
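As a rough illustration of percentile-based soft-thresholding, the hypothetical helper below splits a gradient into a clipped inlier part and a sparse outlier excess; the function name is an assumption, the 0.90 default follows the percentile threshold reported later, and the exact proximal formulation in ROOT may differ.

```python
import torch

def suppress_outliers(grad: torch.Tensor, percentile: float = 0.90):
    """Split a gradient into an inlier part and a sparse outlier excess.

    Hypothetical sketch: tau is the given percentile of |grad|; soft-thresholding
    (the proximal operator of the L1 norm) extracts the excess above tau, and
    subtracting it leaves entries clipped to [-tau, tau].
    """
    tau = torch.quantile(grad.abs().flatten(), percentile)
    outliers = torch.sign(grad) * torch.clamp(grad.abs() - tau, min=0.0)  # sparse excess
    clean = grad - outliers                                               # clipped gradient
    return clean, outliers
```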
ROOT Algorithm Boosts Language Model Training
The work presents ROOT, a robust optimization algorithm designed to improve the training of large language models. Experiments demonstrate that ROOT achieves significantly improved robustness and faster convergence compared to existing optimizers like Muon and Adam. During pre-training of a 1-billion-parameter Transformer model, ROOT reached a final training loss of 2.5407, a 0.01 improvement over the Muon baseline.
ROOT incorporates a dimension-robust orthogonalization scheme using adaptive Newton iterations tailored to specific matrix sizes, consistently maintaining a lower relative error than fixed-coefficient baselines throughout the training process. Analysis revealed that ROOT minimizes approximation error across diverse matrix dimensions. An optimal percentile threshold of 0.90 effectively isolates noise while preserving gradient integrity. Evaluations on standard language model benchmarks, including HellaSwag, BoolQ, and PIQA, confirm that ROOT enhances both training convergence and final model quality, achieving competitive or superior performance across diverse academic tasks, with an average score of 60.12 across all benchmarks.
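The relative error tracked here can be read as how far the orthogonalized update deviates from a semi-orthogonal matrix; the helper below is one plausible way to measure it (the exact metric used in the paper may differ).

```python
import torch

def orthogonalization_error(X: torch.Tensor) -> float:
    """Relative Frobenius deviation of X from semi-orthogonality.

    One plausible metric for the relative error discussed above:
    ||X X^T - I||_F / ||I||_F, computed on the smaller Gram matrix.
    """
    G = X @ X.T if X.shape[0] <= X.shape[1] else X.T @ X
    I = torch.eye(G.shape[0], dtype=X.dtype, device=X.device)
    return (torch.linalg.norm(G - I) / torch.linalg.norm(I)).item()
```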
ROOT Optimizer Stabilizes Large Language Model Training
This work presents ROOT, a novel robust orthogonalized optimizer designed to address limitations in training large-scale language models. Researchers developed ROOT to improve both the precision and stability of the optimization process, which becomes increasingly challenging as model size grows. The method achieves this through a dimension-robust orthogonalization scheme utilizing adaptive Newton iterations, and an optimization-robust framework employing proximal outlier suppression. Experimental validation demonstrates that ROOT outperforms existing optimizers, including Muon and Adam-based methods, particularly in noisy and non-convex training scenarios.
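Putting the two mechanisms together, a single update might look like the hypothetical composition below, which reuses the suppress_outliers and newton_schulz_orthogonalize sketches from earlier. This shows how the pieces fit rather than the authors' implementation; momentum, weight decay, and per-layer scaling are simplified away.

```python
import torch

def root_style_step(weight: torch.Tensor, grad: torch.Tensor,
                    lr: float = 0.02, percentile: float = 0.90) -> None:
    """One illustrative optimizer step: suppress outliers, then orthogonalize.

    Hypothetical composition of the two sketches above; momentum, weight decay,
    and per-layer scaling used in practice are omitted for clarity.
    """
    clean, _ = suppress_outliers(grad, percentile)   # proximal outlier suppression
    update = newton_schulz_orthogonalize(clean)      # orthogonalized update direction
    weight.data.add_(update, alpha=-lr)              # in-place descent step
```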
Significant improvements in accuracy on the CIFAR-10 dataset, with gains of up to 3.77% achieved under different percentile threshold settings, suggest that ROOT effectively mitigates gradient noise and enhances generalization capabilities, even when applied to non-language modalities. This research establishes a new paradigm for developing robust optimization frameworks, potentially enabling more reliable and efficient training of next-generation AI systems.
👉 More information
🗞 ROOT: Robust Orthogonalized Optimizer for Neural Network Training
🧠 ArXiv: https://arxiv.org/abs/2511.20626
