Root Optimizer Enhances Neural Network Training with Dimension-Robust Orthogonalization and Stability Mechanisms

Training increasingly large neural networks presents significant challenges, especially for algorithmic stability and precision. Researchers Wei He, Kai Han, and Hang Zhou of Huawei Noah's Ark Lab, together with colleagues, address these issues with ROOT, a Robust Orthogonalized Optimizer that improves training stability through two mechanisms designed to overcome the limitations of existing methods. ROOT employs a dimension-robust orthogonalization scheme that uses adaptive Newton iterations to maintain consistent precision across different network architectures, and an optimization-robust framework that suppresses disruptive gradient noise while preserving essential gradient information. Extensive testing shows that ROOT converges faster and performs better than current optimizers, particularly in noisy training environments, setting a new standard for robust, precise large-scale neural network training.

Orthogonal Updates for Faster Neural Network Training

Muon is an optimization algorithm designed to improve the training of large neural networks, targeting the weight matrices of hidden layers while remaining computationally efficient at scale. Its core idea is to orthogonalize each update matrix so that updates do not collapse into a narrow subspace of dominant directions, leading to faster convergence and better generalization. The research demonstrates improvements in both training speed and final performance, highlighting Muon's scalability to very large models. The study builds on preconditioning techniques such as Kronecker-Factored Approximate Curvature and the Shampoo optimizer, and draws on related ideas including Natural Gradient Descent and Decoupled Weight Decay. Datasets and benchmarks including HellaSwag, Winogrande, ARC, Fineweb, and GPT-4 were used for evaluation.

The researchers identified that existing orthogonalization-based optimizers suffer from precision gaps across varying matrix dimensions and heightened vulnerability to outlier-induced gradient noise. To overcome these challenges, the team built ROOT around two robustness mechanisms. The first is an adaptive Newton iteration scheme with fine-grained, dimension-specific coefficients, replacing fixed-coefficient approximations with a dynamic scheme that adjusts to the spectral properties of weight matrices.
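
To make the contrast with fixed-coefficient methods concrete, here is a minimal PyTorch sketch of Newton-Schulz orthogonalization whose iteration coefficients are looked up by matrix shape. The coefficient table, its values, and the lookup-by-shape rule are illustrative assumptions rather than the paper's tuned scheme; the fallback triple is the fixed one commonly used in Muon implementations.

```python
import torch

# Hypothetical per-shape coefficient table (illustrative values only);
# a fixed-coefficient optimizer would use one triple for every layer.
_ADAPTIVE_COEFFS = {
    (1024, 1024): (3.5, -4.9, 2.1),
    (4096, 1024): (3.3, -4.6, 2.0),
}
_FIXED_COEFFS = (3.4445, -4.7750, 2.0315)  # fixed triple common in Muon implementations

def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix G via Newton-Schulz iterations."""
    a, b, c = _ADAPTIVE_COEFFS.get(tuple(G.shape), _FIXED_COEFFS)
    X = G / (G.norm() + 1e-7)          # scale so singular values lie in [0, 1]
    tall = X.shape[0] > X.shape[1]
    if tall:                            # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X
```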

The team tailored these coefficients to specific matrix sizes, ensuring consistent precision across diverse configurations. The second mechanism is an optimization-robust framework that uses proximal optimization with soft-thresholding to suppress outlier-induced gradient noise while preserving meaningful gradient directions, stabilizing training without compromising convergence speed. Extensive experiments on large language model pre-training and fine-tuning demonstrate ROOT's superior performance and faster convergence compared with state-of-the-art optimizers, particularly in noisy and non-convex scenarios.
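
A rough sense of how proximal soft-thresholding might suppress outlier entries is sketched below: the gradient is split into a sparse outlier component, obtained by shrinking entries toward zero past a percentile threshold, and a clean component that is kept. The threshold choice (a 0.90 quantile, echoing the setting reported later) and where this step sits in the update are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def suppress_outliers(G: torch.Tensor, q: float = 0.90) -> torch.Tensor:
    """Remove an outlier component from G via entrywise soft-thresholding.

    The soft-thresholded residual captures entries whose magnitude exceeds the
    q-th percentile; subtracting it clips each entry at that magnitude, so rare
    extreme values are tamed while typical gradient entries pass through.
    (Illustrative sketch; the paper's proximal step may differ in detail.)
    """
    lam = torch.quantile(G.abs().flatten(), q)                       # percentile threshold
    outliers = torch.sign(G) * torch.clamp(G.abs() - lam, min=0.0)   # soft-thresholding (prox of the L1 norm)
    return G - outliers                                              # outlier-suppressed gradient
```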

ROOT Algorithm Boosts Language Model Training

The work presents ROOT, a robust optimization algorithm designed to improve the training of large language models. Experiments demonstrate that ROOT achieves significantly improved robustness and faster convergence compared to existing optimizers like Muon and Adam. During pre-training on a 1 billion parameter Transformer model, ROOT reached a final training loss of 2.5407, a 0.01 improvement over the Muon baseline.

ROOT incorporates a dimension-robust orthogonalization scheme using adaptive Newton iterations, tailored to specific matrix sizes, that consistently maintains a lower relative error throughout the training process. Analysis revealed that ROOT minimizes approximation error across diverse matrix dimensions. An optimal percentile threshold of 0.90 effectively isolates noise while preserving gradient integrity. Evaluations on standard language model benchmarks, including HellaSwag, BoolQ, and PIQA, confirm that ROOT enhances both training convergence and final model quality, achieving competitive or superior performance across diverse academic tasks, with an average score of 60.12 across all benchmarks.
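
The exact definition of the relative error tracked here is not given; one plausible way to measure it, stated purely as an assumption, is the Frobenius-norm distance between the approximate orthogonal factor and the exact polar factor from an SVD, as in the short sketch below. Tracking this quantity across layers of different shapes is one way to compare fixed-coefficient and dimension-adaptive iterations.

```python
import torch

def relative_orthogonalization_error(G: torch.Tensor, X_approx: torch.Tensor) -> float:
    """Relative Frobenius error of an approximate orthogonal factor of G,
    measured against the exact polar factor U @ Vh from the SVD of G.
    (Illustrative metric; the paper's error definition may differ.)"""
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    exact = U @ Vh                                  # exact orthogonalization of G
    return ((X_approx - exact).norm() / exact.norm()).item()
```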

ROOT Optimizer Stabilizes Large Language Model Training

This work presents ROOT, a novel robust orthogonalized optimizer designed to address limitations in training large-scale language models. Researchers developed ROOT to improve both the precision and stability of the optimization process, which becomes increasingly challenging as model size grows. The method achieves this through a dimension-robust orthogonalization scheme utilizing adaptive Newton iterations, and an optimization-robust framework employing proximal outlier suppression. Experimental validation demonstrates that ROOT outperforms existing optimizers, including Muon and Adam-based methods, particularly in noisy and non-convex training scenarios.

Significant improvements in accuracy on the CIFAR-10 dataset, with gains of up to 3.77% achieved using different quantile percentile settings, suggest that ROOT effectively mitigates gradient noise and enhances generalization capabilities, even when applied to non-language modalities. This research establishes a new paradigm for developing robust optimization frameworks, potentially enabling more reliable and efficient training of next-generation AI systems.

👉 More information
🗞 ROOT: Robust Orthogonalized Optimizer for Neural Network Training
🧠 ArXiv: https://arxiv.org/abs/2511.20626

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
