Researchers are tackling the challenge of efficiently pre-training ever-larger language models, a crucial step in developing advanced artificial intelligence. Ruijie Zhang, Yequan Zhao, and Ziyue Liu, from the University of California, Santa Barbara, alongside Zhengyang Wang, Dongyang Li, and Yupeng Su, present TEON, a novel approach that builds upon the Muon optimizer by extending gradient orthogonalization beyond single layers. Their work is a significant advance because it treats neural network gradients as a higher-order tensor, offering improved convergence guarantees and demonstrably better performance on GPT and LLaMA architectures ranging from 60 million to 1 billion parameters, under a variety of approximate orthogonalization methods.
Tensorized optimization mitigates gradient rank collapse in large language model pre-training, improving scalability and efficiency
Scientists have developed a new optimizer, termed TEON, that significantly enhances the pre-training of large language models. The research addresses the resource-intensive nature of training models like GPT, DeepSeek, and LLaMA by improving pre-training efficiency. The team achieved this by moving beyond layer-wise gradient orthogonalization, the technique used in the Muon optimizer, to a tensorized approach that accounts for correlations across layers.
TEON models gradients as a structured higher-order tensor, enabling the optimizer to capture these inter-layer relationships during training and mitigate gradient rank collapse. The study presents TEON as a principled generalization of Muon, offering improved convergence guarantees substantiated by theoretical analysis and ablation studies.
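To make the distinction concrete, the sketch below contrasts the two granularities: a layer-wise, Muon-style step orthogonalizes each gradient matrix independently, while the tensorized view stacks same-shape gradient matrices into a third-order tensor and orthogonalizes the stack jointly. This is an illustrative sketch under our own assumptions (exact SVD-based orthogonalization, PyTorch, hypothetical shapes), not the authors' implementation.

```python
import torch

def orthogonalize(M: torch.Tensor) -> torch.Tensor:
    # Stand-in for exact gradient orthogonalization: return the orthogonal
    # polar factor U @ Vh from the SVD M = U S Vh.
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

# Four layers whose gradients share the same shape (m, n).
grads = [torch.randn(256, 512) for _ in range(4)]

# Layer-wise (Muon-style): each gradient matrix is orthogonalized on its own.
layerwise = [orthogonalize(G) for G in grads]

# Tensorized (TEON-style idea): stack the gradients into a third-order tensor,
# so one orthogonalization step sees cross-layer structure at once.
T = torch.stack(grads, dim=-1)                              # shape (256, 512, 4)
joint = orthogonalize(T.reshape(256, -1)).reshape(T.shape)  # unfold, orthogonalize, fold back
```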
Researchers developed a practical instantiation of TEON, validated through extensive experimentation on both GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, spanning 60M to 1B parameters. Experiments demonstrate that TEON consistently improves both training and validation perplexity across different model scales, showcasing its robustness under various approximate Singular Value Decomposition schemes.
This breakthrough establishes a novel approach to optimizer design for large language models, moving beyond independent layer-wise optimization. By treating gradients as a higher-order tensor, TEON captures cross-layer dependencies, leading to more efficient and stable training. The work opens avenues for further research into tensorized optimization methods and their application to increasingly large and complex neural networks. Preliminary results, shown in Figure 1, indicate that TEON consistently outperforms Muon when pre-training GPT-Small on 10 billion FineWeb tokens, with lower validation perplexity across different orthogonalization methods.
Tensorized gradient orthogonalization and careful evaluation underpin efficient large language model training
Scientists introduced TEON, a novel optimizer building upon the Muon framework to address limitations in large language model pre-training. The study pioneered a tensorized approach to gradient orthogonalization, extending beyond layer-wise operations to capture cross-layer correlations. Researchers formulated TEON by treating gradients as a structured higher-order tensor, enabling joint optimization across multiple layers.
This method achieves improved convergence guarantees compared to layer-wise Muon, substantiated by theoretical analysis and ablation studies. Experiments employed GPT-style architectures ranging from 130M to 774M parameters and LLaMA-style models spanning 60M to 1B parameters. The team ran five trials with differing random seeds for each configuration to estimate standard deviations and ensure statistical robustness.
Validation perplexity served as the primary metric, with lower values indicating better performance. TEON delivers consistent improvements in both training and validation perplexity across all scales tested, demonstrating its broad applicability. To implement TEON, the researchers first fixed tensor notation, representing a tensor as a multidimensional array whose order is its number of dimensions.
They then adapted the Muon update rule to operate on these tensors, replacing layer-wise matrix orthogonalization with tensor-level operations. Specifically, the study used mode-1 matricization, slicing the tensor into column fibers and arranging them as the columns of a matrix for processing. The Newton-Schulz iteration, commonly used in Muon to approximate the orthogonal factor of the singular value decomposition, was adapted to these tensor operations.
Researchers also incorporated a dimensional pre-factor of √(m/n), as suggested by previous work, to enhance scalability. This innovative methodology enables TEON to effectively capture and leverage inter-layer dependencies, resulting in enhanced pre-training performance.
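The sketch below pulls these steps together, with hedged assumptions: gradients from a group of same-shape layers are stacked into a third-order tensor, the mode-1 unfolding is taken along the first dimension, and the quintic Newton-Schulz coefficients are those found in widely used Muon implementations. The names `newton_schulz` and `teon_style_step`, the learning-rate handling, and the exact unfolding convention are illustrative choices, not the paper's reference code.

```python
import torch

def newton_schulz(X: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration (coefficients from common Muon
    # implementations) approximating the orthogonal polar factor of X.
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    X = X / (X.norm() + eps)           # scale so all singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def teon_style_step(grads, lr=0.02):
    # grads: K gradient matrices of identical shape (m, n) from a group of layers.
    G = torch.stack(grads, dim=-1)     # third-order gradient tensor, shape (m, n, K)
    m, n, K = G.shape
    G1 = G.reshape(m, n * K)           # mode-1 matricization: column fibers become columns
    O1 = newton_schulz(G1)             # joint approximate orthogonalization across layers
    O = O1.reshape(m, n, K)            # fold back into the tensor layout
    scale = (m / n) ** 0.5             # dimensional pre-factor sqrt(m/n)
    return [-lr * scale * O[..., k] for k in range(K)]
```

In this sketch, running Muon layer by layer would amount to K independent calls to `newton_schulz`; the single call on the mode-1 unfolding is what lets the orthogonalization account for cross-layer correlations.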
TEON optimizer enhances large language model pre-training and strengthens convergence guarantees
Scientists have developed TEON, a novel optimizer that builds upon Muon to improve pre-training of large language models. The research team proposes TEON as a principled generalization of Muon, extending gradient orthogonalization beyond individual layers by modeling gradients as a structured higher-order tensor.
Experiments demonstrate that TEON consistently improves both training and validation perplexity across a range of model scales, from 60M to 774M parameters. Results show TEON achieves improved convergence guarantees compared to layer-wise Muon, validated through theoretical analysis and ablation studies.
The team evaluated TEON on both GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, ranging from 60M to 1B parameters. Validation perplexity, a key metric for language model performance, was consistently lower with TEON across all tested model configurations. Measurements confirm TEON’s strong robustness under various approximate Singular Value Decomposition schemes.
Specifically, pre-training GPT-Small on 10 billion FineWeb tokens revealed that TEON consistently outperformed Muon across different orthogonalization methods, as illustrated in Figure 1 of the work. The study reports that TEON captures cross-layer correlations during training by applying gradient orthogonalization on a higher-order tensor, rather than treating each layer independently.
This breakthrough delivers a practical instantiation of TEON, guided by theoretical analysis and validated through extensive ablation studies. The team measured gains in pre-training efficiency, pointing to reduced resource demands when developing large foundation models. The results indicate that TEON’s tensorized approach to gradient orthogonalization is a significant advance over existing layer-wise methods, paving the way for more efficient and effective large language model training.
Tensor-level gradient orthogonalization enhances large language model pre-training stability and performance
Researchers have developed TEON, a novel optimization method that builds upon the existing Muon optimizer for pre-training neural networks. Unlike Muon, which applies gradient orthogonalization at the layer level, TEON extends this principle to the tensor level, modeling stacked gradient matrices as a higher-order tensor.
This tensor-level approach aims to improve the utilization of cross-layer gradient information during training, potentially leading to greater efficiency. The study demonstrates that TEON consistently enhances both training and validation perplexity across various model scales, ranging from 60 million to 1 billion parameters, using GPT-style and LLaMA-style architectures.
Theoretical analysis supports the improved convergence properties of TEON compared to layer-wise Muon, and ablation studies validate practical implementation guidance derived from this analysis. However, the authors acknowledge that approximating orthogonalization for efficiency can lead to degradation, particularly when using the PolarExpress method with larger stacking group sizes.
This work introduces a significant advancement in optimization techniques for large-scale neural network pre-training. By effectively leveraging cross-layer gradient information through tensorized orthogonalization, TEON offers a pathway to reduce the computational resources and energy consumption associated with pre-training foundation models. Future research could explore the application of TEON to other optimization algorithms and investigate methods to mitigate the performance degradation observed with certain approximation schemes.
👉 More information
🗞 TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training
🧠 ArXiv: https://arxiv.org/abs/2601.23261
