Training increasingly large language models demands efficient use of memory, and a common approach is to simplify the optimisation process with techniques such as low-rank projection. However, many of these methods introduce biases that hinder their performance compared with full-parameter training. Rui Pan (University of Illinois Urbana-Champaign), Yang Luo (National University of Singapore), Yuxing Liu (University of Illinois Urbana-Champaign), Yang You, and colleagues address this challenge by investigating a layerwise sampling technique that removes these biases. Their work results in a new method, GaLore Unbiased with Muon, which matches the convergence guarantees of existing optimisation algorithms while retaining the memory efficiency of low-rank techniques, delivering improved performance in both language model fine-tuning and pretraining and making more efficient use of the model's parameters.
GaLore Unbiased with Muon for LLMs
Researchers have developed a new method, GaLore Unbiased with Muon (GUM), to address the significant memory demands of training large language models (LLMs) while preserving training accuracy. Existing techniques that reduce memory usage often introduce biases that hinder performance. GUM tackles this challenge with a layerwise sampling strategy that removes the bias from low-rank updates, leading to more efficient and accurate training. Theoretical analysis confirms that GUM converges reliably, offering performance comparable to, and sometimes exceeding, full-rank updates and other low-rank methods.
GUM selectively updates a subset of model parameters in each iteration, ensuring a more representative and unbiased estimate of the gradient. This approach avoids the pitfalls of existing methods, such as GaLore, which can lead to uneven knowledge distribution within the model. Detailed analysis reveals that GUM maintains a more balanced spectrum of singular values, indicating a more stable and robust training process. The researchers acknowledge that careful hyperparameter tuning is needed to manage the variance introduced by the sampling process, and they suggest exploring variance-reduction techniques for further improvement.
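The claim about a more balanced singular-value spectrum can be probed with a simple diagnostic. Below is a minimal sketch that scores a weight matrix by its entropy-based effective rank; the metric and the function name `singular_value_balance` are illustrative choices, not the paper's exact analysis.

```python
import torch

def singular_value_balance(weight: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a weight matrix.

    A flat (balanced) singular-value spectrum gives a value close to
    min(weight.shape); a spiky spectrum gives a much smaller one.
    Generic diagnostic, not the paper's exact analysis.
    """
    s = torch.linalg.svdvals(weight.detach().float())
    p = s / (s.sum() + eps)                   # normalize the spectrum to a distribution
    entropy = -(p * (p + eps).log()).sum()    # Shannon entropy of the spectrum
    return entropy.exp().item()               # effective rank (exponential of entropy)
```

Comparing this score across checkpoints trained with different low-rank methods would surface the kind of spectral imbalance described above.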
Experiments demonstrate that GUM consistently outperforms existing low-rank methods while maintaining comparable memory efficiency. The team has released the code for GUM, enabling other researchers to reproduce the results and build upon their work. This advancement promises to democratize access to LLM training, reducing the computational resources required and potentially unlocking the development of even larger and more powerful models. The researchers also emphasize responsible AI development and deployment, highlighting the potential to reduce the carbon footprint associated with LLM training.
Unbiased Low-Rank Optimization for Large Language Models
Researchers have introduced a new optimization method, GaLore Unbiased with Muon (GUM), designed to improve the efficiency of training large language models (LLMs) while maintaining accuracy. A key challenge in this field is balancing computational cost with model performance, and techniques like low-rank projection are often used to reduce memory requirements. However, many such methods introduce biases that can hinder convergence and limit performance. GUM addresses this issue with an unbiased low-rank optimization approach that builds on the GaLore mechanism and the Muon algorithm. On a synthetic noisy linear regression problem, GUM converged to accuracy comparable to the full-parameter Muon baseline, while the biased GaLore method failed to converge.
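To make the bias issue concrete, here is a small self-contained sketch of a noisy linear regression in which plain gradient descent is compared against the same descent restricted to a fixed low-rank subspace. It is not the paper's GaLore-versus-GUM experiment, and all dimensions, learning rates, and step counts are assumed values; it only illustrates how a fixed projection stalls far above the noise floor.

```python
import torch

torch.manual_seed(0)

# Toy noisy linear regression: recover W* from Y = X @ W*^T + noise.
d_in, d_out, n = 64, 64, 4096
W_star = torch.randn(d_out, d_in)
X = torch.randn(n, d_in)
Y = X @ W_star.T + 0.1 * torch.randn(n, d_out)

def train(rank=None, steps=300, lr=10.0):
    """Gradient descent on W, optionally projecting each gradient onto a
    fixed rank-`rank` subspace of the output dimension."""
    W = torch.zeros(d_out, d_in, requires_grad=True)
    P = torch.linalg.qr(torch.randn(d_out, rank))[0] if rank else None
    for _ in range(steps):
        loss = ((X @ W.T - Y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            g = W.grad
            if P is not None:
                g = P @ (P.T @ g)   # fixed low-rank projection of the gradient
            W -= lr * g
            W.grad.zero_()
    return ((X @ W.T - Y) ** 2).mean().item()

print("full-parameter loss:", train())        # approaches the noise floor
print("fixed rank-8 loss  :", train(rank=8))  # stalls far above it
```

GaLore periodically refreshes its projection rather than fixing it once, so this toy only captures the qualitative effect of constraining updates to a low-rank subspace.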
This improvement is attributed to GUM's ability to distribute knowledge more uniformly across layers, leading to better utilization of the parameter space. Further validation involved fine-tuning LLMs, including LLaMA-3-8B, Qwen-2.5-7B, and Gemma-2-9B, on instruction-following and mathematical reasoning tasks. With a rank of 512 for GaLore and 2 + 128 for GUM, the results show significant performance gains. For example, with LLaMA-3-8B, GUM achieved higher accuracy on both instruction-following and mathematical reasoning tasks than GaLore.
Similar improvements were observed with Qwen-2.5-7B and Gemma-2-9B. Peak GPU memory usage was also carefully measured, revealing that GUM maintains comparable or reduced memory footprints relative to GaLore. These results demonstrate that GUM offers a compelling solution for efficient and accurate LLM training, delivering improved performance with comparable memory requirements.
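Peak-memory comparisons like these can be reproduced with PyTorch's built-in CUDA statistics; the helper below is a generic measurement sketch (the function and argument names are assumptions), not the authors' benchmarking script.

```python
import torch

def measure_peak_gpu_memory(train_step, n_steps: int = 10) -> float:
    """Run `train_step` a few times and return peak GPU memory in GiB.

    `train_step` is any zero-argument callable that performs one forward,
    backward, and optimizer step. Illustrative helper only.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    for _ in range(n_steps):
        train_step()
    return torch.cuda.max_memory_allocated() / 1024 ** 3
```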
Debiasing Low-Rank Training for Large Models
Scientists have developed GUM, a new optimization method designed to improve the efficiency of training large language models (LLMs). As above, the key tension is between computational cost and model performance: low-rank projection reduces memory requirements but typically introduces biases that hinder convergence and limit performance. GUM resolves this by introducing a layerwise sampling technique that debiases low-rank projection, restoring the theoretical convergence properties of standard optimization algorithms while maintaining memory efficiency.
The method involves randomly freezing layers during training, mitigating the biases inherent in low-rank projections and improving convergence. Analysis reveals that this improvement stems from a more uniform distribution of knowledge within the model’s layers, leading to better utilization of parameters and enhanced memorization capabilities. The researchers acknowledge that while their method shows promising results, further research is needed to explore its application to a wider range of models and datasets. The team has prioritized reproducibility by providing detailed implementation information, facilitating further investigation by the broader research community. This advancement promises to reduce the computational resources required for LLM training, potentially unlocking the development of even larger and more powerful models.
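As a rough illustration of the layer-freezing idea, the sketch below samples which transformer blocks receive gradients at a given step by toggling `requires_grad`; the sampling probability, the names, and the absence of any update rescaling are assumptions rather than the released GUM implementation.

```python
import torch
import torch.nn as nn

def sample_active_layers(blocks: nn.ModuleList, keep_prob: float = 0.25) -> list[int]:
    """Randomly choose which blocks are trained this step by freezing the rest.

    Each block is kept independently with probability `keep_prob`; frozen
    blocks receive no gradients (and hence no optimizer-state updates),
    which is where the memory saving comes from. Illustrative sketch only.
    """
    active = []
    for i, block in enumerate(blocks):
        keep = bool(torch.rand(()) < keep_prob)
        for p in block.parameters():
            p.requires_grad_(keep)
        if keep:
            active.append(i)
    return active
```

In a full training loop this would be called before each forward and backward pass, and keeping the expected update unbiased typically requires rescaling the sampled layers' updates by the inverse of their sampling probability; whether GUM uses exactly this correction is not spelled out in the summaries above.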
👉 More information
🗞 Unbiased Gradient Low-Rank Projection
🧠 ArXiv: https://arxiv.org/abs/2510.17802
