Low-Precision Training Instability in Large Language Models and Solutions

Training large language models in reduced-precision formats, such as those supported by NVIDIA’s Blackwell architecture, consistently exhibits stochastic loss instabilities, particularly at larger scales. These instabilities stem from a multiplicative gradient bias arising from the quantisation of layer-normalisation parameters and activations, but they can be mitigated through hybrid precision configurations.

The escalating computational demands of training large language models necessitate exploration of reduced-precision arithmetic to improve efficiency, yet these approaches often introduce instabilities during learning. Researchers at Harvard University, including Chloe Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, and Nikhil Anand, investigate these challenges in a study of ‘Microscaling’ (MX) formats, which share a single scaling factor across blocks of parameters to extend representational range and accelerate computation. Their work, entitled “Characterization and Mitigation of Training Instabilities in Microscaling Formats”, analyses nearly one thousand language models trained from scratch and reveals consistent stochastic loss instabilities at larger computational scales. Through controlled experiments and a simplified proxy model, the team identifies a mechanism linked to the quantisation of layer-norm parameters and activations, and demonstrates that these instabilities can be mitigated through adaptive precision schemes, achieving performance comparable to full-precision training.

The increasing computational burden of training large language models (LLMs) compels investigation into lower-precision arithmetic formats to enhance efficiency. Recent hardware developments, such as NVIDIA’s Blackwell architecture and its associated ‘Microscaling’ (MX) formats, which represent numbers using fewer bits, promise substantial gains. However, reducing numerical precision during training introduces instabilities, primarily due to a multiplicative bias in gradients arising from the quantisation of layer-normalisation parameters and a small subset of activations. Layer normalisation is a technique that stabilises training by adjusting the inputs to each layer of a neural network, and quantisation is the process of reducing the number of bits used to represent numerical values.
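
To make the block-scaling idea concrete, here is a minimal, hypothetical sketch of microscaling-style fake quantisation in PyTorch: each block of values shares a single power-of-two scale, and elements are rounded onto a low-bit grid. The function name, block size, and the integer element grid are illustrative simplifications; real MX formats use specific floating-point element encodings (e.g. FP8, FP6, FP4) with an 8-bit shared exponent, and this sketch is not the paper's implementation.

```python
import torch

def mx_quantize(x: torch.Tensor, block_size: int = 32, elem_bits: int = 8) -> torch.Tensor:
    """Simplified microscaling-style fake quantisation (illustrative only).

    Each block of `block_size` consecutive values shares one power-of-two
    scale; elements are rounded onto a low-bit integer grid and dequantised
    back to floating point, mimicking the information loss of MX formats.
    """
    flat = x.flatten()
    pad = (-flat.numel()) % block_size          # pad so the length divides evenly
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)

    # Shared per-block scale: the smallest power of two that keeps the
    # block's largest magnitude on the representable grid.
    qmax = 2 ** (elem_bits - 1) - 1
    max_abs = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-30)
    scale = 2.0 ** torch.ceil(torch.log2(max_abs / qmax))

    # Quantise to the integer grid, then dequantise (fake quantisation).
    q = torch.clamp(torch.round(blocks / scale), -qmax - 1, qmax)
    deq = q * scale

    return deq.flatten()[: x.numel()].view_as(x)
```

Calling `mx_quantize(torch.randn(4, 256))` returns a tensor of the same shape whose values have been snapped onto the block-scaled grid, which is where the rounding error driving the gradient bias comes from.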

This gradient bias can trigger runaway divergence during training, in which the model’s parameters change rapidly and the loss blows up. The researchers demonstrate that no fixed choice of precision format is optimal, and that adjusting precision in situ, during the training process itself, can prevent divergence and enable robust training. This involves monitoring the training run and switching between precision levels when instabilities are observed, as in the sketch below.
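
A minimal sketch of what such in-situ switching could look like, assuming the `mx_quantize` helper from the previous sketch: weights are fake-quantised during the forward pass, with a straight-through estimator so gradients still reach the full-precision master weights, and a placeholder loss-spike test flips the quantised layers back to full precision. The switching criterion and threshold here are hypothetical, not the paper's exact procedure.

```python
import torch
from torch import nn

class QuantLinear(nn.Linear):
    """Linear layer that can fake-quantise its weight on the fly while the
    optimiser keeps updating the full-precision master weight."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.low_precision = True  # can be flipped to False mid-training

    def forward(self, x):
        if self.low_precision:
            w_q = mx_quantize(self.weight)                   # from the previous sketch
            w = self.weight + (w_q - self.weight).detach()   # straight-through estimator
        else:
            w = self.weight
        return nn.functional.linear(x, w, self.bias)


model = nn.Sequential(QuantLinear(256, 256), nn.GELU(), QuantLinear(256, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_history = []

for step in range(1000):
    # Toy regression task standing in for language-model training.
    x = torch.randn(64, 256)
    y = x.sum(dim=1, keepdim=True)

    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    loss_history.append(loss.item())

    # Placeholder divergence check: a loss spike well above the recent average
    # switches the quantised layers back to full precision in situ.
    if len(loss_history) > 20:
        recent = sum(loss_history[-21:-1]) / 20
        if loss_history[-1] > 5 * recent:
            for m in model.modules():
                if isinstance(m, QuantLinear):
                    m.low_precision = False
```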

Applying these principles to full-scale LLM training, the researchers show that hybrid precision configurations, which strategically combine different precision levels within the model, can match the performance of full-precision training, which uses the highest level of numerical precision. This offers a practical route to computational efficiency without significant performance degradation. The research demonstrates that certain layers or parameters are more sensitive to reduced precision and benefit from being kept at higher precision, while others operate effectively at lower precision.
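
Building on the `QuantLinear` sketch above, a hybrid-precision policy might look like the following: matrix-multiply layers run in simulated low precision while layer norms and embeddings stay in full precision, reflecting the reported sensitivity of layer-norm parameters. The module types and the helper name are illustrative assumptions; the paper's actual per-layer assignments may differ.

```python
from torch import nn

def apply_hybrid_precision(model: nn.Module,
                           high_precision_types=(nn.LayerNorm, nn.Embedding)):
    """Illustrative hybrid-precision policy (not the paper's exact recipe).

    QuantLinear layers (from the previous sketch) are switched to simulated
    low precision, while modules whose types appear in `high_precision_types`
    are left untouched and therefore run in full precision.
    """
    plan = {}
    for name, module in model.named_modules():
        if isinstance(module, QuantLinear):
            module.low_precision = True
            plan[name] = "low precision"
        elif isinstance(module, high_precision_types):
            plan[name] = "full precision"
    return plan
```

For a transformer-style block, this would quantise the attention and MLP projections while leaving the layer-norm weights, identified in the study as a key source of the gradient bias, at full precision.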

These findings provide a more nuanced understanding of the challenges inherent in reduced-precision training and offer actionable strategies for stabilising the process. The publicly released code facilitates further exploration and development of robust, efficient LLM training methodologies, encouraging innovation in this crucial field. This work underscores a critical trade-off between computational efficiency and training stability, necessitating careful consideration of precision schemes to fully leverage the capabilities of next-generation hardware accelerators.

👉 More information
🗞 Characterization and Mitigation of Training Instabilities in Microscaling Formats
🧠 DOI: https://doi.org/10.48550/arXiv.2506.20752

Quantum News

As the Official Quantum Dog (or hound), my role is to dig out the latest nuggets of quantum goodness. There is so much happening right now in the field of technology, whether AI or the march of robots. But Quantum occupies a special space. Quite literally a special space, a Hilbert space in fact, haha! Here I try to provide some of the news that might be considered breaking news in the Quantum Computing space.
