Scientists have long recognized optimization instability as a critical challenge in training deep neural networks, but the underlying mechanisms remain incompletely understood. Hengjie Cao, Mengyi Chen, and Yifeng Yang, from Fudan University, alongside Dong et al., now demonstrate that the emergence and amplification of singularities within the parametric space significantly contribute to this instability. Their analysis reveals a reinforcing cycle where singularities grow with each gradient update, aligning with representations and ultimately escalating the risk of disruptive loss explosions. This research is significant because it identifies a fundamental limitation of current optimization techniques and introduces Parametric Singularity Smoothing (PSS), a novel method to mitigate these instabilities, demonstrably improving training efficiency, restoring trainability, and enhancing generalization across a range of neural network architectures and datasets.
Their analysis reveals that parametric singularities inevitably grow with gradient updates and further intensify alignment with representations, leading to increased singularities in the representation space.
They demonstrate that the gradient Frobenius norms are bounded by the top singular values of the weight matrices. As training progresses, the mutually reinforcing growth of weight and representation singularities, termed the curse of singularities, relaxes these bounds, escalating the risk of sharp loss explosions. Extensive experiments across diverse datasets, architectures, and optimizers demonstrate that PSS mitigates instability, restores trainability even after failure, and improves both training efficiency and generalization.
Training instability, often manifesting as sudden loss explosions or erratic fluctuations, remains a long-standing and critical challenge in deep neural network (DNN) optimization. These instabilities frequently arise without early warning signals and can occur at any stage of training, often leading to catastrophic collapse of the learning process.
As a result, developing a stable training configuration is labor-intensive and error-prone; even small modifications in model architecture, optimizer, or data can undermine trainability and necessitate costly re-training. This problem is especially severe in large-scale foundation models, where each training run consumes weeks of computational time on thousands of GPUs.
In these contexts, repeated instability-driven failures hinder iterative development, making a stable and restart-free training approach crucial. Existing methods to address instability, such as gradient clipping, are largely empirical and sensitive to hyperparameter choices, often either failing to prevent divergence or excessively constraining learning dynamics.
While theoretical insights link instabilities to sharp loss landscapes characterized by large Hessian eigenvalues, the high complexity of the loss landscape has prevented previous analyses from fully revealing the underlying mechanism of training instability. In this work, the researchers analyze training instability through the lens of growing singularities in the parametric space, a less-investigated perspective.
Empirical investigations reveal that, across layers, the weight matrices, such as the query and key matrices in Transformers, become increasingly singular as training progresses, consistent with recent findings. More strikingly, they find that imminent catastrophic loss explosions are preceded by rapidly escalating singularities in both the parameter space and the representation space, characterized by growing rank deficiencies.
These phenomena exhibit a mutually reinforcing trend; growing singularities in weight matrices lead to more degenerate representations, which in turn reinforce the singularity of subsequent gradient updates. This feedback loop culminates in a self-amplifying mechanism that triggers sharp, unrecoverable loss spikes.
They term this failure mode the curse of singularities, a fundamental yet previously underexplored cause of optimization instability in DNNs. To substantiate these observations, they adopt a simplified one-layer Transformer model and provide analysis grounded in the QK-gradient approximation. They prove that, under mild conditions, each gradient update reduces the stable rank (a soft surrogate for matrix rank) of the weight matrices by a provably lower-bounded amount, so parametric singularity keeps growing.
Furthermore, they show that alignment between weight and representation singular vectors becomes tighter after backpropagation, further increasing representational singularity and reinforcing the instability. Crucially, they demonstrate that the Frobenius norm of gradients is bounded by the top singular value of the weight matrices, which means as the network becomes more singular, it also becomes more susceptible to large, unstable gradient updates.
This dynamic relaxation of gradient norm bounds provides a mechanistic link between the growing singularities and the sharp loss landscapes observed in prior work. Empirical evaluations across diverse models, datasets, and optimization methods demonstrate that PSS consistently prevents training collapses, improves convergence efficiency, enhances generalization performance, and can restore trainability even after the occurrence of perceivable instabilities.
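One practical corollary of this bound is that the top singular value and the stable rank of each weight matrix can be tracked cheaply during training as an early-warning signal. Below is a minimal PyTorch monitoring sketch; the function name and the alert threshold are our illustrative choices, not values from the paper:

```python
import torch

@torch.no_grad()
def singularity_report(model: torch.nn.Module, sr_warn: float = 2.0) -> dict:
    """Report the top singular value and stable rank of every 2-D weight.

    A falling stable rank alongside a growing top singular value is the
    signature of the curse of singularities described above; `sr_warn` is
    an illustrative alert threshold, not a value from the paper.
    """
    report = {}
    for name, p in model.named_parameters():
        if p.ndim != 2:
            continue
        sigma = torch.linalg.svdvals(p)              # singular values, descending
        stable_rank = float((sigma ** 2).sum() / sigma[0] ** 2)
        report[name] = {
            "sigma1": float(sigma[0]),               # quantity bounding the gradient norm
            "stable_rank": stable_rank,
            "at_risk": stable_rank < sr_warn,
        }
    return report
```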
Their contributions are summarized as follows: they identify a mutually reinforcing growth of singularities in weight and representation matrices, termed the curse of singularities, as a primary cause of training instability in deep neural networks. The paper begins with preliminary knowledge and notation, then empirically demonstrates rank co-collapse in the representation and parametric spaces and explores its mechanism using a simplified one-layer transformer, providing key insights.
Finally, they link singularity amplification to training instability. This section outlines key definitions, assumptions, and a one-layer transformer framework, forming the foundation for their subsequent analysis. They introduce the Stable Rank (SR) to quantify parametric singularities.
For a weight matrix $W \in \mathbb{R}^{m \times d}$, they perform Singular Value Decomposition (SVD) to obtain singular values $\{\sigma_i\}_{i=1}^{\min(m,d)}$, left singular vectors $\{u_i\}_{i=1}^{m} \subset \mathbb{R}^m$, and right singular vectors $\{v_i\}_{i=1}^{d} \subset \mathbb{R}^d$, such that $W = \sum_{i=1}^{\min(m,d)} \sigma_i u_i v_i^\top$. Throughout the paper, they assume all singular values are sorted in descending order, namely $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$, and they denote the $i$-th singular value of $W$ as $\sigma_i(W)$.
Definition 2.1 (Parametric Singularity) states that the parametric singularity of a matrix $W$ is its stable rank, defined as the ratio of the squared Frobenius norm $\|W\|_F^2$ to the squared spectral norm $\|W\|_2^2$: $\mathrm{SR}(W) = \frac{\|W\|_F^2}{\|W\|_2^2} = \frac{\sum_{i=1}^{r} \sigma_i^2}{\sigma_1^2}$. SR provides a smoothed measure of matrix rank, reflecting weight matrix singularity.
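The stable rank is simple to compute directly from the singular values; here is a minimal PyTorch sketch of Definition 2.1 (the function name is ours):

```python
import torch

def stable_rank(W: torch.Tensor) -> float:
    """SR(W) = ||W||_F^2 / ||W||_2^2 = (sum_i sigma_i^2) / sigma_1^2."""
    sigma = torch.linalg.svdvals(W)          # singular values in descending order
    return float((sigma ** 2).sum() / sigma[0] ** 2)

# A nearly rank-one matrix has SR close to 1 (pronounced singularity),
# while a random Gaussian matrix has SR well above 1.
near_singular = torch.outer(torch.randn(64), torch.randn(32))
near_singular += 0.01 * torch.randn(64, 32)
print(stable_rank(near_singular))            # close to 1
print(stable_rank(torch.randn(64, 32)))      # roughly 10 for this shape
```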
A low SR indicates a pronounced singularity, where a few singular vectors dominate the parametric space. Similarly, they define the singularity in the representation space. Let $X = [x_1, x_2, \dots, x_T] \in \mathbb{R}^{T \times d}$ represent an input of $T$ tokens, and let $W_{QK}$ denote the query-key parameter matrix.
Definition 2.2 (Representation Singularity) states that $\mathrm{SR}(Z)$ quantifies the representation singularity, where $Z$ denotes the representation matrix. They employ a Taylor expansion of the softmax $S$ around the origin, with coefficients $\gamma^i := \nabla_i S(0) = \frac{1}{T} e_i - \frac{1}{T^2} \mathbf{1}$ and $\gamma_0^i := S(0)_i = \frac{1}{T}$, where $e_i$ is the $i$-th one-hot vector and $\mathbf{1}$ is the all-ones vector. Thus, $\gamma_i^i = \frac{1}{T} - \frac{1}{T^2}$ and $\gamma_a^i = -\frac{1}{T^2}$ for $a \in [T] \setminus \{i\}$.
To stabilize the approximation, they apply a truncation strategy with a fixed threshold $c$: when the absolute value of an approximation term exceeds $c$, the corresponding slope is clipped to 0:

$$\tilde\gamma_i^i = \begin{cases} \frac{T-1}{T^2}, & \text{if } \gamma_i^i \omega_i + \frac{1}{T}\gamma_0^i \in [-c, c] \\ 0, & \text{otherwise,} \end{cases} \qquad \tilde\gamma_a^i = \begin{cases} -\frac{1}{T^2}, & \text{if } \gamma_a^i \omega_a + \frac{1}{T}\gamma_0^i \in [-c, c] \\ 0, & \text{otherwise.} \end{cases}$$

The resulting approximation in vector form is $S(\omega) \approx \tilde S(\omega) = \tilde\Gamma^\top \omega + \tilde\gamma_0$, where $\tilde\Gamma := [\tilde\gamma^1, \tilde\gamma^2, \dots, \tilde\gamma^T]$ and $\tilde\gamma_0 = [\tilde\gamma_0^1, \tilde\gamma_0^2, \dots, \tilde\gamma_0^T]^\top$.
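In code, this truncated linear approximation amounts to a $T \times T$ slope matrix plus an intercept. Here is a minimal sketch under our reading of the per-entry truncation rule; the threshold value is an illustrative default:

```python
import torch

def truncated_linear_softmax(omega: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Piecewise-linear softmax approximation S~(omega) = Gamma~^T omega + gamma0.

    Slopes come from the Taylor expansion of softmax at the origin:
    gamma^i_i = 1/T - 1/T^2, gamma^i_a = -1/T^2, intercept gamma^i_0 = 1/T.
    A slope is clipped to zero when its approximation term leaves [-c, c];
    c = 1.0 is an illustrative default, not the paper's value.
    """
    T = omega.shape[0]
    # Row i of Gamma holds gamma^i, so Gamma plays the role of Gamma~^T.
    Gamma = torch.eye(T) / T - torch.full((T, T), 1.0 / T**2)
    gamma0 = torch.full((T,), 1.0 / T)
    # Truncation: zero the slope wherever the per-entry term leaves [-c, c].
    term = Gamma * omega.unsqueeze(0) + gamma0.unsqueeze(1) / T
    Gamma = torch.where(term.abs() <= c, Gamma, torch.zeros_like(Gamma))
    return Gamma @ omega + gamma0

omega = torch.randn(8)
print(truncated_linear_softmax(omega))   # compare with torch.softmax(omega, 0)
```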
They consider a one-layer transformer, simplifying the feed-forward network by assuming an identity activation function and omitting layer normalization. With $X \in \mathbb{R}^{T \times d}$ as a sequence input with $T$ tokens and using the approximated softmax $\tilde S$, the output of a transformer with an attention layer $\mathcal{A}$ and a feed-forward layer $\mathcal{F}$ is given by $\mathcal{F}(\mathcal{A}(X))_T = W_F W_V X^\top \tilde S(\omega) + x_T = W_F W_V X^\top \tilde\Gamma^\top \omega + W_F W_V X^\top \tilde\gamma_0 + x_T$, where $W_F := W_{F_2} W_{F_1} + I$, with $W_{F_1}, W_{F_2} \in \mathbb{R}^{d \times d}$ as feed-forward parameters.
The query-key parameter matrix $W_{QK} := W_Q^\top W_K \in \mathbb{R}^{d \times d}$ is optimized jointly, with $W_Q$ and $W_K$ constrained to be identical throughout training, ensuring $W_{QK}$ is treated as a unified parameter. The training task is causal language modeling: predicting the $(T+1)$-th token $y := x_{T+1} \in \mathbb{R}^d$ given $T$ contextual tokens $X$.
The objective, using squared loss, is $\mathcal{J}(\Theta) := \frac{1}{2}\mathbb{E}\|y - \mathcal{F}(\mathcal{A}(X))_T\|^2$, where $\Theta$ denotes the model parameters and the expectation is over the dataset. The simplified gradient of the objective with respect to $W_{QK}$ is $\frac{1}{d} \sum_{i,j,a,b \in [T]} \mathbb{E}\left[\tilde\gamma_a^i \tilde\gamma_b^j P_{ij} (x_b^\top W_{QK} x_T)\, x_a x_T^\top\right]$, where $P_{ij} := x_i^\top W_V^\top W_F^\top W_F W_V x_j$ is independent of $W_{QK}$, under the assumption that the other model parameters do not influence the analysis of the singularity of $W_{QK}$.
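To make the setup concrete, here is a minimal PyTorch sketch of this simplified one-layer transformer and its squared-loss objective. The class and tensor names are ours, and we use the exact softmax where the analysis substitutes the truncated approximation $\tilde S$:

```python
import torch

class SimplifiedTransformer(torch.nn.Module):
    """One attention layer plus a linear feed-forward block: identity
    activation, no layer normalization, residual on the last token."""

    def __init__(self, d: int):
        super().__init__()
        # W_QK = W_Q^T W_K is optimized as a single unified parameter.
        self.WQK = torch.nn.Parameter(torch.randn(d, d) / d**0.5)
        self.WV = torch.nn.Parameter(torch.randn(d, d) / d**0.5)
        self.WF1 = torch.nn.Parameter(torch.randn(d, d) / d**0.5)
        self.WF2 = torch.nn.Parameter(torch.randn(d, d) / d**0.5)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        WF = self.WF2 @ self.WF1 + torch.eye(X.shape[1])  # W_F := W_F2 W_F1 + I
        omega = X @ self.WQK @ X[-1]                      # omega_a = x_a^T W_QK x_T
        s = torch.softmax(omega, dim=0)                   # exact softmax here
        return WF @ (self.WV @ (X.T @ s)) + X[-1]         # F(A(X))_T

# Squared-loss causal LM objective on one (X, y) pair, with y = x_{T+1}.
T, d = 16, 32
X, y = torch.randn(T, d), torch.randn(d)
model = SimplifiedTransformer(d)
loss = 0.5 * (y - model(X)).pow(2).sum()
loss.backward()                                           # populates WQK.grad etc.
```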
Empirical results in Figure 1(a) and (b) illustrate the evolution of the parametric singularity SR(W) and the representation singularity SR(Z) during training. Both metrics exhibit a sharp decline early in training, stabilizing at low values with minor fluctuations thereafter.
Singularities align and amplify during training with suboptimal hyperparameters, leading to instability and poor generalization
This sharp early decline and low-value stabilization of both SR(W) and SR(Z) was observed consistently across all network components, as detailed in Appendix E.
The singularity alignment $\varphi$, quantifying the relationship between the parametric and representation spaces, sharply increased early in training and remained high, ranging between 0.6 and 1.0. This high alignment indicated strong reinforcement between parametric and representation singularities, particularly when combined with suboptimal hyperparameters such as large learning rates.
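The paper's exact definition of $\varphi$ is not reproduced in this summary; one natural instantiation is the overlap between the top right singular directions of a weight matrix and of the representations it acts on, sketched below:

```python
import torch

def top_direction_alignment(W: torch.Tensor, Z: torch.Tensor) -> float:
    """Overlap in [0, 1] between the top right singular vector of a weight
    matrix W (m x d) and that of a representation matrix Z (T x d).

    One plausible instantiation of a singularity-alignment score; the
    paper's precise definition of phi may differ.
    """
    _, _, Vw = torch.linalg.svd(W, full_matrices=False)  # rows: right singular vecs
    _, _, Vz = torch.linalg.svd(Z, full_matrices=False)
    return float((Vw[0] @ Vz[0]).abs())                  # |cosine|: 1 = fully aligned
```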
Theorem 2.1 demonstrates that the gradient of the loss $\mathcal{J}$ with respect to $W_{QK}$ amplifies parametric singularities, with the amplification approximated as $O(\sigma_1^2 \mu_1^2 \varphi^2) \sum_{t=1} \mu_t^2 \beta_t \beta_t^\top$. For large sequence lengths $T$, and when $\mathrm{SR}(W_K)$ exceeds $1 + \frac{\mathrm{SR}(Z) - 1}{\varphi^2}$, a gradient update with learning rate $\eta$ reduces the stable rank of both $W_Q$ and $W_K$ by $\Delta\mathrm{SR}(W_Q) = \Delta\mathrm{SR}(W_K) = -\eta\, O(\mu_1^2 \varphi^2)\, \mu_1^2 R$.
This reduction drives $\mathrm{SR}(W_Q)$ and $\mathrm{SR}(W_K)$ towards $\mathrm{SR}(Z)$, thereby intensifying parametric singularities. Analysis of the one-layer transformer revealed that gradient updates emphasize the dominant singular direction in the representation space: for a sufficiently large sequence length $T$, the simplified gradient of the objective with respect to $W_{QK}$ takes the form given in the preliminaries above, with $P_{ij}$ independent of $W_{QK}$ under the assumption that the other model parameters do not influence the singularity analysis.
As described earlier, the softmax function was approximated with a piecewise-linear scheme: a Taylor expansion around the origin with coefficients $\gamma^i := \nabla_i S(0) = \frac{1}{T} e_i - \frac{1}{T^2} \mathbf{1}$ and $\gamma_0^i := S(0)_i = \frac{1}{T}$, stabilized by a truncation strategy with a fixed threshold $c$ that clips a slope to zero when the absolute value of its approximation term exceeds $c$.
Mitigating instability through smoothing of singular spectra in neural networks improves generalization performance
Parametric singularity smoothing effectively addresses training instabilities observed in neural networks by mitigating the emergence and amplification of singularities within the parametric space. Analysis reveals that these singularities grow during gradient updates and align with representations, ultimately increasing instability and the potential for sharp loss explosions.
This phenomenon, termed the curse of singularities, arises from the relaxation of bounds on gradient Frobenius norms as training progresses, due to the mutually reinforcing growth of singularities in both weight matrices and representations. While existing methods such as adaptive optimizers, parameter freezing, and spectral normalization offer some stability, they often prove ineffective with large learning rates or after instability has already begun.
The research establishes that rank co-collapses in both parametric and representation spaces contribute significantly to training instabilities. By smoothing the singular spectrum of weight matrices, the proposed method effectively prevents unstable training without compromising performance and can even restore trainability after instability is detected.
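The summary does not spell out the exact PSS update, but the core operation it describes, smoothing the singular spectrum of a weight matrix without disturbing its singular subspaces, can be sketched as follows; the interpolation rule and `alpha` are our illustrative choices, not the paper's procedure:

```python
import torch

@torch.no_grad()
def smooth_singular_spectrum(W: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Illustrative singular-spectrum smoothing in the spirit of PSS.

    Pulls every singular value toward the spectrum's RMS value, which raises
    the stable rank ||W||_F^2 / sigma_1^2 whenever the spectrum is peaked,
    while leaving the singular vectors untouched. The blend rule and alpha
    are illustrative assumptions, not the paper's exact procedure.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    rms = S.pow(2).mean().sqrt()              # RMS of the singular values
    S_smooth = (1 - alpha) * S + alpha * rms  # shrink the gap to the bulk
    return U @ torch.diag(S_smooth) @ Vh
```

Applied periodically to at-risk matrices (for example, the query and key weights), a step of this kind counteracts the feedback loop by deflating the dominant direction relative to the bulk of the spectrum.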
As a flexible and easily integrated module, this approach offers potential benefits for large language models by reducing the need for extensive retraining and saving computational resources. Further investigation into parametric-space instability provides a new avenue for analyzing and improving neural network training dynamics.
👉 More information
🗞 Dispelling the Curse of Singularities in Neural Network Optimizations
🧠 ArXiv: https://arxiv.org/abs/2602.01308
