Researchers are increasingly focused on Neural Collapse (NC), the emergence of highly symmetric geometric structure in the representations that deep neural networks learn during training. Jim Zhao, Tin Sum Cheng, Wojciech Masarczyk, and Aurelien Lucchi, working collaboratively across the University of Basel and the Warsaw University of Technology, demonstrate that the choice of optimisation algorithm is a critical, and previously overlooked, factor in whether Neural Collapse occurs. Challenging the assumption that NC is universal, the team introduces a novel diagnostic metric, NC0, and provides theoretical evidence that adaptive optimizers with decoupled weight decay, such as AdamW, cannot produce NC. Their analysis of Stochastic Gradient Descent (SGD) and SignGD, supported by an extensive empirical evaluation of 3,900 training runs, reveals distinct dynamics for each method and highlights the importance of weight-decay coupling in shaping the implicit biases of different optimisation methods, offering a foundational step towards a more complete understanding of deep learning behaviour.
Scientists are beginning to unravel how deep neural networks achieve their remarkable performance, moving beyond simply observing that they succeed to understanding the mechanisms behind that success. The method used to train these networks strongly influences the final arrangement of knowledge within them, and understanding this connection offers a route to building more efficient and reliable artificial intelligence systems.
A phenomenon called neural collapse (NC) has emerged as a potential key to this understanding, describing a surprising self-organisation of network representations during the final stages of training. Existing explanations for neural collapse have often overlooked a critical component: the optimisation algorithm itself, implicitly assuming that collapse is a universal outcome regardless of how the network learns.
Now, research challenges this assumption, demonstrating that the choice of optimizer, the algorithm used to adjust the network’s parameters, plays a decisive role in whether or not neural collapse occurs. This work introduces NC0, a new diagnostic metric designed to more accurately track and theoretically analyse the progression towards collapse. Unlike earlier metrics, which can plateau at non-zero values, NC0 offers a clearer signal of whether true collapse is happening or only appears to be.
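For context, earlier work on neural collapse typically tracks geometric metrics of this kind. The sketch below computes the classic NC1 within-class variability ratio from last-layer features; it is offered only as an illustration of what such metrics measure, not as the definition of NC0, which is given in the paper.

```python
import numpy as np

def nc1_within_class_variability(features, labels):
    """Classic NC1-style metric: trace(Sigma_W @ pinv(Sigma_B)) / K.

    features: (n_samples, d) array of last-layer activations
    labels:   (n_samples,) array of integer class labels
    Smaller values indicate stronger within-class variability collapse.
    """
    classes = np.unique(labels)
    K, d = len(classes), features.shape[1]
    global_mean = features.mean(axis=0)

    sigma_w = np.zeros((d, d))  # within-class covariance
    sigma_b = np.zeros((d, d))  # between-class covariance
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        centred = fc - mu_c
        sigma_w += centred.T @ centred / len(features)
        diff = (mu_c - global_mean)[:, None]
        sigma_b += (diff @ diff.T) / K

    return np.trace(sigma_w @ np.linalg.pinv(sigma_b)) / K
```

Metrics of this form can stall at small but non-zero values late in training, which is precisely the ambiguity NC0 is designed to remove.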
Through both theoretical proofs and extensive experimentation, spanning 3,900 training runs, researchers have found a surprising distinction between optimizers like AdamW and SGD. Specifically, networks trained with AdamW consistently fail to exhibit the full characteristics of neural collapse, while those trained with SGD often do. Beyond simply observing this difference, the study traces the root cause to how weight decay, a technique used to prevent overfitting, is implemented in each algorithm.
This discovery highlights the often-overlooked influence of weight-decay coupling in shaping the implicit biases of optimizers and opens new avenues for designing more effective and predictable learning algorithms. The study also provides the first account of how momentum accelerates NC when training with SGD. At the heart of the issue lies the difficulty in quantifying the phenomenon itself.
Traditional NC metrics, designed to capture the geometric alignment of network representations, are often difficult to interpret and can be misleading under realistic training conditions. To address this limitation, NC0 has been developed. Its convergence to zero is presented as a necessary condition for neural collapse, offering a more definitive criterion for assessment.
The implications of this work extend beyond simply improving the measurement of neural collapse. By theoretically demonstrating that certain optimizers, such as AdamW, cannot achieve NC under specific conditions, researchers have identified a fundamental constraint on the learning process. For instance, the study proves that SGD, SignGD with coupled weight decay (a variant of Adam), and SignGD with decoupled weight decay (a variant of AdamW) exhibit distinct behaviours regarding NC0.
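As a rough illustration, not the paper's formal setting, the three update rules being compared can be written as single parameter-update steps; the hyperparameter values below are placeholders.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, wd=5e-4):
    # SGD with weight decay: the decay term enters through the gradient.
    return w - lr * (grad + wd * w)

def signgd_coupled_step(w, grad, lr=1e-3, wd=5e-4):
    # SignGD with coupled weight decay (Adam-like): the decay is added to
    # the gradient *before* the sign is taken, so its magnitude is
    # absorbed by sign().
    return w - lr * np.sign(grad + wd * w)

def signgd_decoupled_step(w, grad, lr=1e-3, wd=5e-4):
    # SignGD with decoupled weight decay (AdamW-like): the decay bypasses
    # the sign and shrinks the weights directly, independent of the
    # gradient's scale.
    return w - lr * (np.sign(grad) + wd * w)
```

In the coupled variant the weight-decay term can still influence the direction of each step, whereas in the decoupled variant it acts only as a uniform shrinkage of the weights.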
This understanding could lead to the development of new optimisation strategies that are specifically designed to promote or prevent neural collapse, depending on the desired outcome. The findings have implications for a range of applications, from improving the generalisation performance of machine learning models to enhancing their ability to detect out-of-distribution data.
By revealing the subtle interaction between optimizers, weight decay, and neural collapse, this research provides an important step towards a more complete and nuanced understanding of deep learning. Future work can build upon these insights to design more effective algorithms and unlock the full potential of artificial neural networks. The scale of the empirical validation is particularly compelling.
With 3,900 training runs encompassing diverse datasets, architectures, optimizers, and hyperparameters, the results provide strong evidence supporting the theoretical claims. Unlike previous studies that often relied on simplified models or limited datasets, this work demonstrates the robustness of the findings across a wide range of conditions. The accelerating effect of momentum on NC, observed when training with SGD, represents a novel insight into the dynamics of neural collapse.
The research highlights the importance of considering the specific implementation details of optimisation algorithms. The distinction between Adam and AdamW, despite their algorithmic similarities, underscores the critical role of weight decay coupling. While Adam, with its coupled weight decay, can enable neural collapse, AdamW, with its decoupled approach, appears to hinder it.
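To make that distinction concrete, here is a minimal, simplified Adam-style step in which the only difference is where the weight-decay term enters. It follows the widely used Adam (coupled, L2-style) and AdamW (decoupled) formulations rather than any code from the paper, and real implementations differ in minor details such as the ordering of the decay and the adaptive update.

```python
import numpy as np

def adam_like_step(w, grad, m, v, t, lr=1e-3, wd=1e-2,
                   beta1=0.9, beta2=0.999, eps=1e-8, decoupled=True):
    """One simplified Adam/AdamW step; only the placement of wd differs."""
    if not decoupled:
        # Adam: the L2 term is folded into the gradient, so it is rescaled
        # by the adaptive (per-coordinate) step size.
        grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: the decay is applied directly to the weights, outside
        # the adaptive rescaling.
        w = w - lr * wd * w
    return w, m, v
```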
This subtle difference has been largely overlooked in prior work, and its identification represents a significant contribution to the field. The work provides the first theoretical explanation for why the emergence of neural collapse is dependent on the chosen optimizer. It also sheds light on the overlooked role of weight-decay coupling in shaping the implicit biases of these algorithms.
Since understanding these biases is important for building reliable and predictable machine learning systems, this research has the potential to markedly advance the field. Previous analyses have focused on the loss landscape or on the formation of collapsed representations. This work, however, shifts the focus towards actively controlling collapse, opening up possibilities for building more predictable and efficient AI systems.
Introducing ‘NC0’, a new diagnostic, is a clever move, providing a clearer signal for when collapse is likely to happen. Researchers can now move beyond simply noting that collapse exists and begin to test specific hypotheses about its causes. Initial analysis of the novel NC0 metric reveals that, under specific conditions, it converges to zero at an exponential rate proportional to the weight decay when employing stochastic gradient descent.
Across 3,900 training runs, researchers observed that SignGD with decoupled weight decay, mirroring the implementation in AdamW, resulted in NC0 converging to a positive constant, indicating a failure to achieve neural collapse. Conversely, SignGD utilising coupled weight decay, akin to Adam, exhibited a non-monotonic NC0 trajectory, initially increasing before eventually decreasing.
Beyond these core findings, the work demonstrates an accelerating effect of momentum on neural collapse when training with SGD, a previously unobserved phenomenon in this context. Specifically, the rate at which NC0 diminishes is enhanced by the inclusion of momentum, extending beyond the simple convergence of the training loss. Detailed examination of the NC0 metric’s behaviour under varying learning rates showed that, as the learning rate decreases towards zero, NC0 also vanishes, reinforcing its role as a diagnostic for neural collapse.
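For reference, the momentum variant in question is ordinary heavy-ball momentum. A minimal sketch of one such step with coupled weight decay, following the common PyTorch-style convention, is shown below; the exact variant analysed in the paper may differ in details such as dampening.

```python
def sgd_momentum_step(w, grad, buf, lr=0.1, wd=5e-4, momentum=0.9):
    # Heavy-ball SGD: the weight-decayed gradient accumulates into a
    # momentum buffer, so earlier steps keep contributing to the update.
    g = grad + wd * w
    buf = momentum * buf + g
    return w - lr * buf, buf
```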
These results were obtained using a diverse range of datasets, network architectures, optimizers, and hyperparameters, strengthening the generalizability of the conclusions. The study highlights a critical distinction between Adam and AdamW, with AdamW consistently failing to drive NC0 to zero, even while Adam successfully achieves this. The analysis of SignGD further clarifies the role of weight decay coupling, with decoupled weight decay preventing NC0 from reaching zero.
These theoretical results are supported by extensive experimentation, demonstrating the robustness of the findings across a broad spectrum of conditions. The researchers carefully tracked NC0 alongside other established NC metrics, providing a thorough assessment of the collapse phenomenon. The study found that the convergence rate of NC0 with SGD is directly proportional to the magnitude of the weight decay, suggesting a strong link between regularization and the formation of the collapsed representation.
With the introduction of NC0, the researchers were able to definitively determine whether NC occurred, even in scenarios where traditional NC metrics provided ambiguous signals. Rather than relying on plateaus in existing metrics, NC0’s convergence to zero serves as a rigorous criterion for confirming the presence of neural collapse. Under realistic training regimes, the metric offers a more definitive assessment than previously available.
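As a purely hypothetical illustration of how such a criterion might be applied in practice (this helper is not from the paper), one could require that a recorded NC0 trajectory is both near zero and still shrinking, rather than levelling off at a positive constant.

```python
def appears_to_collapse(nc0_history, window=50, tol=1e-3):
    # Hypothetical check, not from the paper: declare collapse only if
    # NC0 is near zero and still decreasing over the last `window`
    # recorded values, rather than plateauing at a positive constant.
    recent = nc0_history[-window:]
    near_zero = recent[-1] < tol
    still_decreasing = recent[-1] < 0.9 * recent[0]
    return near_zero and still_decreasing
```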
👉 More information
🗞 Optimizer choice matters for the emergence of Neural Collapse
🧠 ArXiv: https://arxiv.org/abs/2602.16642
