NLP Models Demonstrate Single-Nodal Symmetry Breaking during Pre-Training and Fine-Tuning

Researchers have identified a surprising parallel between physics and artificial intelligence, revealing spontaneous symmetry breaking within natural language processing models. Shalom Rosner, Ronit D. Gross, and Ella Koresh, all from Bar-Ilan University, alongside Ido Kanter, demonstrate this phenomenon occurring during both the pre-training and fine-tuning of these models, even under deterministic dynamics and with finite architectures. The research is significant because it uncovers how individual nodes within a network specialise in learning specific tokens or labels and exhibit a crossover in learning ability as the network scales. It thereby offers a novel perspective on the emergence of complex function from simpler components, in contrast to traditional physical systems, where microscopic states are not directly linked to task goals. Their findings, based on the BERT-6 architecture trained on Wikipedia and the FewRel classification task, suggest a fundamental principle governing information processing in neural networks.

Symmetry breaking across NLP attention heads

Scientists have determined that spontaneous symmetry breaking is a key mechanism underlying deep learning models. The deep learning architecture splits the learning task among its parallel learning components, such as filters in convolutional neural networks and attention heads in transformer architectures. This spontaneous symmetry breaking can be observed even at the single-node level, where a node is capable of learning a small number of tokens after pre-training, or of identifying several labels after fine-tuning for a specific classification task. Each token is embedded into a 768-dimensional vector by the embedding layer, encoding both its lexical meaning and its position.
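The following minimal PyTorch sketch illustrates this kind of token-plus-position embedding; the layer sizes (30,522 tokens, 768 dimensions, 128-token inputs) match the BERT-6 setup described here, but the weights are randomly initialised rather than pre-trained, and the variable names are illustrative only.

```python
import torch
import torch.nn as nn

VOCAB, DIM, SEQ_LEN = 30522, 768, 128

token_emb = nn.Embedding(VOCAB, DIM)   # encodes the lexical identity of each token
pos_emb = nn.Embedding(SEQ_LEN, DIM)   # encodes the position within the 128-token input

input_ids = torch.randint(0, VOCAB, (1, SEQ_LEN))     # placeholder token ids
positions = torch.arange(SEQ_LEN).unsqueeze(0)        # 0, 1, ..., 127

embedded = token_emb(input_ids) + pos_emb(positions)  # shape: (1, 128, 768)
print(embedded.shape)
```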
The input sequence length was fixed at 128 tokens, with padding using a [PAD] token. The QKV attention comprises 12 heads with 768 input and 64 output dimensions per head and input token. The first five transformer encoder blocks and the QKV attention of the sixth block remained frozen. The 128 × 768 output nodes of the scaled dot-product attention were connected to a classifier head trained on 90,000 Wikipedia paragraphs to minimize the loss function. This small dataset was selected due to limited computational resources, but was shown to preserve the qualitative results obtained with the entire Wikipedia dataset.
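As a rough illustration of this freezing scheme (not the authors' code), the sketch below builds a six-block BERT masked-language model with the Hugging Face transformers library and freezes the first five encoder blocks plus the attention of the sixth; the configuration values are taken from the description above, while everything else is an assumption.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30522,        # WordPiece vocabulary
    hidden_size=768,         # 12 heads x 64 dimensions per head
    num_hidden_layers=6,     # BERT-6: six transformer encoder blocks
    num_attention_heads=12,
)
model = BertForMaskedLM(config)

# Freeze the first five encoder blocks and the QKV attention of the sixth block,
# leaving the rest of block six and the output (classifier) head trainable.
for block in model.bert.encoder.layer[:5]:
    for p in block.parameters():
        p.requires_grad = False
for p in model.bert.encoder.layer[5].attention.parameters():
    p.requires_grad = False

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable:,}")
```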

The accuracy was estimated using a 90,000-example validation dataset comprising 28,273 distinct tokens, yielding an average accuracy per token (APT) of 0.36 for the ~23,188 tokens predicted correctly at least once. To evaluate each head individually, the attention output nodes of the remaining heads were silenced, so that the 30,522 output units representing the tokens were influenced only by the 64 nodes of that head. The validation dataset was propagated through the first five pre-trained transformer blocks, as well as through the unsilenced output units of the QKV attention of the sixth block, generating a 30,522 × 30,522 confusion matrix per head. Each matrix element (i, j) counts the number of times output unit j, representing token j, was selected by validation inputs with masked token i.
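One way to picture this confusion-matrix bookkeeping is the sparse-matrix sketch below; `true_tokens` and `predicted_tokens` are hypothetical placeholders for the masked token ids and the tokens selected by the unsilenced head, and a sparse format is used because a dense 30,522 × 30,522 count matrix would occupy several gigabytes.

```python
import numpy as np
from scipy.sparse import coo_matrix

VOCAB = 30522
N_VALIDATION = 90_000

# Placeholder arrays standing in for the model's actual masked tokens and predictions.
rng = np.random.default_rng(0)
true_tokens = rng.integers(0, VOCAB, size=N_VALIDATION)
predicted_tokens = rng.integers(0, VOCAB, size=N_VALIDATION)

# Element (i, j) counts how often output unit j was selected when token i was masked.
confusion = coo_matrix(
    (np.ones(N_VALIDATION, dtype=np.int32), (true_tokens, predicted_tokens)),
    shape=(VOCAB, VOCAB),
).tocsr()  # duplicate (i, j) pairs are summed during conversion
```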

The APT of the positive diagonal elements, approximately 0.043, fluctuates only slightly among heads and is roughly three orders of magnitude greater than the random-guess APT of 1/30,522. The diagonal confidence, defined as the ratio between the sum of the positive diagonal elements and the sum of their corresponding columns, averaged 0.176. The fraction of elements in the occupied columns with zero diagonal elements is no more than a few percent, so the diagonal confidence and the confidence of the full confusion matrix are typically similar. In total, the heads recognized 13,243 of the 28,273 tokens in the validation dataset, with most of the unrecognized tokens having low frequencies.
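Under one plausible reading of these definitions, the diagonal APT averages C[i, i] divided by the number of occurrences of masked token i over tokens with a positive diagonal entry, and the diagonal confidence is the sum of the positive diagonal elements divided by the sum of their columns; the sketch below computes both from the sparse confusion matrix built above and is an interpretation rather than the authors' exact procedure.

```python
import numpy as np

def head_statistics(C):
    """C: sparse 30,522 x 30,522 count matrix; C[i, j] = #(masked token i -> predicted token j)."""
    diag = C.diagonal().astype(float)
    row_sums = np.asarray(C.sum(axis=1)).ravel()        # occurrences of each masked token
    col_sums = np.asarray(C.sum(axis=0)).ravel()        # how often each output unit was selected
    hit = diag > 0                                       # tokens predicted correctly at least once
    apt = np.mean(diag[hit] / row_sums[hit])             # average accuracy per recognised token
    confidence = diag[hit].sum() / col_sums[hit].sum()   # diagonal confidence
    return apt, confidence

# apt, confidence = head_statistics(confusion)
```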

For token frequencies above 100, the heads recognized 8,428 tokens out of 9,386. This splitting follows from the different random initial conditions of the weights and biases of each head: under deterministic dynamics, if the 12 heads within a layer started from identical initial conditions, their weights would remain identical throughout training, effectively leaving a single head per layer and causing learning to vanish. The APT of each individual head was considerably lower than the APT obtained with all twelve heads together (~0.36), indicating the importance of cooperation among the heads. The training process splits the learning task among the heads through spontaneous symmetry breaking, represented by the confusion matrices, but additional hidden correlations among the output fields of the heads contribute further to the average APT. This gap was attributed to events in which a masked token was selected by the summation of the output fields emerging from the 12 heads even though the decisions implied by the individual confusion matrices pointed elsewhere. The computational capability of a single node, as well as of a subgroup of nodes in the sixth encoder block of the pre-trained BERT-6, was evaluated in the same manner as the head analysis.
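The cooperation effect can be caricatured with the toy sketch below: each head contributes an output field over the vocabulary, and the token selected by the summed field can differ from every head's individual choice. The arrays are random placeholders, not the model's actual output fields.

```python
import numpy as np

rng = np.random.default_rng(1)
fields = rng.normal(size=(12, 30522))          # placeholder output field of each of the 12 heads

per_head_choice = fields.argmax(axis=1)        # token each head would select on its own
combined_choice = fields.sum(axis=0).argmax()  # token selected after summing all 12 fields

print(per_head_choice)    # 12 individual decisions
print(combined_choice)    # the cooperative decision, which may match none of them
```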

Symmetry breaking emerges in NLP node learning

This crossover in learning ability is governed by a trade-off between a decrease in accuracy due to random guessing among an increased number of possible outputs and an enhancement resulting from nodal cooperation. The results demonstrate an average accuracy per token (APT) of 0.36 for approximately 23,188 tokens that were predicted correctly at least once in the masked process, using a validation dataset comprising 28,273 tokens. The BERT-6 architecture consists of an embedding layer encoding 30,522 tokens into 768-dimensional vectors, followed by six transformer encoders with query, key, and value (QKV) attention. Each QKV attention layer comprises 12 heads with 64-dimensional output nodes per head and input token, for a total of 768 output dimensions. The findings indicate that this symmetry breaking can occur at the smallest scale of a finite network evolving under deterministic dynamics.
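A hypothetical way to probe such a subgroup of output nodes is to zero ("silence") all other node activations before they reach the 30,522 output units, as in the sketch below; the helper function and node indexing are assumptions for illustration, not the authors' exact procedure.

```python
import torch

def silence_all_but(activations: torch.Tensor, keep: list) -> torch.Tensor:
    """activations: (batch, seq_len, 768) attention output; zero every node not listed in `keep`."""
    mask = torch.zeros(activations.shape[-1], dtype=activations.dtype, device=activations.device)
    mask[keep] = 1.0
    return activations * mask

# Example: keep only the 64 nodes assumed to belong to head 3 (indices 192..255).
attention_output = torch.randn(1, 128, 768)                        # placeholder activations
subgroup = silence_all_but(attention_output, list(range(192, 256)))
```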

Node Specialisation Drives the Emergence of Symmetry Breaking

This phenomenon occurs even with deterministic dynamics and finite training architectures, in contrast to traditional observations in statistical mechanics, which typically require stochastic dynamics and infinite systems. These findings were demonstrated using a BERT-6 architecture trained on Wikipedia and fine-tuned for the FewRel classification task, revealing a connection between nodal symmetry breaking and FewRel accuracy. The authors acknowledge that their work focuses on a specific architecture and dataset, potentially limiting the generalizability of the findings. Future research could explore whether this behaviour extends to other NLP tasks and model architectures, potentially uncovering a universal principle governing learning capacity. This research establishes a novel link between symmetry breaking in physics and the behaviour of neural networks, offering a new perspective on how these models learn and generalise.

👉 More information
🗞 Single-Nodal Spontaneous Symmetry Breaking in NLP Models
🧠 ArXiv: https://arxiv.org/abs/2601.20582

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
