Research demonstrates that language models, much like humans, struggle with sequences possessing high local entropy – an information-theoretic measure of local uncertainty. Experiments using both perturbed natural-language text and carefully constructed probabilistic finite-state automata reveal that higher local entropy correlates with poorer learning performance in both Transformer and LSTM models.
The capacity of neural networks to process language relies on inherent assumptions – known as inductive biases – that guide learning from limited data. Recent research investigates whether these biases mirror the constraints observed in human language processing. Taiga Someya (University of Tokyo) and colleagues, including Anej Svete, Brian DuSell, Mario Giulianelli, and Ryan Cotterell (all ETH Zürich), alongside Timothy J. O’Donnell (McGill University), present a quantitative method to examine these biases. In their paper, “Information Locality as an Inductive Bias for Neural Language Models”, the team introduces m-local entropy – a metric quantifying how well a bounded window of preceding symbols predicts the next symbol – and demonstrates that language models struggle with sequences exhibiting high m-local entropy, suggesting a parallel between machine and human sensitivity to local statistical structure.
Quantifying Linguistic Complexity: Entropy and Cross-Entropy Correlate in Neural Language Models
This research establishes a strong correlation between m-local entropy and next-symbol cross-entropy, demonstrating that m-local entropy effectively quantifies the complexity and predictability of a language corpus. This work builds upon existing research in language modelling and cognitive science, offering a new tool for analysing and improving neural network performance.
Experiments involving systematic perturbations of the BLLIP corpus reveal how alterations to word order affect both metrics. Reversing sequences induces the most substantial increase in both m-local entropy and cross-entropy, signifying a considerable rise in modelling difficulty, while localised shuffling within fixed-size windows also increases these values, albeit to a lesser extent. The impact of localised shuffling depends on the window size K employed, indicating that the scope of disruption influences the degree of difficulty. These results extend beyond the BLLIP corpus: analogous trends are observed when the same methodology is applied to the BabyLM corpus, strengthening the generalisability of the findings.
To validate this finding, the researchers systematically perturbed a baseline corpus, creating variants that include reversed sequences, words shuffled by position (even/odd), and localised shuffling within fixed-size windows, as sketched below. They then calculated m-local entropy for each transformed corpus and measured how well LSTM and Transformer models predicted the next symbol, consistently finding that higher m-local entropy coincides with greater difficulty for the models. This increased difficulty manifests as higher next-symbol cross-entropy, providing a quantifiable measure of the impact of corpus complexity on model performance. Scatter plots and Pearson correlation coefficients further substantiate this relationship across different model architectures, alphabet sizes, and numbers of states, underscoring the statistical robustness of the findings.
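As an illustration, the sketch below shows one plausible way to implement the three perturbation types on a tokenised sentence; the function names and the exact shuffling rules are assumptions made for illustration, not the authors’ released code.

```python
import random


def reverse_sequence(tokens):
    """Reverse the token order within a single sentence."""
    return tokens[::-1]


def even_odd_shuffle(tokens):
    """Place even-indexed tokens before odd-indexed ones.
    (One possible reading of a position-based shuffle; the paper's exact rule may differ.)"""
    return tokens[0::2] + tokens[1::2]


def local_shuffle(tokens, k, rng=random):
    """Shuffle tokens independently within consecutive windows of size k."""
    out = []
    for i in range(0, len(tokens), k):
        window = tokens[i:i + k]
        rng.shuffle(window)
        out.extend(window)
    return out


# Example: apply each perturbation to a toy sentence.
sentence = "the cat sat on the mat".split()
print(reverse_sequence(sentence))
print(even_odd_shuffle(sentence))
print(local_shuffle(sentence, k=3, rng=random.Random(0)))
```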
Central to this work is the introduction of m-local entropy, a novel information-theoretic measure that quantifies the local uncertainty within a corpus by assessing how effectively preceding symbols disambiguate the next symbol in a sequence. Calculated using average lossy-context surprisal, m-local entropy provides a means to objectively measure the predictability of a text, offering a valuable tool for analysing and comparing different corpora. Lossy-context surprisal quantifies how unexpected a symbol is given its preceding context, with ‘lossy’ referring to the truncation of longer contexts to a defined length m.
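Concretely, the quantity can be written as an expected lossy-context surprisal in which the context is cut off at the m most recent symbols; the notation below is a plausible rendering of that definition rather than a verbatim reproduction of the paper’s formula:

$$
H_m \;=\; \mathbb{E}\!\left[-\log p\!\left(x_t \mid x_{t-m}, \ldots, x_{t-1}\right)\right],
$$

where $x_t$ is the symbol at position $t$. The larger $H_m$ is, the less the window of $m$ preceding symbols tells a model about what comes next.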
The study concludes that m-local entropy serves as a reliable proxy for assessing corpus difficulty for language models, suggesting that neural language models, similar to humans, are highly sensitive to the local statistical structure of text. This sensitivity implies that the predictability of a corpus is a key determinant of model performance, highlighting the importance of considering linguistic structure when designing and evaluating language models. The BLLIP corpus was used under the terms of the BLLIP 1987-89 corpus license.
Quantitative analysis demonstrates that higher values of m-local entropy consistently correspond to increased cross-entropy, indicating a strong predictive relationship between the two measures. This relationship holds true across different model architectures, suggesting that the underlying principle applies broadly to various language modelling approaches.
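For intuition, the sketch below estimates an empirical m-local entropy from m-gram counts and then checks the Pearson correlation between per-corpus entropy values and model cross-entropy scores; the estimator, padding convention, and placeholder numbers are illustrative assumptions, not the authors’ implementation.

```python
import math
from collections import Counter, defaultdict

from scipy.stats import pearsonr


def m_local_entropy(corpus, m):
    """Plug-in estimate of m-local entropy: the average surprisal of each symbol
    given only the m preceding symbols, with probabilities taken from counts."""
    context_counts = defaultdict(Counter)
    for sentence in corpus:
        padded = ["<bos>"] * m + sentence          # assumed padding convention
        for i in range(m, len(padded)):
            context_counts[tuple(padded[i - m:i])][padded[i]] += 1

    total = sum(sum(c.values()) for c in context_counts.values())
    h = 0.0
    for counts in context_counts.values():
        n_ctx = sum(counts.values())
        for count in counts.values():
            h -= (count / total) * math.log2(count / n_ctx)
    return h  # bits per symbol


# Toy usage: one entropy value per (perturbed) corpus ...
toy_corpus = [["a", "b", "a", "b", "a"], ["b", "a", "b"]]
print(m_local_entropy(toy_corpus, m=2))

# ... paired with the cross-entropy a trained model reaches on that corpus.
local_entropies = [2.1, 2.6, 3.4]        # e.g. original, locally shuffled, reversed
model_cross_entropies = [3.0, 3.5, 4.2]  # placeholder values for illustration
r, p_value = pearsonr(local_entropies, model_cross_entropies)
print(f"Pearson r = {r:.2f}")
```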
Future work should investigate the application of this framework to a wider range of corpora, encompassing diverse linguistic styles and domains, to assess the generalisability of the findings and identify potential limitations. Exploring the interplay between m-local entropy and other established measures of linguistic complexity, such as perplexity and information density, could provide a more comprehensive understanding of language modelling challenges. Furthermore, extending the perturbation techniques to include semantic and syntactic transformations would offer valuable insights into the specific linguistic features that most significantly impact model performance.
Investigating the potential of m-local entropy as a feature for improving language model training could lead to more robust and efficient models. Researchers could explore incorporating m-local entropy into the loss function or using it to guide the selection of training data, potentially leading to improved generalisation performance. Additionally, exploring the use of m-local entropy for evaluating the quality of generated text could provide a more objective and informative metric than traditional measures like BLEU or ROUGE.
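As one hypothetical example of the data-selection idea, the sketch below ranks candidate training documents by their estimated m-local entropy and keeps those below a chosen threshold; the filtering rule, the threshold, and the reuse of the `m_local_entropy` estimator from the earlier sketch are assumptions for illustration, not a procedure proposed in the paper.

```python
def select_training_documents(documents, m, max_entropy, entropy_fn):
    """Hypothetical curriculum-style filter: keep documents whose estimated
    m-local entropy falls below a threshold, so training favours locally
    more predictable text. Illustrative only, not the paper's method."""
    scored = [(entropy_fn([doc], m), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0])   # most locally predictable first
    return [doc for score, doc in scored if score <= max_entropy]


# Example usage with the m_local_entropy estimator sketched earlier:
# easy_docs = select_training_documents(tokenised_docs, m=3, max_entropy=4.0,
#                                        entropy_fn=m_local_entropy)
```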
By providing a quantifiable measure of linguistic complexity, this research contributes to a growing body of work exploring the intersection of linguistics, computer science, and cognitive science, offering new insights into the complexities of language processing.
👉 More information
🗞 Information Locality as an Inductive Bias for Neural Language Models
🧠 DOI: https://doi.org/10.48550/arXiv.2506.05136
