Researchers are investigating the fundamental principles underlying the efficiency of natural language, addressing the long-standing puzzle of its surprisingly low entropy rate. Weishun Zhong from the School of Natural Sciences, Institute for Advanced Study, Doron Sivan from the Department of Brain Sciences, Weizmann Institute of Science, and Tankut Can from the Department of Physics, Emory University, working with colleagues Mikhail Katkov and Misha Tsodyks from the School of Natural Sciences, Institute for Advanced Study, present a new statistical framework that captures the multi-scale structure of language. This collaborative work, spanning institutions in the USA and Israel, introduces a method for segmenting text into meaningful chunks, revealing a hierarchical decomposition that accounts for the observed redundancy in English. Their findings not only quantitatively match the established entropy rate of printed English but also propose a dynamic relationship between semantic complexity and entropy, offering a novel perspective on how language evolves and adapts.
The model operates by self-similarly segmenting text into semantically coherent chunks, progressively refining the analysis down to the level of individual words. This allows analytical treatment of complex relationships within language and reveals how meaning is built from smaller units of information.
Numerical experiments utilising modern large language models and extensive open datasets demonstrate the model’s ability to accurately reflect the structure of real texts at multiple levels of semantic organisation. Critically, the predicted entropy rate, a measure of information content, aligns closely with the well-established estimate of one bit per character for printed English.
The theory proposes that the entropy rate of language is not fixed but increases systematically with the semantic complexity of the text being analysed, a relationship governed by a single free parameter within the model, yielding a concise framework for understanding linguistic information. This hierarchical structure generates statistical redundancies at every level, from grammatical correctness to the logical flow of ideas, and is represented using “semantic trees”, in which each branch summarises a portion of the text.
Researchers explore how ensembles of these trees can be approximated using random recursive partitioning, and by applying this approach to diverse corpora, ranging from children’s literature to contemporary poetry, they establish a robust connection between semantic complexity and information entropy. A recursive semantic chunking procedure, leveraging large language models, forms the core of the methodology.
Texts are partitioned into semantically coherent chunks, moving from the complete document down to the level of individual words, avoiding the limitations of fixed-size chunking and prioritising semantic contiguity. The study employs a K-ary tree structure as a structural prior, dividing text into a maximum of K chunks at each hierarchical level, inspired by research on human narrative memory where similar tree-structured representations model comprehension and recall.
An LLM recursively identifies these coherent chunks at multiple scales, beginning with the full text and continuing until single tokens are reached, resulting in a hierarchical tree in which tokens constitute the leaves and internal nodes represent spans of text at varying resolutions. This induced token tree is modelled as a random process over weak ordered integer partitions (compositions whose parts may be empty), allowing for analytical treatment of the semantic hierarchy.
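The recursion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm_split` is a hypothetical placeholder for the LLM call that proposes up to K semantically coherent sub-spans, and here it simply splits a span evenly.

```python
from dataclasses import dataclass, field

K = 4  # assumed maximum branching factor of the K-ary tree

@dataclass
class Node:
    tokens: list                                # span of tokens this node covers
    children: list = field(default_factory=list)

def llm_split(tokens, k=K):
    """Placeholder for the LLM call that proposes up to k semantically
    coherent sub-spans; here we simply split the span evenly."""
    step = max(1, -(-len(tokens) // k))         # ceil(len / k)
    return [tokens[i:i + step] for i in range(0, len(tokens), step)]

def build_tree(tokens):
    """Recurse from the full span down to single-token leaves."""
    node = Node(tokens)
    if len(tokens) > 1:
        for chunk in llm_split(tokens):
            node.children.append(build_tree(chunk))
    return node

tree = build_tree("the quick brown fox jumps over the lazy dog".split())
```

The leaves of the resulting tree are individual tokens, and each internal node covers a contiguous span, mirroring the structure described in the text.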
To quantify these empirical semantic trees, each node, representing a text span, is associated with its token count, capturing the lengths of semantically coherent units. Across a large corpus, these multiscale length statistics are treated as stochastic and modelled with a self-similar splitting process that recursively divides a text of N tokens into K chunks by placing K-1 boundaries at random between tokens, allowing for empty chunks.
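The boundary-placement rule above admits a direct simulation: drop K-1 boundaries uniformly into the N+1 inter-token gaps and take the gaps between consecutive boundaries as chunk sizes. A short sketch (the function names and parameter values are illustrative, not from the paper):

```python
import random

random.seed(0)  # deterministic illustration

def random_split(n, k):
    """Split an n-token span into k (possibly empty) chunks by dropping
    k-1 boundaries uniformly at random into the n+1 inter-token gaps."""
    cuts = sorted(random.choices(range(n + 1), k=k - 1))
    edges = [0, *cuts, n]
    return [edges[i + 1] - edges[i] for i in range(k)]

def chunk_sizes_by_level(n, k, max_depth, depth=0, out=None):
    """Apply the split self-similarly, recording non-empty chunk sizes
    at each level of the resulting random tree."""
    out = {} if out is None else out
    out.setdefault(depth, []).append(n)
    if n > 1 and depth < max_depth:
        for size in random_split(n, k):
            if size > 0:                  # empty chunks spawn no subtree
                chunk_sizes_by_level(size, k, max_depth, depth + 1, out)
    return out

levels = chunk_sizes_by_level(n=50, k=4, max_depth=3)
```

At the first level the non-empty chunk sizes necessarily sum back to the parent's token count, which is the self-similarity the model exploits.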
The probability that a child chunk has a given size, conditioned on the size of its parent, is defined by a splitting kernel. This kernel forms a Markov chain over chunk sizes, defines a family of random tree ensembles, and enables computation of the probability of each observed semantic tree and, ultimately, an entropy estimate for the semantic structure of the text. Initial analysis of texts using large language models reveals an entropy rate of approximately one bit per character, validating the research’s core theoretical framework and demonstrating that it quantitatively captures the inherent structure of natural language at various semantic levels.
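For small spans the splitting kernel can be computed exactly by enumerating every boundary placement. The sketch below is one concrete kernel consistent with the uniform boundary-dropping rule described above; it is an assumption for illustration, not necessarily the paper's exact kernel.

```python
from collections import Counter
from itertools import product
from math import log2

def splitting_kernel(N, K):
    """Exact child-chunk-size distribution when K-1 boundaries are
    dropped independently and uniformly into the N+1 gaps of an
    N-token parent span (empty chunks allowed)."""
    counts = Counter()
    for cuts in product(range(N + 1), repeat=K - 1):
        edges = [0, *sorted(cuts), N]
        for i in range(K):                    # tally every child's size
            counts[edges[i + 1] - edges[i]] += 1
    total = sum(counts.values())
    return {size: c / total for size, c in counts.items()}

kernel = splitting_kernel(N=6, K=3)
# Shannon entropy of the kernel, in bits
kernel_entropy = -sum(p * log2(p) for p in kernel.values())
```

Chaining such kernels down the tree gives the probability of an observed semantic tree as a product over parent-child splits, from which a per-text entropy estimate follows.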
Further investigation revealed that the entropy rate is not fixed but increases systematically with the semantic complexity of the text corpora, a relationship governed by a single free parameter within the model. Experiments applied LLMs to segment texts from diverse corpora, ranging from children’s books to modern poetry, into series of trees; comparing the resulting chunk-size distributions with those of a random ensemble identified the optimal splitting parameter and yielded a strong match in chunk-size distribution.
At an intermediate tree level (L = 7) for 20 narratives from RedditStories, the empirical chunk-size distribution closely mirrored the theoretical prediction, as demonstrated in figure 2a. Normalised empirical chunk-size distributions, pooled from 100 narratives, consistently aligned with the theoretical prediction across multiple levels L, further supporting the model’s accuracy.
The research successfully converted LLM perplexity into an entropy-rate estimate, demonstrating close agreement with the theoretical entropy rate derived from semantic chunking across diverse corpora. Scientists have long recognised that natural language isn’t random noise, but possesses a deep, underlying structure, and this work offers a compelling new way to quantify that structure, moving beyond simple estimates of entropy to a hierarchical model of semantic organisation.
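The perplexity-to-entropy-rate conversion is a standard one: a perplexity of P corresponds to log2(P) bits per token, which divided by the mean characters per token gives bits per character. A minimal sketch with illustrative, made-up numbers:

```python
from math import log2

def entropy_rate_bits_per_char(perplexity, n_tokens, n_chars):
    """Turn token-level LLM perplexity into an entropy rate in bits per
    character: log2(perplexity) bits per token, divided by the mean
    number of characters per token."""
    return log2(perplexity) / (n_chars / n_tokens)

# Illustrative numbers (not from the paper): perplexity 16 over 1,000
# tokens spanning 4,000 characters gives 4 bits/token over 4 chars/token,
# i.e. 1 bit per character.
rate = entropy_rate_bits_per_char(perplexity=16.0, n_tokens=1000, n_chars=4000)
```

Under these assumed numbers the conversion lands exactly on the one-bit-per-character figure the article cites for printed English.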
The significance lies in its potential to refine the foundations of large language models, as current LLMs, while impressive, operate on a fundamentally statistical basis, predicting the next word based on probabilities. A deeper understanding of semantic hierarchy could allow for the creation of models that not only predict but understand text in a more human-like way, leading to more robust and nuanced natural language processing.
The ability to model semantic complexity as a quantifiable parameter is particularly exciting, suggesting avenues for tailoring models to specific domains or writing styles. However, the model’s reliance on currently available datasets and LLMs introduces a circularity, validating the theory against systems already built on statistical principles and potentially reinforcing existing biases. Furthermore, the claim that entropy rate increases with semantic complexity needs further investigation across a wider range of corpora, including those representing diverse cultural and linguistic contexts. Future work might explore how this framework can be applied to other complex systems, such as music or biological sequences, in search of universal principles of hierarchical information encoding.
👉 More information
🗞 Semantic Chunking and the Entropy of Natural Language
🧠 ArXiv: https://arxiv.org/abs/2602.13194
