Spoken language offers many ways to convey the same idea, and researchers are investigating how speakers transmit information consistently despite this variability. Hailin Hao and Elsi Kaiser, both from the University of Southern California, explore this phenomenon by revisiting the link between information density and a subtle feature of English grammar, the optional use of the word ‘that’ in clauses. Their work demonstrates that speakers tend to omit ‘that’ when the clause is easily predictable, confirming a long-held idea about maintaining a consistent flow of information. Importantly, this new analysis, which utilises a large database of everyday conversations and advanced computational techniques, reveals that existing methods for measuring predictability capture only part of the picture, and that contextual understanding of words provides a more accurate reflection of how people actually speak.
Building on prior work linking sentence structure to predictability, this research revisits the observation that the optional word “that” in English clauses is more likely to be omitted when the clause is easily anticipated. The study advances this line of inquiry by analysing a large, contemporary conversational dataset and employing machine learning and neural language models to refine estimates of predictability. The results confirm the established relationship between predictability and “that”-omission; however, they also reveal that previous measures of predictability, based on verb characteristics, capture substantial variation unrelated to actual predictability. This suggests a need for more nuanced approaches to modelling how easily language is anticipated and how this shapes sentence structure, motivating the development of improved metrics for quantifying predictability in natural language.
Predicting That-Clause Complements Across Verbs
The dataset examines the factors influencing whether a verb is followed by a clause introduced by “that” (for example, “I know that he is here”) or not. The researchers aim to predict when “that” will be present or omitted, recording for each verb the total number of times it appears and the number of times it is followed by a clause. The ratio of these counts gives the probability that the verb takes a clause (for example, “know” is followed by a clause 23.95% of the time). Key observations reveal substantial variability: some verbs almost always take a clause (“wish” at 79.63%) while others rarely do (“mean” at 6.33%). Common verbs like “know”, “think”, “say”, and “like” are well represented, and verbs like “wish”, “hope”, and “guess” strongly prefer clauses, while “mean”, “say”, and “love” do so far less often. The data also lists the features used in a statistical model to predict whether “that” will be present or omitted, including both categorical and continuous features. Important predictors include how often a particular subject appears, the type of subject in the main clause, the position of the verb in the sentence, repetition of “that”, the presence of filler words, and repetition within the sentence. In summary, this data forms part of a study investigating the factors influencing the use of “that” in English, using a combination of verb-specific information and sentence-level features to build a predictive statistical model.
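The per-verb clause rates described above are simple proportions of counts. A minimal sketch, using invented counts rather than the study's actual figures:

```python
# Toy per-verb counts: (total occurrences, occurrences followed by a clause).
# These numbers are illustrative only, not the dataset's real values.
verb_counts = {
    "know": (10000, 2395),
    "wish": (540, 430),
    "mean": (3000, 190),
}

def clause_probability(total, with_clause):
    """Proportion of a verb's occurrences that take a complement clause."""
    return with_clause / total

for verb, (total, cc) in verb_counts.items():
    print(f"{verb}: {clause_probability(total, cc):.2%}")
```

These static per-verb rates are exactly the kind of measure the study argues is too coarse, since they ignore the context each clause occurs in.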
Speakers Balance Predictability During Conversation
Scientists investigated how speakers manage information flow during conversation, focusing on the flexible use of language and the principle of Uniform Information Density (UID). The team discovered that speakers consistently strive to maintain a steady rate of information transmission, adjusting their language to avoid sudden spikes or dips in predictability. Researchers analysed a large corpus of contemporary conversational data, extracting over 50,000 instances of clauses to examine this phenomenon. Experiments replicated previous findings demonstrating a link between information density and the optional use of “that” in English grammar, specifically confirming that “that” is more likely to be included when the following clause is less predictable, effectively smoothing the information flow for the listener.
However, the research went further, revealing limitations in earlier methods of measuring information density. Previous approaches relied on static probabilities based on verb usage, which failed to capture the dynamic, context-dependent nature of prediction. The team developed refined measures of predictability using contextual word embeddings derived from machine learning and neural language models. These new methods account for a greater degree of variation in language use, providing a more accurate assessment of information density. Results demonstrate that these improved measures significantly enhance the ability to model “that”-mentioning patterns, offering a more nuanced understanding of how speakers manage information. This research provides strong support for the UID hypothesis and highlights the importance of dynamic prediction in language production, with implications for improving natural language processing systems and creating more realistic speech synthesis technologies.
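The smoothing logic behind UID can be sketched with surprisal, the negative log probability of a word given its context. The conditional probabilities below are invented for illustration; in the study they would come from a neural language model. Inserting “that” before an unpredictable clause spreads the same information over more words, lowering the peak surprisal the listener must process:

```python
import math

def surprisal(p):
    """Surprisal in bits: -log2 of a word's probability given its context."""
    return -math.log2(p)

# Hypothetical conditional probabilities (toy values, not from the study).
p_onset_bare = 0.02        # P(clause onset | "I believe"): unexpected
p_that = 0.30              # P("that" | "I believe"): fairly likely
p_onset_after_that = 0.10  # P(clause onset | "I believe that"): more expected

peak_bare = surprisal(p_onset_bare)
peak_with_that = max(surprisal(p_that), surprisal(p_onset_after_that))
print(f"peak without 'that': {peak_bare:.2f} bits")
print(f"peak with 'that':    {peak_with_that:.2f} bits")
```

With these toy numbers, the bare clause produces a sharper information spike than the version with “that”, which is the pattern UID predicts speakers avoid.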
Speakers Balance Information Rate in Conversation
This study provides strong evidence supporting the Uniform Information Density hypothesis at the level of syntax in everyday conversations. Researchers demonstrate that information density, as estimated from contextual word embeddings, significantly predicts the optional use of “that” in certain sentence structures, even when accounting for preferences associated with specific verbs. The findings suggest that speakers adjust their language to maintain a consistent rate of information transmission, influencing choices like whether or not to include “that”. Importantly, the research highlights limitations in traditional linguistic measures of information density, which may conflate general predictability with verb-specific tendencies.
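The statistical relationship described above, where higher information density predicts “that”-mention over and above verb-specific preferences, can be sketched as a toy logistic model. The coefficients and the verb-bias term below are invented; the study fits effects of this general shape on real conversational data:

```python
import math

def p_that_mentioned(surprisal_bits, verb_bias, intercept=-1.0, slope=0.5):
    """Toy logistic model: higher clause surprisal -> more likely to say 'that'.
    verb_bias stands in for a verb-specific preference; all coefficients
    are illustrative, not the study's fitted values."""
    z = intercept + slope * surprisal_bits + verb_bias
    return 1 / (1 + math.exp(-z))

# A high-surprisal clause pushes the model toward mentioning "that"...
print(f"{p_that_mentioned(6.0, verb_bias=0.0):.2f}")
# ...while a predictable clause favours omission.
print(f"{p_that_mentioned(1.0, verb_bias=0.0):.2f}")
```

Separating the surprisal slope from the per-verb bias mirrors the study's point: a measure that lumps both together conflates general predictability with verb-specific tendencies.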
The study acknowledges potential inaccuracies stemming from the use of automatically generated transcripts and reliance on a language model, trained primarily on written text, to estimate spoken language predictions. Future research should explore alternative language models, particularly those trained on conversational data, and investigate additional linguistic features that might contribute to structural choices. By combining large, naturalistic corpora with machine learning techniques, this work underscores a promising approach for studying psycholinguistics and modelling linguistic behaviour at scale.
👉 More information
🗞 Uniform Information Density and Syntactic Reduction: Revisiting that-Mentioning in English Complement Clauses
🧠 ArXiv: https://arxiv.org/abs/2509.05254
