Researchers have long recognised that for artificial intelligence to truly collaborate with people, it must accurately anticipate human intentions. Peter Zeng, Weiling Li, and Amie Paige of Stony Brook University, together with Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, and colleagues, investigated how large vision language models (LVLMs) establish ‘common ground’ during communication, a fundamental aspect of human interaction. Their new study, built around a referential communication experiment, reveals a significant limitation in LVLMs’ ability to interactively resolve ambiguous references, drawing on a unique dataset of 356 human and machine dialogues. This work is crucial because it highlights a key deficit preventing seamless human-AI partnership and provides valuable resources, including data and analysis tools, for improving the modelling of shared understanding in AI systems.
AI struggles with shared understanding in dialogue
Scientists have demonstrated a critical limitation in large vision language models (LVLMs): their inability to effectively model common ground during interactive communication, which hinders true collaboration with humans. This innovative study involved human-human, human-AI, AI-human, and AI-AI pairings engaging in multiple rounds of dialogue, aiming to match pictures and establish shared understanding without relying on pre-defined lexical labels. The team released a comprehensive online pipeline for data collection, alongside analytical tools assessing accuracy, efficiency, and lexical overlap, culminating in a novel corpus of 356 dialogues, comprising 89 pairs over four rounds each, which reveals the shortcomings of current LVLMs in resolving referring expressions interactively.
Experiments show that accurately predicting human intent is paramount for generative AI to partner effectively, yet current models struggle with the nuanced process of building shared understanding. Researchers designed a task in which pairs of participants (two humans, a human paired with an AI, or two AIs) played a collaborative game, attempting to identify objects from a set of images through verbal descriptions over multiple turns. This approach allowed a detailed analysis of how each pairing adapted its language, converged on shared references, and ultimately achieved (or failed to achieve) successful communication, mirroring the way humans naturally ground meanings during conversation. The study quantified not only whether LVLMs failed in collaboration, but crucially, why these failures occurred and in which role, director or matcher, the limitations were most pronounced.
The work establishes a new benchmark for evaluating pragmatic competence in AI, moving beyond static assessments to dynamic, multi-turn interactions. By generating real-time dialogues, the researchers captured the complexities of natural conversation, including the negotiation of meaning and the repair of misunderstandings, something largely absent in previous evaluations of LVLMs. The resulting corpus of 356 dialogues provides a rich dataset for the research community, enabling further investigation into the mechanisms underlying successful grounding and the development of more collaborative AI agents. This detailed analysis of language use, coupled with the release of the experimental pipeline and data, opens avenues for improving LVLMs’ ability to adapt to human partners and participate in truly interactive communication scenarios.
This breakthrough reveals that LVLMs struggle to achieve the “lexical entrainment” observed in human conversations, where partners naturally converge on concise and efficient language as they establish common ground. As illustrated in the study, human pairs rapidly refine their referring expressions over rounds of interaction, while LVLMs often fail to adapt, resulting in longer, less efficient dialogues and, ultimately, lower accuracy. The research highlights the importance of modelling common ground, the shared knowledge and beliefs between conversational partners, as a crucial step towards building AI agents capable of seamless and effective collaboration with humans in visually grounded tasks, with implications for applications ranging from assistive robotics to human-computer interfaces.
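Lexical entrainment of this kind can be quantified directly from transcripts. Below is a minimal Python sketch, assuming a hypothetical layout in which each round stores one referring expression per target item; it illustrates the general idea rather than reproducing the authors’ released analysis tools.

```python
# Minimal sketch: measuring lexical entrainment as (a) shrinking referring
# expressions and (b) reuse of the same core words across rounds.
# Hypothetical data layout: expressions[round][item] -> description string.

def mean_tokens(round_expressions):
    """Average token count of the referring expressions in one round."""
    lengths = [len(desc.split()) for desc in round_expressions]
    return sum(lengths) / len(lengths)

def round_overlap(prev_round, curr_round):
    """Mean Jaccard word overlap for the same item across adjacent rounds."""
    overlaps = []
    for prev, curr in zip(prev_round, curr_round):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        overlaps.append(len(a & b) / len(a | b) if a | b else 0.0)
    return sum(overlaps) / len(overlaps)

# Entrained human pairs typically show falling mean_tokens across rounds
# while reusing core vocabulary (stable or rising overlap).
expressions = [
    ["the wicker basket with two tall curved handles", "the dark woven box"],
    ["the one with tall curved handles", "the dark box"],
]
print(mean_tokens(expressions[0]), mean_tokens(expressions[1]))  # 6.0 4.5
print(round_overlap(expressions[0], expressions[1]))
```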
Referential game compares human and AI pairs
Scientists investigated the intricacies of human-AI collaboration through a referential communication experiment employing a factorial design with director-matcher pairs. The research meticulously compared human-human, human-AI, AI-human, and AI-AI interactions over repeated rounds to assess their ability to match pictures of objects lacking conventional labels. Experiments employed a task inspired by Clark and Wilkes-Gibbs (1986), where a director described a sequence of 12 baskets to a matcher, who then attempted to replicate the order on their screen.
To prevent trivial identification, the matcher’s set included the 12 target baskets plus an additional four distractors, forcing the use of full referring expressions. The study pioneered an online platform built with oTree, an open-source Python package, ensuring consistency across all four experimental conditions. Human participants were recruited via Prolific, a platform known for providing vetted and motivated individuals, and were subjected to strict pre-screening criteria, including native English fluency and a 100% approval record to maintain data quality. Researchers harnessed OpenAI’s GPT-5.2, utilising the “none” reasoning option, as the LVLM for the AI-based conditions.
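As a rough illustration of how such a director-matcher round might be configured in oTree, consider the sketch below. The app, class, and field names are hypothetical, and the authors’ released pipeline may be organised quite differently; only the task parameters (12 targets, four distractors, four rounds, pairs of players) come from the study.

```python
# Hypothetical models module for an oTree 5 app implementing a
# director-matcher picture-matching task (not the authors' pipeline).
from otree.api import BaseConstants, BaseSubsession, BaseGroup, BasePlayer, models

class C(BaseConstants):
    NAME_IN_URL = 'basket_matching'
    PLAYERS_PER_GROUP = 2      # one director, one matcher
    NUM_ROUNDS = 4             # each pair repeats the task four times
    NUM_TARGETS = 12           # baskets the director describes in sequence
    NUM_DISTRACTORS = 4        # extra baskets shown only to the matcher

class Subsession(BaseSubsession):
    pass

class Group(BaseGroup):
    pass

class Player(BasePlayer):
    # 'director' or 'matcher', fixed across all four rounds
    role_name = models.StringField()
    # matcher's submitted ordering of basket ids, e.g. '7,2,11,...'
    submitted_order = models.LongStringField(blank=True)

def creating_session(subsession: Subsession):
    # assign roles once, by position within the pair
    for p in subsession.get_players():
        p.role_name = 'director' if p.id_in_group == 1 else 'matcher'
```

oTree’s built-in chat and page machinery (omitted here) would carry the actual dialogue between the paired participants.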
Initial development began with GPT-4o, but the task’s complexity necessitated a more powerful model. Each round involved identifying 12 objects, and the system tracked interactions to observe how partners grounded referring expressions, resolved ambiguities, and established conceptual pacts over time. The task screen layout was designed for optimal visibility, displaying both baskets and chat windows without requiring scrolling, while a typing indicator (a series of dots) mimicked natural conversational cues. This methodology enabled the team to quantify the efficiency and accuracy of both human and AI partners in establishing common ground.
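For the AI conditions, a single director turn might be generated along the following lines, using the OpenAI Python SDK’s chat completions API with image input. This is a hedged sketch: the model name and the “none” reasoning setting follow the article’s description, and the prompt wording is invented for illustration.

```python
# Sketch of one LVLM director turn (illustrative, not the authors' code).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def director_turn(image_path: str, history: list[dict]) -> str:
    """Ask the model to describe the current target basket to its partner."""
    image_b64 = encode_image(image_path)
    messages = [
        {'role': 'system',
         'content': 'You are the director in a picture-matching game. '
                    'Describe the pictured basket so your partner can pick '
                    'it out of 16 candidates.'},
        *history,  # earlier turns, so references can shorten across rounds
        {'role': 'user', 'content': [
            {'type': 'text', 'text': 'Describe the next target basket.'},
            {'type': 'image_url',
             'image_url': {'url': f'data:image/png;base64,{image_b64}'}},
        ]},
    ]
    response = client.chat.completions.create(
        model='gpt-5.2',          # per the article
        reasoning_effort='none',  # per the article; valid values vary by model
        messages=messages,
    )
    return response.choices[0].message.content
```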
Analyses focused on the length of referring expressions, the number of clarification requests, and the overall success rate in matching the target sequence. The resulting data revealed that while LVLMs could interpret increasingly concise expressions, they struggled to produce efficient descriptions themselves, highlighting a critical gap in their ability to engage in truly collaborative communication. Alongside the corpus of 356 dialogues, the team released its online data collection pipeline and analytical tools for evaluating accuracy, efficiency, and lexical overlap, laying bare the challenges LVLMs face in interactively resolving referring expressions, a fundamental skill underpinning human language use.
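Metrics like these are straightforward to compute from a transcript. Below is a minimal sketch, assuming a hypothetical turn format and treating matcher turns that end in a question mark as clarification requests, a deliberately crude proxy.

```python
# Sketch of per-round accuracy, dialogue length, and clarification counts.
# Hypothetical transcript format: a list of {'speaker', 'text'} turns.

def round_accuracy(submitted: list[int], target: list[int]) -> float:
    """Fraction of the 12 sequence positions the matcher placed correctly."""
    return sum(s == t for s, t in zip(submitted, target)) / len(target)

def dialogue_words(turns: list[dict]) -> int:
    """Total word count across all turns: a simple efficiency proxy."""
    return sum(len(t['text'].split()) for t in turns)

def clarification_requests(turns: list[dict]) -> int:
    """Crude count of matcher turns phrased as questions."""
    return sum(1 for t in turns
               if t['speaker'] == 'matcher' and t['text'].rstrip().endswith('?'))

turns = [
    {'speaker': 'director', 'text': 'The basket with the braided rim.'},
    {'speaker': 'matcher', 'text': 'The lighter one or the darker one?'},
    {'speaker': 'director', 'text': 'The darker one.'},
]
print(round_accuracy([3, 1, 2], [3, 2, 1]))                  # 0.333...
print(dialogue_words(turns), clarification_requests(turns))  # 16 1
```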
The team quantified not only the instances in which LVLMs failed in collaboration, but also the specific roles in which those failures occurred. The dialogues were carefully constructed around pictures of objects lacking obvious lexical labels, forcing participants to rely on shared understanding and iterative refinement of descriptions. The researchers will release the full transcripts of these experiments to facilitate further investigation within the research community. This detailed corpus provides a valuable resource for understanding the nuances of human-AI interaction and pinpointing areas for improvement in LVLM design.
Results demonstrate a clear disparity in collaborative efficiency between human pairs and those involving LVLMs. Measurements confirm that human-human pairs consistently exhibited greater efficiency in grounding referring expressions, achieving quicker convergence on the correct target images. The team measured dialogue length and the number of clarification requests as key metrics, finding that LVLM pairs required significantly more turns and prompts to reach successful identification. Specifically, the study highlighted that while LVLM pairs achieved near-human task accuracy in some scenarios, their dialogue differed substantially from human pairs in both efficiency and lexical adaptation.
Tests show that LVLMs struggle to adapt their language over repeated interactions, failing to establish the “ad hoc conventions” (increasingly efficient referring expressions) commonly observed in human communication. Data collected from the AI-AI pairings showed accuracy ranging from approximately 40% to ceiling, depending on the model, in a simplified task designed to assess this ability. Scientists recorded that while LVLMs in the matcher role could sometimes interpret shortened expressions, those in the director role were less capable of generating concise and effective descriptions. This breakthrough delivers crucial insights into the cognitive processes underlying successful communication and highlights the need for LVLMs to move beyond simple pattern recognition towards genuine collaborative understanding.
AI hinders referential communication accuracy and efficiency
Scientists have demonstrated that accurately predicting human intent is crucial for effective collaboration between people and generative AI systems. Researchers conducted a referential communication experiment using human and AI pairs to assess their ability to match pictures without relying on established labels. The study involved director-matcher pairings (human-human, human-AI, and AI-AI) engaging in multiple rounds of interaction. A corpus of 356 dialogues was created, revealing limitations in large vision language models (LVLMs) when resolving referring expressions, a key element of human communication.
The findings indicate that incorporating an AI partner, whether as director or matcher, diminished both accuracy and efficiency compared to human-human pairs. While AI-human pairs initially showed comparable or even superior accuracy in the first round, performance declined rapidly for these pairings, whereas AI-AI pairs exhibited a gradual decrease. Notably, human-human pairs consistently improved in accuracy and efficiency across rounds, suggesting a capacity to establish common ground and develop concise, reusable communication strategies. In contrast, even advanced LVLMs like GPT-5.2 failed to demonstrate any ability to build common ground, even when paired with another instance of the same model.
The authors acknowledge that the study was limited to English, a single type of non-lexicalized object, and primarily utilized GPT-5.2 for the full factorial design. The exclusion of open-weight models may affect reproducibility, and maintaining data quality on the Prolific platform required ongoing monitoring and judgment calls regarding participant compensation. Future research will involve deeper analysis of the dialogue transcripts, potentially using dialogue act analysis, and examining instances of communication repair. This work highlights a significant gap in current LVLM capabilities regarding interactive grounding and suggests potential risks when deploying embodied AI in collaborative, human-facing tasks.
👉 More information
🗞 LVLMs and Humans Ground Differently in Referential Communication
🧠 ArXiv: https://arxiv.org/abs/2601.19792
