Researchers introduce a new task, Aggregative Question Answering, which reasons across large volumes of chatbot conversations to identify collective concerns within specific demographics. A benchmark dataset, WildChat-AQA, comprising over 6,000 questions derived from 182,000 real interactions, reveals the limitations of existing methods for this type of analysis.
The proliferation of conversational AI is generating a wealth of data reflecting public opinion and emerging trends. However, analysing interactions individually overlooks potentially valuable collective insights hidden within large-scale conversation logs. Researchers are now addressing this challenge with a new task, Aggregative Question Answering, which demands reasoning across thousands of user-chatbot exchanges to identify patterns and concerns. Wentao Zhang from the University of Waterloo, Woojeong Kim from Cornell University, and Yuntian Deng, also from the University of Waterloo, detail this work in their article, “From Chat Logs to Collective Insights: Aggregative Question Answering”. They introduce a new benchmark dataset, WildChat-AQA, comprising over 6,000 questions derived from more than 182,000 real-world conversations, and demonstrate the limitations of current methods in extracting these collective insights.
Aggregative Question Answering (AQA) focuses on deriving insights from extensive collections of conversational data, typically interactions between users and chatbots. A key challenge within this field is the ability to synthesise information across multiple turns of conversation to provide comprehensive answers. To facilitate evaluation of AQA systems, researchers have introduced WildChat-AQA, a new benchmark dataset comprising 6,027 questions.
Current methodologies for AQA demonstrate limitations when applied to datasets of this scale. These systems frequently struggle with the computational burden of processing large volumes of conversational data, or fail to effectively integrate information from multiple sources within the dialogue. Consequently, they either produce inaccurate or incomplete answers, or require disproportionately high computational resources to achieve acceptable performance.
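To make the task concrete, the toy sketch below illustrates the flavour of an aggregative question: answering it requires tallying evidence across many conversations for a given demographic, rather than reading any single exchange. This is purely an illustration, not the paper's method, and the conversation records, fields, and topics shown are invented for the example.

```python
from collections import Counter

# Hypothetical, simplified chat-log records: real logs would hold full
# multi-turn dialogues, and demographics/topics would need to be inferred.
conversations = [
    {"demographic": "student", "topics": ["exam stress", "essay writing"]},
    {"demographic": "student", "topics": ["essay writing"]},
    {"demographic": "developer", "topics": ["debugging"]},
    {"demographic": "student", "topics": ["exam stress"]},
]

def top_concerns(logs, demographic, k=2):
    """Answer an aggregative question like 'What concerns do student
    users raise most often?' by counting topic mentions across all
    conversations matching the demographic."""
    counts = Counter()
    for conv in logs:
        if conv["demographic"] == demographic:
            counts.update(conv["topics"])
    return counts.most_common(k)

print(top_concerns(conversations, "student"))
```

Even this trivial version hints at the scaling problem the article describes: a faithful system must locate and synthesise the relevant evidence across hundreds of thousands of conversations, where exhaustive processing becomes computationally expensive.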
👉 More information
🗞 From Chat Logs to Collective Insights: Aggregative Question Answering
🧠 DOI: https://doi.org/10.48550/arXiv.2505.23765
