ConvFill Enables Responsive Conversational Voice Agents with 36-42% Accuracy Gains via Model Collaboration

Building truly natural conversational voice agents presents a significant challenge: powerful cloud-based systems often introduce disruptive delays, while quick on-device responses lack depth. Vidya Srinivas, Zachary Englhardt, and colleagues at the Paul G. Allen School of Computer Science and Engineering, along with Maximus Powers, Shwetak Patel, and Vikram Iyer, address this problem by introducing ‘conversational infill’, a method where a streamlined on-device system generates relevant dialogue and integrates information from a more sophisticated backend in real time. The team developed ConvFill, a 360-million-parameter model trained on simulated conversations, and demonstrated that this approach successfully decouples responsiveness from capability. Evaluation shows that ConvFill achieves accuracy improvements of 36 to 42 percent over comparably sized small models while maintaining response times under 200 milliseconds, paving the way for on-device conversational agents that are both quick and knowledgeable.

Streaming Knowledge Improves Conversational Language Models

This research focuses on enhancing the interaction between humans and Large Language Models (LLMs) in conversational settings, aiming to create more responsive, natural, and well-informed dialogue. Scientists are tackling the challenge of bridging the gap between the intelligence of LLMs and the need for immediate responses in conversation. A key innovation is ‘conversational infill’, a novel task where a smaller model generates dialogue while simultaneously incorporating streaming knowledge from a larger, more powerful backend model. This allows the smaller model to manage the flow of conversation while the larger model provides factual grounding and deeper reasoning.

The research involves dividing tasks between multiple models, allowing for specialization and optimization. The ability to incorporate information during the conversation, known as ‘streaming knowledge’, is crucial for maintaining relevance and accuracy. Techniques like speculative decoding are employed to reduce response generation time, and scientists are developing better evaluation metrics to assess the nuances of natural dialogue. This work builds upon related fields such as Human-Computer Interaction, Natural Language Processing, Speech Recognition, Dialogue Systems, and Multimodal AI. Future research directions include developing more robust evaluation metrics, reducing the gap between small and large models, enhancing conversational grounding, exploring new model architectures, and addressing user interaction patterns. Ultimately, this research aims to create more practical, engaging, and reliable conversational AI systems.
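To make the speculative decoding mentioned above concrete, here is a minimal, illustrative sketch rather than the system described in the paper: `draft_next` and `target_next` are hypothetical greedy next-token functions standing in for a small draft model and a large target model. For clarity the target verifies tokens one at a time, whereas a real implementation scores the whole draft block in a single batched forward pass, which is where the speedup comes from.

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_tokens=32):
    """Small draft model proposes k tokens; large target model verifies them,
    keeping the agreeing prefix plus one corrected token per round."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. The cheap draft model proposes a block of k candidate tokens.
        proposal, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model checks the proposals in order, accepting until
        #    the first disagreement (where its own token is used instead).
        ctx = tokens[:]
        for t in proposal:
            verified = target_next(ctx)
            ctx.append(verified)
            tokens.append(verified)
            if verified != t:
                break  # draft diverged; start a new round from here
    return tokens

# Toy demo: both "models" just count upward, so every draft is accepted.
draft_next = target_next = lambda ctx: ctx[-1] + 1
print(speculative_decode([0], draft_next, target_next, k=4, max_tokens=8))
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```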

Streaming Dialogue Infill with External Knowledge

Scientists have pioneered a new conversational infill technique to combine the reasoning power of large cloud-based models with the immediate responsiveness required for natural conversation. They developed ConvFill, a 360-million-parameter model designed to generate contextually appropriate dialogue while seamlessly integrating streaming knowledge from a powerful backend system. This approach decouples response latency from capability, enabling systems that feel responsive while accessing extensive knowledge resources. The team engineered a multi-turn streaming pipeline where ConvFill generates responses in an interleaved manner, incorporating external knowledge chunks as they become available, as sketched below.
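A hedged sketch of that pipeline, assuming an asyncio-style event loop: the on-device side replies immediately while a simulated cloud backend streams knowledge chunks that are woven into the ongoing reply. `stream_backend_knowledge` and `on_device_opener` are illustrative stand-ins, not the paper's API, and a real system would have the trained small model generate the surrounding dialogue rather than use templates.

```python
import asyncio

async def stream_backend_knowledge(query: str, queue: asyncio.Queue):
    """Stand-in for a large cloud model streaming knowledge chunks."""
    for chunk in ["The Eiffel Tower is about 330 metres tall.",
                  "That figure includes the broadcast antennas at its tip."]:
        await asyncio.sleep(0.8)   # simulated cloud latency
        await queue.put(chunk)
    await queue.put(None)          # sentinel: backend stream finished

def on_device_opener(query: str) -> str:
    """Stand-in for the small model's immediate, query-specific opener."""
    return f"You asked about {query.rstrip('?')}: here is what I know."

async def converse(query: str):
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(stream_backend_knowledge(query, queue))
    yield on_device_opener(query)  # speak right away, before any cloud output
    while (chunk := await queue.get()) is not None:
        # Infill step: the on-device model would rephrase each chunk into
        # natural dialogue; for brevity we emit the chunk directly.
        yield chunk

async def main():
    async for utterance in converse("the height of the Eiffel Tower?"):
        print(utterance)

asyncio.run(main())
```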

They conducted experiments on a randomly selected subset of the NaturalQuestions dataset, comparing ConvFill’s performance against both standalone backend models and similarly sized on-device language models. They tested ConvFill with three different backend models, GPT-5, Claude Sonnet 4.5, and Gemini-2.5-Pro, to assess its adaptability. Response times were consistently low, with all ConvFill configurations operating under 200 milliseconds.
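The latency figure in play here is time-to-first-token (TTFT): the delay between sending a query and receiving the first token of the reply. A minimal sketch of how such a measurement can be taken, assuming a hypothetical `generate_stream` generator that yields tokens as the model produces them:

```python
import time

def measure_ttft_ms(generate_stream, query: str) -> float:
    """Return milliseconds from request to the first generated token."""
    start = time.perf_counter()
    for _first_token in generate_stream(query):
        return (time.perf_counter() - start) * 1000.0
    return float("inf")  # the stream produced no tokens at all
```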

Notably, ConvFill reduced the TTFT of Claude Sonnet 4.5 from 2.16 seconds to approximately 0.17 seconds, and similarly improved the response time of Gemini-2.5-Pro from 10.9 seconds to 0.17 seconds. The team also assessed the accuracy of ConvFill on the NaturalQuestions dataset, revealing a trade-off between responsiveness and full knowledge utilization. To verify that ConvFill maintains consistency and grounding, the researchers used a finetuned DeBERTaV3 model to measure turn-level entailment, assessing whether the generated responses align with the streamed knowledge. Entailment rates ranged from 28.1% to 35.5% depending on the backend model, indicating that ConvFill relies on its streamed context in approximately one-third of cases.
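That entailment check can be reproduced in spirit with an off-the-shelf NLI model. The sketch below uses a publicly available DeBERTa-v3 checkpoint finetuned on NLI data as a stand-in; this is an assumption, since the authors' exact finetuned model is not specified here.

```python
from transformers import pipeline

# Stand-in NLI model (assumption): a public DeBERTa-v3 checkpoint finetuned
# on MNLI/FEVER/ANLI, not the authors' own finetuned DeBERTaV3.
nli = pipeline("text-classification",
               model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")

def turn_entailed(knowledge_chunk: str, generated_turn: str) -> bool:
    """Check whether a generated turn is entailed by the streamed knowledge
    (NLI convention: premise = knowledge chunk, hypothesis = model turn)."""
    result = nli({"text": knowledge_chunk, "text_pair": generated_turn})
    return result["label"].lower() == "entailment"

print(turn_entailed("The Eiffel Tower is about 330 metres tall.",
                    "It stands roughly 330 metres high."))  # True expected
```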

Conversational Infill Decouples Responsiveness and Reasoning

Scientists have developed a new conversational architecture that successfully combines the responsiveness of on-device models with the reasoning capabilities of large cloud-based foundation models. The core of this work is a task called conversational infill, where a small on-device model generates contextually appropriate dialogue while seamlessly incorporating knowledge from a powerful backend system. Experiments demonstrate that this approach effectively decouples response latency from model capability, enabling systems that feel immediately responsive while accessing sophisticated reasoning. The team trained a 360-million-parameter model, named ConvFill, on synthetic multi-domain conversations to perform this task.

Evaluation across multiple backend models reveals that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36 to 42 percent over standalone small models of the same size. Crucially, the system consistently maintains sub-200 millisecond response latencies, ensuring a natural conversational flow. These results demonstrate the potential for building on-device conversational agents that are both knowledgeable and immediately responsive. This architecture addresses the challenge of latency inherent in cloud-based systems by allowing the on-device model to begin responding immediately, without relying on generic filler phrases.

Simultaneously, a high-capacity cloud model processes the conversation in the background, and its outputs are seamlessly incorporated into the dialogue as they become available. This modular design allows for flexible integration of different backend language models across various domains and use cases, while maintaining the benefits of both on-device speed and cloud-based reasoning. The research confirms that users prioritize contextually appropriate responses over simply minimizing latency, and this system delivers both.

👉 More information
🗞 ConvFill: Model Collaboration for Responsive Conversational Voice Agents
🧠 arXiv: https://arxiv.org/abs/2511.07397

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
