Query-driven text summarization, which creates concise, relevant summaries of web documents in response to a user’s search, is a core component of modern web search engines, and researchers continue to look for ways to make it faster and more accurate. Zeyu Xiong from Baidu Inc., Yixuan Nan from the Institute of Information Engineering, Chinese Academy of Sciences, and Li Gao, along with colleagues Hengzhu Tang, Shuaiqiang Wang, and Junfeng Wang, all from Baidu Inc., present an approach that applies generative models to this problem. Traditional methods often struggle with complex search requests and accumulate errors across multi-step processing; the team addresses both limitations by distilling a large language model into an efficient, domain-specific summarization expert. The resulting model surpasses the current production baseline in summary quality and processes approximately 50,000 queries per second at an average latency of 55 milliseconds, a significant step forward for real-time information access.
Query-Driven Summarization Faces Pipeline Bottlenecks
In web search, Query-Driven Text Summarization (QDTS) generates concise and informative summaries from documents based on user queries, enhancing engagement and helping users decide quickly. Traditional extractive methods, which rank candidate segments of a document, dominate industrial applications but have two main limitations. First, their multi-stage processing pipelines introduce cumulative information loss and architectural bottlenecks, because errors in early stages propagate and degrade the final summary. Second, extractive models struggle to fully capture the semantic relationships between queries and document content, producing summaries that lack relevance or coherence. More capable summarization techniques are therefore needed to deliver summaries aligned with user needs.
Keyword Extraction Guides Query Focused Summarization
This document details a research paper presenting QFAS-KE, a novel approach to Query Focused Answer Summarization (QAS) that uses keyword extraction to identify the most important information within answers and to guide the summarization process. The method extracts key terms from the answer text, scores each sentence by its relevance to both the query and the extracted keywords, and filters out repetitive information; the summary is then built from the highest-scoring, non-redundant sentences. Key contributions include keyword-guided relevance scoring, which is reported to significantly improve summary relevance, and improved coverage from focusing on keyword-rich sentences. The redundancy removal step yields more focused and coherent summaries, and the research claims that QFAS-KE achieves competitive or superior performance compared to existing QAS methods.
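The pipeline described above (keyword extraction, sentence scoring, redundancy removal) can be sketched in a few lines of Python. This is an illustrative sketch only, not the QFAS-KE implementation: the frequency-based keyword extractor, the scoring weights, the Jaccard overlap threshold, and every function name here are assumptions introduced for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it",
             "for", "on", "with", "that", "this", "are", "as", "by", "be"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]


def extract_keywords(text, top_k=5):
    """Most frequent terms stand in for a real keyword extractor."""
    return {word for word, _ in Counter(tokenize(text)).most_common(top_k)}


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize(query, answer, max_sentences=2, max_overlap=0.5):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer)
                 if s.strip()]
    keywords = extract_keywords(answer)
    query_terms = set(tokenize(query))

    def score(sentence):
        terms = set(tokenize(sentence))
        # Hypothetical weighting: query matches count twice as much.
        return 2 * len(terms & query_terms) + len(terms & keywords)

    chosen = []
    for sentence in sorted(sentences, key=score, reverse=True):
        terms = set(tokenize(sentence))
        # Skip sentences that overlap too much with ones already selected.
        if any(jaccard(terms, set(tokenize(c))) > max_overlap for c in chosen):
            continue
        chosen.append(sentence)
        if len(chosen) == max_sentences:
            break
    # Present the selected sentences in their original order.
    return " ".join(sorted(chosen, key=sentences.index))
```

Calling `summarize("how to reset router", answer)` on an answer that repeats one sentence verbatim keeps a single copy of it, because the second copy exceeds the overlap threshold.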
Query-Driven Summarization Achieves State-of-the-Art Performance
Researchers have developed a novel framework, QDGenSumRT, to advance query-driven text summarization for large-scale web search. The team transformed a lightweight language model with only 0.1 billion parameters into a specialized query-driven summarization expert through model distillation, supervised fine-tuning, and preference alignment. Knowledge from a larger, 10-billion-parameter generative model was first distilled into the smaller student model, which then underwent full-parameter supervised fine-tuning on carefully curated human-labeled data.
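To make the idea of distillation concrete, here is a minimal sketch of one common formulation: the student is trained to match the teacher's temperature-softened output distribution by minimizing a KL divergence. The article does not say which distillation objective the authors used (distillation can also mean training the student on teacher-generated summaries), so the loss below, the temperature value, and the function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge.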
To better reflect real-world user preferences, the team implemented Direct Preference Optimization (DPO) using implicit feedback from online interactions. Finally, model quantization, coupled with a lookahead speculative decoding strategy, ensures low latency and high throughput. Experiments reveal that this new model surpasses the performance of the current production baseline and establishes a new state-of-the-art standard for the task, delivering exceptional deployment efficiency with approximately 50,000 queries per second and an average latency of 55 milliseconds.
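For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It shows the general objective only; the `beta` value, the argument names, and how the authors turn implicit online feedback into chosen and rejected summaries are assumptions not covered by the article.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summaries.

    Each argument is the total log-probability a model assigns to a
    summary; `ref_*` values come from the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference model agree, the loss equals log 2; it falls as the policy raises the likelihood of the preferred summary relative to the rejected one.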
Efficient Summarization via Distillation and Preference Optimization
This research presents a new generative framework, QDGenSumRT, designed to improve query-driven text summarization for large-scale web search. The team transformed a small language model with only 0.1 billion parameters into a specialized summarization expert through a combination of model distillation, supervised fine-tuning, and direct preference optimization. Extensive experiments show that this approach achieves state-of-the-art performance on industry-relevant metrics, surpassing existing production systems, while processing approximately 50,000 queries per second with an average latency of 55 milliseconds. The authors acknowledge that lookahead decoding introduces computational overhead and requires careful parameter tuning to maximize throughput, and they suggest that future work could focus on optimizing these parameters further.
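The two reported figures can be related with Little's law, which states that the average number of requests in flight equals throughput multiplied by average latency. The short sketch below applies it to the numbers above; it assumes a steady-state system, and the function name is illustrative. The article gives no per-machine breakdown, so the result describes the deployment as a whole.

```python
def in_flight_requests(throughput_qps, mean_latency_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return throughput_qps * (mean_latency_ms / 1000.0)


# Figures reported in the article: ~50,000 queries/s at 55 ms average latency.
concurrency = in_flight_requests(50_000, 55)
print(f"Average queries in flight: {concurrency:.0f}")
```

Under these assumptions, roughly 2,750 queries are being processed at any given moment across the whole serving fleet.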
👉 More information
🗞 Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search
🧠 ArXiv: https://arxiv.org/abs/2508.20559
