Recent advances aim to transform large language models into powerful deep search agents capable of tackling complex real-world problems, but open-source models still struggle with tasks requiring extended reasoning and effective use of browsing tools. To overcome these limitations, Rui Lu, Zhenyu Hou, Zihan Wang, and colleagues at THUDM present DeepDive, a new approach that significantly enhances the long-horizon reasoning abilities of these models. The team achieves this by automatically generating challenging questions from knowledge graphs and then training the models with multi-turn reinforcement learning to use browsing tools effectively. The resulting system outperforms existing open-source alternatives such as WebSailor, DeepSeek-R1-Browse, and Search-o1 on the BrowseComp benchmark, demonstrating the potential for substantial improvements in deep search and paving the way for more sophisticated AI-powered information retrieval.
Language Models, Datasets and Research Tools
This document summarizes the large language models, datasets, APIs, and tools currently driving innovation in the field, reflecting the rapid pace of development and providing a categorized overview of key resources. Key models include OpenAI’s GPT-3, GPT-3.5, and GPT-4, alongside xAI’s Grok-3 and offerings from Anthropic. Serper provides an API for accessing Google Search results, while Toolformer teaches models to call external tools and ReAct interleaves reasoning traces with actions. Supporting these models are datasets like SealQA, designed to challenge question-answering systems, and tools such as Deep Research, an autonomous agent for web-based research. Important techniques include Chain-of-Thought prompting, which encourages models to explain their reasoning step by step, and reinforcement learning for fine-tuning performance. Compositionality, web-powered reasoning, information seeking, and agent models are also central to advances in the field, alongside techniques for internalizing chain-of-action generation.
Automated Deep Search Question Synthesis from Knowledge Graphs
Researchers have developed DeepDive, a novel framework that significantly enhances the deep search capabilities of large language models (LLMs) through an automated method for generating challenging question-answer pairs directly from open knowledge graphs. The process strategically navigates the knowledge graph, identifying relevant entities and attributes and constructing complex queries that require multi-step reasoning to resolve. The team introduced ambiguity by creating “fuzzy” paths within the knowledge graph, forcing the LLM to disambiguate information during the search. This automated synthesis produced a dataset of 3,090 high-quality deep search QA pairs.
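The synthesis idea described above can be sketched in a few lines: walk a multi-hop path through a knowledge graph, then phrase a question that reveals only the starting entity and the relation chain, hiding the intermediate entities so the solver must search to recover them. Everything below is a minimal toy sketch, not the paper's actual pipeline; the graph, entity names, and question template are invented for illustration.

```python
import random

# Toy knowledge graph: entity -> list of (relation, target_entity) edges.
KG = {
    "Lionel Messi": [("played_for", "FC Barcelona"), ("born_in", "Rosario")],
    "FC Barcelona": [("plays_in", "La Liga"), ("stadium", "Camp Nou")],
    "La Liga": [("country", "Spain")],
}

def walk(kg, start, hops, rng):
    """Walk up to `hops` edges from `start`, returning [entity, rel, entity, ...]."""
    path, node = [start], start
    for _ in range(hops):
        edges = kg.get(node, [])
        if not edges:
            break                       # dead end: stop the walk early
        relation, node = rng.choice(edges)
        path += [relation, node]
    return path

def synthesize_qa(kg, start, hops=2, rng=random):
    """Turn a multi-hop path into a question whose answer is the final node.
    Intermediate entities are omitted ("fuzzed") so answering needs search."""
    path = walk(kg, start, hops, rng)
    question = (f"Starting from {path[0]}, follow: "
                + " then ".join(path[1::2])      # relations only, no mid nodes
                + ". What entity do you reach?")
    return question, path[-1]

q, a = synthesize_qa(KG, "Lionel Messi", hops=2, rng=random.Random(0))
```

A real pipeline would additionally filter paths for difficulty and use an LLM to paraphrase and further obfuscate the question, as the paper describes.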
To train the LLMs, the team implemented end-to-end multi-turn reinforcement learning (RL) using the GRPO algorithm. This allowed the LLM to interact with a web environment, issuing search queries and refining its reasoning based on the retrieved information. Experiments demonstrated that multi-turn RL training significantly improved the model’s ability to use search tools effectively at inference time, leading to enhanced deep search performance. DeepDive-32B achieved an accuracy of 14.8% on the BrowseComp benchmark, surpassing existing open-source systems such as WebSailor, Search-o1, and DeepSeek-R1-Browse.
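The multi-turn interaction that RL optimizes has a simple shape: the policy alternates between issuing search queries and emitting a final answer, and a sparse terminal reward scores the finished trajectory. Below is a hedged sketch of that loop; the stub `search` tool, the scripted `toy_policy`, and the exact-match reward are assumptions for illustration, not the paper's web environment, model, or reward design.

```python
def search(query):
    """Stub web-search tool: returns canned snippets keyed by query."""
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "No results.")

def rollout(policy, question, max_turns=4):
    """Alternate policy steps and tool calls until the policy answers."""
    context = [question]
    for _ in range(max_turns):
        action, content = policy(context)    # ("search", q) or ("answer", a)
        if action == "answer":
            return content, context
        context.append(search(content))      # feed retrieved text back in
    return None, context                     # ran out of turns

def reward(answer, gold):
    """Sparse terminal reward: 1.0 for an exact-match answer, else 0.0."""
    return 1.0 if answer == gold else 0.0

# Scripted toy policy: search once, then answer from the retrieved snippet.
def toy_policy(context):
    if len(context) == 1:
        return ("search", "capital of France")
    return ("answer", context[-1].split(" is")[0])

ans, trace = rollout(toy_policy, "What is the capital of France?")
score = reward(ans, "Paris")   # → 1.0
```

In GRPO-style training, many such trajectories are sampled per question and the policy is updated toward the rollouts whose reward beats the group average.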
Further analysis revealed that the RL-trained models exhibited improved test-time scaling of tool calls and benefited from parallel sampling, both contributing to the overall performance gains. A semi-automated data synthesis method, combined with the knowledge-graph data, further boosted accuracy to 22.2% on BrowseComp, supporting the development of more robust and capable deep search agents.
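Parallel sampling at test time can be as simple as running several independent rollouts of the same question and keeping the most common final answer. A minimal sketch, assuming the rollout answers have already been collected (the answer strings below are hypothetical):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate k parallel rollouts by their most common final answer."""
    return Counter(answers).most_common(1)[0][0]

# Four hypothetical parallel rollouts of the same deep-search question:
rollouts = ["Paris", "Paris", "Lyon", "Paris"]
best = majority_vote(rollouts)   # → "Paris"
```

The aggregation rule itself is model-agnostic; spending more samples (larger k) trades extra tool calls for accuracy, which is the test-time scaling behavior the analysis describes.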
Complex Reasoning Training Data Generation
The research team has achieved a significant breakthrough in deep search agents by developing DeepDive, a system designed to substantially improve the ability of large language models to solve complex tasks requiring information retrieval. This work addresses limitations in existing open-source models, specifically their struggles with long-horizon reasoning and the scarcity of challenging training data. The core of this achievement lies in a method for synthesizing training data that goes beyond conventional question answering. Researchers constructed complex reasoning paths within knowledge graphs, starting with initial nodes and navigating through relationships to form extended paths.
These paths were then enriched with node attributes and deliberately obfuscated using a large language model, creating questions that demand iterative searching, filtering, and synthesis of information. A key aspect of this process involves filtering candidate nodes during path construction, maintaining an appropriate out-degree range to ensure path quality. Experiments demonstrate that DeepDive-32B achieves state-of-the-art performance on the BrowseComp benchmark, surpassing models like WebSailor, DeepSeek-R1-Browse, and Search-o1. Multi-turn reinforcement learning training significantly enhances deep search capabilities, contributing substantially to performance improvements across multiple benchmarks. DeepDive enables test-time scaling of tool calls and parallel sampling, increasing efficiency. An example of a synthesized question requires tracing a football player’s career, identifying a team substitution, and ultimately determining which continental club competition the team qualifies for, demonstrating the complexity of the generated queries.
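The out-degree filter mentioned above is straightforward to sketch: during path construction, only expand candidate nodes whose out-degree falls inside a target band, so the walk avoids both near-dead-ends and hub nodes that would make the question too easy. The toy graph and the band [2, 50] below are illustrative assumptions, not the paper's actual settings.

```python
def admissible(kg, node, lo=2, hi=50):
    """Keep nodes with enough branching to be interesting, but not hubs."""
    deg = len(kg.get(node, []))
    return lo <= deg <= hi

# Toy graph: entity -> list of (relation, target_entity) edges.
KG = {
    "A": [("r1", "B"), ("r2", "C"), ("r3", "D")],
    "B": [("rb", "X")],                                  # out-degree 1: near-dead-end
    "C": [(f"rc{i}", f"Y{i}") for i in range(100)],      # out-degree 100: hub node
    "D": [("rd1", "Z1"), ("rd2", "Z2"), ("rd3", "Z3")],  # out-degree 3: in the band
}

# Expanding from "A", only targets passing the filter survive.
candidates = [t for _, t in KG["A"] if admissible(KG, t)]   # → ["D"]
```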
Deep Search Enhanced by Data and Learning
DeepDive represents a significant advancement in deep search agents, aligning complex reasoning with multi-turn web searches through automated question synthesis and reinforcement learning. The team developed a data pipeline that generates challenging, multi-hop questions, mirroring the complexity of real-world long-horizon tasks, and then employed reinforcement learning to enhance an open large language model’s ability to utilize web search tools effectively. Experiments demonstrate that DeepDive-32B achieves a new state-of-the-art result among open-source models on the BrowseComp benchmark, surpassing the performance of several larger and proprietary agents. Analyses reveal that both the challenging training data and the multi-turn reinforcement learning process contribute to improved tool use and scalability.
👉 More information
🗞 DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
🧠 ArXiv: https://arxiv.org/abs/2509.10446
