The challenge of equipping large multimodal models with access to current, real-world knowledge remains a significant hurdle in artificial intelligence, particularly for complex information-seeking tasks. To address this, Kartik Narayan and Vishal M. Patel from Johns Hopkins University, alongside Yang Xu, Tian Cao, Kavya Nerella, and Navid Shiee from Apple, present DeepMMSearch-R1, a novel multimodal model capable of performing dynamic, multi-turn web searches. The system overcomes the limitations of existing approaches by crafting queries for both image and text search tools and iteratively refining them based on retrieved information, enabling self-correction and improved accuracy. DeepMMSearch-R1 represents a substantial advance in multimodal web search, demonstrated through extensive experiments and supported by DeepMMSearchVQA, a new, challenging multimodal dataset designed to teach the model when and how to leverage external knowledge sources.
DeepMMSearchVQA Dataset and Example Queries
This research details the methodology, data, and examples used to evaluate the new multimodal large language model, providing a comprehensive account of the experimental setup for reproducibility. The work showcases interactions with the model on the DeepMMSearchVQA dataset, revealing its reasoning process, search queries, retrieved information, and final answers, all illustrated with visual examples. The researchers also detail the prompts used for supervised fine-tuning, retrieval-augmented generation, and LLM-as-judge evaluation, each carefully engineered to elicit the desired behaviour, with an LLM judge assessing the semantic correctness and fluency of model answers.
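A minimal sketch of how such an LLM-as-judge check might be wired up is shown below; the rubric wording and the `call_llm` backend are illustrative assumptions, not the prompts released with the paper.

```python
# Minimal LLM-as-judge sketch: build a grading prompt that asks whether a model
# answer matches the reference semantically, then parse a yes/no verdict.
# The rubric wording and the call_llm backend are placeholders, not the paper's prompts.

JUDGE_TEMPLATE = """You are grading a visual question answering system.
Question: {question}
Ground-truth answer: {reference}
Model answer: {prediction}

Does the model answer convey the same meaning as the ground-truth answer?
Reply with a single word: "yes" or "no"."""


def call_llm(prompt: str) -> str:
    """Placeholder for the judge model; plug in any chat-completion client here."""
    raise NotImplementedError("Connect this to your LLM provider of choice.")


def judge(question: str, reference: str, prediction: str) -> bool:
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction
    )
    return call_llm(prompt).strip().lower().startswith("yes")


if __name__ == "__main__":
    # Example: print the composed prompt (actual judging needs a real LLM backend).
    print(JUDGE_TEMPLATE.format(
        question="What is the top speed of the bird shown?",
        reference="32 miles per hour",
        prediction="about 32 mph",
    ))
```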
Multimodal Search with Dynamic Query Refinement
This study introduces DeepMMSearch-R1, a novel multimodal large language model designed to access and integrate external knowledge from the web during information seeking. Recognizing limitations in existing search-equipped models, the team engineered a system capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. DeepMMSearch-R1 distinguishes itself by initiating searches based on relevant crops of input images, enhancing image-based information retrieval, and iteratively adapting text queries based on retrieved information, enabling self-reflection and correction during the search process. To train this advanced system, researchers developed DeepMMSearchVQA, a new multimodal VQA dataset constructed through an automated pipeline incorporating real-world information obtained from web search tools.
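The on-demand, multi-turn behaviour described above can be pictured as a simple control loop: at each turn the model either emits a final answer or requests one of the search tools, and the retrieved results are appended to its context before the next turn. The sketch below is an interpretation under assumed tool names and tag formats, not the released implementation.

```python
# Sketch of a multi-turn search loop: at each turn the model either answers or
# requests a tool; retrieved snippets are appended to the context and the loop
# continues. The <answer>/<search> tags and the stub backends are assumptions.
import re


def model_step(context: str) -> str:
    """Placeholder for the multimodal LLM; returns an <answer> or a <search> tag."""
    raise NotImplementedError


def text_search(query: str) -> str:
    """Placeholder text-search tool returning retrieved snippets."""
    raise NotImplementedError


def image_search(image) -> str:
    """Placeholder image-search tool returning matches for an image or image crop."""
    raise NotImplementedError


def answer_with_search(question: str, image, max_turns: int = 5) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        step = model_step(context)
        if (ans := re.search(r"<answer>(.*?)</answer>", step, re.S)):
            return ans.group(1).strip()
        if (q := re.search(r'<search type="text">(.*?)</search>', step, re.S)):
            context += f"\nRetrieved: {text_search(q.group(1))}\n"
        elif '<search type="image">' in step:
            context += f"\nRetrieved: {image_search(image)}\n"
    return "No answer within the turn budget."
```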
This dataset contains diverse, multi-hop queries integrating textual and visual information, specifically designed to teach the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. The training process employs a two-stage pipeline, beginning with supervised fine-tuning followed by online reinforcement learning, allowing the model to refine its search strategies and reasoning abilities. The team demonstrates the system's capabilities through extensive experiments across a range of knowledge-intensive benchmarks, successfully answering questions that require up-to-date information, such as identifying the location of a boat race depicted in an image.
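One way to picture the second, reinforcement-learning stage is as optimizing a scalar reward that mixes answer correctness with adherence to the expected tool-call format. The weighting and format check below are illustrative assumptions, not the paper's reward definition.

```python
# Illustrative reward for the online RL stage: combine answer correctness (e.g.,
# from exact match or an LLM judge) with a small bonus for well-formed tool calls.
# The 0.9/0.1 split and the tag checks are assumptions, not the paper's reward.
import re


def format_ok(rollout: str) -> bool:
    """Every <search> tag must be closed and exactly one <answer> tag present."""
    balanced = rollout.count("<search") == rollout.count("</search>")
    return balanced and len(re.findall(r"<answer>.*?</answer>", rollout, re.S)) == 1


def reward(rollout: str, is_correct: bool) -> float:
    return (0.9 if is_correct else 0.0) + (0.1 if format_ok(rollout) else 0.0)


# Example: a correct, well-formed rollout scores 1.0.
demo = '<search type="text">highest recorded speed of egret</search><answer>32 mph</answer>'
print(reward(demo, is_correct=True))  # 1.0
```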
Dynamic Web Search with Self-Refining Queries
The research team has developed DeepMMSearch-R1, a novel multimodal large language model capable of performing dynamic web searches to answer complex, knowledge-intensive questions. Unlike existing methods that rely on static knowledge bases or rigid search pipelines, DeepMMSearch-R1 initiates searches on demand, adapting both text and image queries during multi-turn interactions. A key innovation is the model’s ability to analyze input images and selectively crop relevant regions before performing image searches, significantly improving search effectiveness. Experiments demonstrate that DeepMMSearch-R1 exhibits self-reflection and self-correction capabilities, iteratively refining its search queries based on retrieved information.
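The crop-then-search step can be sketched in a few lines; the bounding box would come from the model's own grounding of the relevant region (hard-coded here), and `image_search` stands in for whatever reverse-image-search backend is used.

```python
# Sketch of cropping the relevant region before image search. The bounding box
# would come from the model's grounding step (hard-coded here), and image_search
# is a placeholder for the reverse-image-search tool.
from PIL import Image


def crop_region(image: Image.Image, box: tuple) -> Image.Image:
    """Crop (left, upper, right, lower) so the search focuses on the relevant entity."""
    return image.crop(box)


def image_search(region: Image.Image) -> list:
    """Placeholder: send the crop to an image-search tool and return matched pages."""
    raise NotImplementedError


if __name__ == "__main__":
    # A dummy image stands in for the user-provided photo.
    photo = Image.new("RGB", (640, 480), color="white")
    bird_crop = crop_region(photo, (100, 120, 300, 360))
    print(bird_crop.size)  # (200, 240) -- this crop would be passed to image_search
```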
For example, when asked about the speed of a bird in an image, the model initially searched for “highest speed of egret”, then refined the query to “highest recorded speed of egret” after analyzing the initial search results, ultimately arriving at the correct answer of 32 miles per hour. The model’s performance surpasses other baseline approaches and is competitive with OpenAI’s o3 model. To facilitate training and evaluation, the team introduced DeepMMSearchVQA, a new multimodal visual question answering dataset incorporating real-world information obtained from web searches. Its diverse, multi-hop queries challenge the model to determine when to search, what to search for, and how to reason over retrieved content.
Dynamic Reasoning with Web-Augmented Visual Queries
DeepMMSearch-R1 represents a significant advance in multimodal large language models, demonstrating enhanced visual question answering capabilities in contexts requiring extensive knowledge. Researchers developed a system capable of integrating on-demand web searches into its reasoning process, allowing it to address complex, information-seeking queries more effectively. Unlike previous retrieval-augmented methods, DeepMMSearch-R1 dynamically refines search queries and incorporates relevant image crops, leading to more efficient and accurate results. The team achieved this through a two-stage training process, first equipping the model with tool-use capabilities and then refining its behaviour using reinforcement learning.
Extensive experiments across multiple benchmarks demonstrate that DeepMMSearch-R1 outperforms existing approaches, showcasing its ability to leverage web-based information to improve performance. While acknowledging potential ethical risks associated with retrieving and summarizing online content, such as bias and misinformation, the researchers emphasize the importance of responsible deployment practices. Future work may focus on expanding the diversity of tools available to the model, improving its ability to process long-form content, and scaling training to encompass a wider range of languages and modalities.
👉 More information
🗞 DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
🧠 ArXiv: https://arxiv.org/abs/2510.12801
