Multimodal Planning Agent Achieves Efficient Visual Question-Answering with Dynamic mRAG Pipelines

Researchers are tackling the complex challenge of Visual Question-Answering (VQA), a field demanding the seamless fusion of image and text understanding to deliver accurate answers. Zhuo Chen, Xinyu Geng, and colleagues from ShanghaiTech University, alongside Xinyu Wang, Yong Jiang, Zhen Zhang, and Pengjun Xie from Alibaba Tongyi Lab, present a novel approach that significantly improves both the speed and accuracy of VQA systems. Their work addresses the inefficiencies of current methods, which often rely on lengthy, sequential processes, by training a multimodal planning agent to dynamically optimise how information is retrieved and processed. This intelligent agent learns to prioritise essential steps, slashing computation time by over 60% and reducing reliance on expensive data access, all while demonstrably outperforming existing state-of-the-art VQA models across six diverse datasets.

Dynamic Pipeline Decomposition for Efficient VQA enables flexible query processing

This breakthrough addresses a key limitation of current VQA methods, which often rely on rigid, multi-stage pipelines for processing knowledge-intensive queries, a process that can be computationally expensive and inefficient. The new approach optimizes the trade-off between computational cost and performance, leading to significant reductions in processing time and resource usage. The core of this work lies in the creation of a multimodal planning agent capable of adapting to the specific demands of each VQA query. Rather than rigidly following a pre-defined sequence of operations like image grounding, retrieval, and query rewriting, the agent learns to selectively execute only the essential components needed to generate an accurate response.
For simpler queries that can be answered using the model’s existing knowledge, the agent bypasses unnecessary processing steps altogether. Conversely, for more complex, knowledge-intensive questions, the agent strategically decomposes the mRAG workflow, focusing on retrieving relevant information from both visual and textual sources. This dynamic approach, illustrated in the research, allows the system to allocate computational resources more effectively. This substantial improvement in efficiency is achieved while simultaneously enhancing VQA task performance, marking a significant advancement in the field.
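To make this concrete, here is a minimal Python sketch of the idea, assuming a simple registry of mRAG components and a toy keyword heuristic standing in for the fine-tuned planning agent; the step names, data structures, and heuristic are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VQAQuery:
    image_path: str   # path or handle to the image input
    text: str         # the textual question

# Hypothetical registry of mRAG components. The paper's real interfaces are
# not specified here; these stand-ins just tag a shared context dictionary.
STEPS: Dict[str, Callable[[VQAQuery, dict], dict]] = {
    "image_grounding": lambda q, ctx: {**ctx, "regions": "<grounded regions>"},
    "image_retrieval": lambda q, ctx: {**ctx, "visual_ctx": "<retrieved images>"},
    "query_rewriting": lambda q, ctx: {**ctx, "rewritten_query": q.text},
    "text_retrieval":  lambda q, ctx: {**ctx, "textual_ctx": "<retrieved passages>"},
}

def plan_steps(query: VQAQuery) -> List[str]:
    """Stand-in for the planning agent. In the paper this decision is made by
    a fine-tuned MLLM; the keyword heuristic here only illustrates that the
    output is the (possibly empty) subset of components to execute."""
    if any(w in query.text.lower() for w in ("who", "when", "which year")):
        return ["query_rewriting", "text_retrieval"]   # knowledge-intensive query
    return []                                          # answer from parametric knowledge

def run_pipeline(query: VQAQuery) -> dict:
    ctx: dict = {}
    for name in plan_steps(query):    # execute only the selected components
        ctx = STEPS[name](query, ctx)
    return ctx                        # handed to the answer generator (not shown)

print(run_pipeline(VQAQuery("bridge.jpg", "Who designed this bridge?")))
```

The key design point is that the plan is decided per query, so a simple question incurs no retrieval cost at all.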

The researchers meticulously annotated data by decomposing VQA queries to facilitate agent training, then fine-tuned a Multimodal Large Language Model (MLLM) to operate within the dynamic workflow. The agent’s decision-making process is formally defined using mathematical notations, outlining how it leverages visual and textual contexts to generate accurate answers. By intelligently selecting the appropriate processing steps, the agent optimizes resource allocation and minimizes unnecessary computations, paving the way for more scalable and practical VQA systems. This work establishes a new paradigm for efficient multimodal reasoning and opens exciting possibilities for real-world applications requiring rapid and accurate visual understanding.

Dynamic Pipeline Decomposition for Visual Question Answering improves efficiency and accuracy

The research team tackled the limitations of existing mRAG methods, which often rely on rigid, multi-stage processes involving image grounding, image retrieval, query rewriting, and text passage retrieval, potentially creating dependencies and redundant computations. To overcome these challenges, they trained an agent capable of dynamically decomposing the mRAG pipeline, intelligently determining the necessity of each step based on the specific VQA query. This dynamic decomposition contrasts with the traditional rigid pipeline (‘path 4’) and enables a data-agnostic approach to VQA. The study pioneered a method for training the agent to recognize when redundant retrieval steps are unnecessary, thereby reducing input length and computational cost.
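As a rough illustration of why skipping retrieval also shortens the model input, the hypothetical prompt builder below includes visual or textual contexts only when the planner requested them; the prompt format and field names are assumptions made purely for this sketch.

```python
from typing import Optional

def build_prompt(question: str,
                 visual_ctx: Optional[str] = None,
                 textual_ctx: Optional[str] = None) -> str:
    """Assemble the generation prompt, appending retrieved contexts only when
    the planning agent actually requested them. Skipping a retrieval step
    therefore also shortens the model input."""
    parts = [f"Question: {question}"]
    if visual_ctx is not None:
        parts.append(f"Visual context: {visual_ctx}")
    if textual_ctx is not None:
        parts.append(f"Textual context: {textual_ctx}")
    parts.append("Answer:")
    return "\n".join(parts)

# A query the model can answer from its own knowledge produces a much shorter
# input than one that needed both visual and textual retrieval.
print(build_prompt("What colour is a ripe banana?"))
print(build_prompt("Who designed this bridge?",
                   visual_ctx="<retrieved image captions>",
                   textual_ctx="<retrieved passages>"))
```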

Furthermore, the agent consistently outperformed baseline models, including a Deep Research agent and a carefully designed prompt-based method, demonstrating both improved accuracy and substantial gains in inference efficiency. This innovative approach enables a more streamlined and effective VQA process, optimizing the trade-off between efficiency and effectiveness. The team intends to release the code underpinning this work, facilitating further research and development in multimodal AI and VQA systems.

Multimodal Agent Cuts VQA Search Time

This improvement wasn’t achieved at the expense of accuracy; in fact, the team recorded enhanced VQA performance on average across the six test datasets when compared to both the default complete mRAG setting and other baseline methods. Data annotation was initially performed via VQA query decomposition, followed by fine-tuning of a Multimodal Large Language Model (MLLM) agent, establishing a robust foundation for the research. Researchers formally defined a VQA query as q = (i, t), where ‘i’ represents the image input and ‘t’ the textual question, with ‘a’ denoting the corresponding ground truth answer. Experiments categorised queries into four types: those requiring no mRAG, those needing textual contexts (k_t), those needing visual contexts (k_i), and those needing both. The agent was trained to predict these categories, minimising the loss J(θ) = −Σ_{q∈D} log P_θ(c | q, T), where P_θ(a|b) represents the probability the model assigns to ‘a’ given input ‘b’, ‘c’ is a query’s annotated category, and ‘T’ represents the prompts used for category prediction. During inference, if the agent predicted the need for additional contexts, it either rewrote the query to generate a gold query (q_g) or supplemented the existing query with k_i or k_t, demonstrating a flexible and adaptive approach to VQA.
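A minimal sketch of this objective, assuming it reduces to a standard cross-entropy over the four categories and using placeholder logits in place of the fine-tuned MLLM’s prediction, might look as follows in PyTorch.

```python
import torch
import torch.nn.functional as F

# The paper's four query categories: no retrieval, textual contexts (k_t),
# visual contexts (k_i), or both. The logits below are placeholders standing
# in for the fine-tuned MLLM's category prediction.
CATEGORIES = ["no_mrag", "k_t", "k_i", "both"]

def planning_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """J(θ) = −Σ_{q∈D} log P_θ(c | q, T): summed negative log-likelihood of
    each query's annotated category c, implemented here as cross-entropy."""
    return F.cross_entropy(logits, labels, reduction="sum")

# Toy usage: category scores for a batch of three queries.
logits = torch.randn(3, len(CATEGORIES), requires_grad=True)
labels = torch.tensor([0, 1, 3])   # annotated categories for the batch
loss = planning_loss(logits, labels)
loss.backward()                    # gradients flow back to the agent's parameters
print(float(loss))
```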

Dynamic mRAG Pipeline Optimisation for VQA improves accuracy

This agent dynamically decomposes the typical multi-stage retrieval-augmented generation (mRAG) pipeline, intelligently determining which steps are necessary to answer a given question. The research mitigates the inefficiencies of static pipeline architectures commonly found in mRAG systems, offering a promising pathway towards scalable multimodal agent systems. The authors acknowledge a limitation in the predefined workflow of mRAG components, suggesting future work could explore more flexible and adaptable architectures. Further research directions include investigating the agent’s performance in more complex, open-ended VQA scenarios and exploring its potential application to other multimodal tasks.

👉 More information
🗞 Efficient Multimodal Planning Agent for Visual Question-Answering
🧠 ArXiv: https://arxiv.org/abs/2601.20676

Rohail T.

As a quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
