G2-Reader Achieves Robust Multimodal Document QA over Many Interleaved Pages

Researchers are tackling the challenge of question answering over complex, multimodal documents (those containing text, tables, and figures), which current systems struggle to interpret effectively. Yaxin Du, Junru Song, and Yifan Zhou, from Shanghai Jiao Tong University, alongside Cheng Wang, Jiahao Gu, and Zimeng Chen, present a new system, G2-Reader, designed to overcome limitations in existing ‘chunking’ and iterative approaches. This work is significant because G2-Reader employs a dual-graph system, preserving document structure and tracking evidence more effectively than previous methods, achieving a 66.21% average accuracy on the VisDoMBench dataset and notably surpassing the performance of a standalone GPT-5 model.

G2-Reader tackles long multimodal document understanding

Scientists have developed a novel framework, G2-Reader, to significantly improve question answering over long, multimodal documents containing text, tables, and figures. The research addresses fundamental limitations in current retrieval-augmented generation (RAG) systems, which often struggle with maintaining document structure and navigating extensive content effectively. Traditional methods typically break down documents into isolated chunks, disrupting crucial cross-modal alignments and hindering semantic interpretation. Furthermore, iterative retrieval approaches can falter in long documents, becoming lost in irrelevant information due to a lack of persistent global search state.

To overcome these challenges, the team constructed a dual-graph system comprising a Content Graph and a Planning Graph. The Content Graph preserves the inherent structure of the document and the relationships between different modalities, enabling contextual understanding and preventing semantic fragmentation. This graph facilitates message passing between nodes, allowing evidence representations to evolve and maintain global awareness. Simultaneously, the Planning Graph functions as an agentic directed acyclic graph, decomposing complex questions into sub-questions and tracking intermediate findings to guide the evidence completion process.

Experiments conducted on the VisDoMBench, encompassing five multimodal domains, demonstrate G2-Reader’s substantial performance gains. Utilizing the Qwen3-VL-32B-Instruct model, the system achieved an average accuracy of 66.21%, surpassing strong baseline models and even a standalone GPT-5, which scored 53.08%. This breakthrough reveals that a structured approach to both evidence representation and retrieval-reasoning can empower open-source models to outperform even powerful closed-source alternatives. The work establishes a new paradigm for multimodal document question answering, focusing on both the quality of evidence and the reasoning process.

Specifically, G2-Reader converts documents into a heterogeneous multimodal content graph, with nodes representing paragraphs and multimodal elements like tables and figures, and edges encoding document-native relationships. Lightweight message passing within this graph allows nodes to perceive their broader context, restoring global awareness. The Planning Graph then orchestrates reasoning by decomposing questions and iteratively updating the reasoning state with intermediate findings, guiding precise navigation over the content. This iterative feedback loop ensures coherent evidence assembly, moving from initial uncertainty to a logically sound answer grounded in a stable structural foundation. The researchers highlight that this dual-graph architecture effectively addresses the representation and retrieval challenges inherent in complex multimodal documents.
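The lightweight message passing described above can be sketched as a simple neighbourhood-averaging step over the content graph. This is a minimal illustration, not the paper's actual implementation: the node IDs, toy two-dimensional embeddings, and mean aggregation are all assumptions chosen for clarity.

```python
from collections import defaultdict

def message_pass(node_feats, edges, rounds=1):
    """One or more rounds of mean-aggregation message passing:
    each node's vector is averaged with its neighbours', letting
    local chunks absorb surrounding document context.

    node_feats: {node_id: [float, ...]} -- toy embeddings
    edges: [(u, v), ...] -- undirected document-native relations
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    feats = {n: list(f) for n, f in node_feats.items()}
    for _ in range(rounds):
        updated = {}
        for n, f in feats.items():
            # Aggregate the node's own vector with its neighbours'.
            group = [f] + [feats[m] for m in neighbours[n]]
            updated[n] = [sum(col) / len(group) for col in zip(*group)]
        feats = updated
    return feats

# Toy graph: a paragraph linked to the table it describes.
feats = message_pass(
    {"para_1": [1.0, 0.0], "table_1": [0.0, 1.0]},
    [("para_1", "table_1")],
)
# After one round, both nodes blend their own and their neighbour's
# features, so the paragraph "perceives" the table's content.
```

After one round the paragraph and table embeddings each become the mean of the pair, which is the sense in which neighbouring nodes restore each other's global awareness.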

Dual-Graph RAG Enables Improved Multimodal Document Understanding

Scientists developed G2-Reader, a dual-graph system designed to improve question answering over long, multimodal documents. The research addresses limitations in existing retrieval-augmented generation (RAG) systems, specifically the loss of document structure and the accumulation of noise in long contexts. The study formulates RAG as identifying a minimally sufficient evidence set, denoted as E*, that satisfies the information requirements of a given query Q. Achieving this requires navigating a Planning Graph over a Content Graph and retrieving structurally grounded evidence subgraphs for each planning node.

Researchers constructed a multimodal Content Graph, GC, representing the document corpus as a heterogeneous graph with nodes corresponding to atomic information units. Edges within GC encode both structural proximity and semantic relations between these units, preserving the document’s original layout and connections. Each document underwent automated graph construction, beginning with multimodal parsing to identify and generate nodes representing text, images, layouts, tables, and figures. This parsing process extracts atomic information units directly from the source document, ensuring explicit anchoring of multimodal evidence to its original context.
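A heterogeneous content graph of this kind can be modelled with a small data structure: typed nodes for atomic units (text, tables, figures) anchored to their source location, and labelled edges for both structural proximity and semantic relations. This is a hedged sketch under assumed names (`ContentNode`, `ContentGraph`, the `relation` labels); the paper's actual representation may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ContentNode:
    node_id: str
    kind: str        # "text", "table", or "figure"
    payload: str     # extracted text, or a caption for visual units
    page: int        # anchor back to the original document location

@dataclass
class ContentGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def link(self, src, dst, relation):
        # relation distinguishes structural proximity (e.g. "follows")
        # from semantic relations (e.g. "refers_to").
        self.edges.append((src, dst, relation))

# A paragraph explicitly anchored to the table it cites, on the
# same page, mirroring the cross-modal alignment described above.
gc = ContentGraph()
gc.add_node(ContentNode("p3", "text", "Results are shown in Table 2.", page=4))
gc.add_node(ContentNode("t2", "table", "Table 2: accuracy by domain", page=4))
gc.link("p3", "t2", "refers_to")
```

Keeping the `page` anchor on every node is what makes the evidence explicitly traceable back to its original context.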

The team then iteratively refined a Planning Graph, GP, modelled as a directed acyclic graph where nodes represent intermediate sub-questions. This graph facilitates agentic retrieval, where the retrieval process is planned and revised based on evidence sufficiency. For each node qi in GP, the system retrieves a structurally grounded evidence subgraph Gqi from GC, approximating the optimal evidence set E* as the union of these subgraphs. Crucially, GP is refined iteratively, while GC remains static during inference, decoupling evidence representation from reasoning structure and enabling systematic coordination between knowledge representation and inference.
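The planning loop above (decompose, retrieve per sub-question, check sufficiency, refine) can be sketched as follows. All four callables are hypothetical stand-ins for the paper's components, passed in as parameters precisely because their real implementations are model-driven; only the control flow is being illustrated.

```python
def answer_with_planning(question, decompose, retrieve, sufficient,
                         max_iters=3):
    """Sketch of the agentic retrieval loop: decompose the query
    into sub-questions (the Planning Graph nodes), retrieve an
    evidence subgraph per sub-question from the static Content
    Graph, and refine the plan until the pooled evidence suffices
    or the iteration budget runs out.
    """
    planning_nodes = decompose(question)   # initial Planning Graph
    evidence = set()
    for _ in range(max_iters):
        for q in planning_nodes:
            evidence |= retrieve(q)        # union of subgraphs ~ E*
        if sufficient(question, evidence):
            break
        # Plan revision: spawn follow-up sub-questions for the
        # gaps the current evidence leaves open.
        planning_nodes = decompose(question)
    return evidence

# Toy stubs standing in for the VLM-driven components.
ev = answer_with_planning(
    "q",
    decompose=lambda q: ["sub1", "sub2"],
    retrieve=lambda q: {q + "_ev"},
    sufficient=lambda q, e: len(e) >= 2,
)
```

Note that `retrieve` only ever reads from the Content Graph, which stays fixed during inference; all state that changes across iterations lives in the planning side, matching the decoupling the authors describe.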

Experiments employed the VisDoMBench dataset across five multimodal domains to evaluate G2-Reader’s performance. The system, utilising Qwen3-VL-32B-Instruct, achieved an average accuracy of 66.21%, significantly outperforming strong baselines and a standalone GPT-5, which scored 53.08%. This demonstrates the effectiveness of the dual-graph approach in preserving document structure, managing long-context challenges, and improving the accuracy of multimodal question answering.

G2-Reader outperforms GPT-5 on multimodal question answering

Scientists have developed G2-Reader, a dual-graph system designed to improve question answering over long, multimodal documents. The research addresses limitations in current retrieval-augmented generation (RAG) systems, specifically fragmented evidence representation and unstable long-context retrieval. Experiments revealed that G2-Reader achieves an average accuracy of 66.21% on the VisDoMBench across five multimodal domains. This performance significantly surpasses all baseline models and even a standalone GPT-5, which achieved a score of 53.08%. The team measured performance using the VisDoMBench, a challenging dataset encompassing slides, web pages, academic papers, and textbooks.

Results demonstrate that the dual-graph architecture effectively empowers open-source models to outperform powerful closed-source alternatives by providing a structured foundation for complex evidence assembly. G2-Reader utilizes a Content Graph to preserve semantic structures within multimodal documents and a Planning Graph to maintain reasoning state and guide stepwise evidence assembly. Data shows the system’s superiority and robustness across these diverse document types. Researchers introduced the Content Graph to encode document elements and their relationships, creating a static representation of the document’s core information.

Simultaneously, an agent incrementally builds a Planning Graph, a directed acyclic graph that represents interdependent decisions during the question-answering process. Tests prove that this iterative refinement of the Planning Graph, driven by evidence, allows for more precise and comprehensive retrieval. The system formulates memory construction as an iterative, vision-language model (VLM)-driven consolidation process, distilling core semantics and uncovering document-intrinsic relations. Measurements confirm that G2-Reader factorizes evidence construction into two structured spaces, improving the identification of minimally sufficient evidence sets.

The core objective of the research was to identify an evidence set that satisfies the information requirements implied by a given query, denoted E*, where E ⊨ Q indicates that an evidence set E adequately supports the query Q. The breakthrough delivers a system capable of systematically coordinating knowledge representation with the planning, verification, and composition of information during inference. Comprehensive ablations further validate the effectiveness and complementarity of each component within the dual-graph architecture.

Dual graphs enhance long document question answering

Scientists have developed a new system, G2-Reader, designed to improve question answering over long, multimodal documents containing text, tables, and figures. Current methods often struggle with maintaining document structure and avoiding irrelevant information when processing extensive content. G2-Reader addresses these challenges by employing a dual-graph system consisting of a Content Graph and a Planning Graph. The Content Graph preserves the original document’s structure and relationships between different elements, while the Planning Graph acts as an agentic system, breaking down the question into sub-questions and tracking progress towards a complete answer.

This approach allows the system to leverage validated conclusions from lower-level sub-questions, improving the accuracy and reducing reliance on isolated fragments of information. Experiments on the VisDoMBench benchmark, across five multimodal domains, demonstrate that G2-Reader achieves an average accuracy of 66.21%, surpassing strong baseline models and even a standalone GPT-5, which scored 53.08%. The authors acknowledge a limitation in the maximum number of refinement iterations, potentially hindering performance on exceptionally complex queries. Future research could explore methods for dynamically adjusting this limit or incorporating more sophisticated strategies for identifying and addressing evidence gaps.

The findings signify a substantial advancement in multimodal document question answering, offering a more robust and interpretable method for extracting information from complex sources. By maintaining document structure and employing a strategic planning process, G2-Reader mitigates the issues of semantic fragmentation and irrelevant information accumulation that plague existing systems. This work demonstrates the potential for agentic, graph-based approaches to enhance reasoning capabilities in long-form document understanding.

👉 More information
🗞 G2-Reader: Dual Evolving Graphs for Multimodal Document QA
🧠 ArXiv: https://arxiv.org/abs/2601.22055

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
