AI Agents Now Unlock Insights Hidden Within Complex Scientific Data

Scientists increasingly need to interpret complex data presented in scientific tables and figures accurately, yet current artificial intelligence systems often fall short of robust performance. Addressing this critical gap, Xuehang Guo and colleagues introduce Anagent, a novel multi-agent framework designed to enhance scientific table and figure analysis. The work is significant because it both quantifies the difficulties inherent in this task through the AnaBench benchmark, which spans nine scientific domains and seven complexity dimensions, and proposes a solution that delivers substantial performance improvements, even without extensive training, underscoring the importance of task-oriented reasoning for effective data interpretation.

Current artificial intelligence systems often struggle with the complexities of interpreting multimodal knowledge and integrating evidence from diverse sources, hindering accurate inference in scientific research.

To address these limitations, the researchers introduced AnaBench, a large-scale benchmark comprising 63,178 instances sourced from nine scientific domains and systematically categorised across seven dimensions of complexity. Anagent tackles these challenges with four specialised agents: a Planner that decomposes tasks, an Expert that retrieves relevant information, a Solver that synthesises coherent analyses, and a Critic that iteratively refines results through a five-dimensional quality assessment.
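The division of labour among the four agents can be sketched as a simple iterative pipeline. The agent roles below follow the paper's description, but every interface, prompt, and scoring rule is an illustrative assumption, not the authors' implementation:

```python
# Illustrative sketch of the Planner -> Expert -> Solver -> Critic loop.
# Agent roles follow the paper; all function signatures are assumptions.

def planner(task: str) -> list[str]:
    """Decompose the analysis task into ordered subtasks (stub decomposition)."""
    return [f"{task}: identify variables",
            f"{task}: compare trends",
            f"{task}: draw conclusions"]

def expert(subtask: str, corpus: dict[str, str]) -> str:
    """Retrieve context relevant to a subtask (here: naive keyword match)."""
    hits = [text for key, text in corpus.items() if key in subtask]
    return " ".join(hits) or "no external context"

def solver(subtasks: list[str], evidence: list[str]) -> str:
    """Synthesise subtask results and retrieved evidence into one analysis."""
    return " ".join(f"[{s}] {e}" for s, e in zip(subtasks, evidence))

def critic(analysis: str) -> dict[str, float]:
    """Score a draft on the five quality dimensions (stub scores)."""
    dims = ["consistency", "alignment", "knowledge", "correctness", "accuracy"]
    return {d: (1.0 if analysis else 0.0) for d in dims}

def anagent(task: str, corpus: dict[str, str], max_rounds: int = 3) -> str:
    """Run the plan -> retrieve -> solve -> critique loop until accepted."""
    draft = ""
    for _ in range(max_rounds):
        subtasks = planner(task)
        evidence = [expert(s, corpus) for s in subtasks]
        draft = solver(subtasks, evidence)
        if all(score >= 1.0 for score in critic(draft).values()):
            break  # Critic accepts the draft
    return draft
```

In the real system each role would be backed by a multimodal language model with its own tools; the loop structure, with the Critic gating termination, is the part that mirrors the paper's design.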

This innovative framework leverages modular training strategies, combining supervised finetuning with specialised reinforcement learning to optimise individual agent capabilities while ensuring effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves improvements of up to 13.43% in training-free settings and 42.12% with finetuning.

These gains highlight the importance of task-oriented reasoning and context-aware problem-solving for high-quality scientific table and figure analysis. The development of AnaBench provides a robust platform for quantifying the challenges inherent in scientific data interpretation, specifically addressing the heterogeneity of scientific literature and the need for long-context comprehension.

By decomposing the analytical process into specialised stages mirroring human research workflows, Anagent facilitates a more nuanced and accurate understanding of complex scientific data. This advancement promises to accelerate scientific discovery and improve research communication by enabling AI systems to function more effectively as collaborative “AI co-scientists”. The project’s resources are available at https://xhguo7.github.io/Anagent/.

AnaBench construction detailing source acquisition, data processing and instance generation

AnaBench, a large-scale benchmark, was constructed to quantify challenges in scientific table and figure analysis, featuring instances from nine scientific domains systematically categorised along seven complexity dimensions. The benchmark construction employed a four-stage method beginning with source collection, identifying and gathering relevant papers based on predefined criteria.

Data extraction then isolated tables, figures, and associated contexts, with a configurable context retrieval depth controlling the breadth of referenced information. Extracted data underwent two levels of filtering to ensure quality; a paper-level filter removed invalid papers, while a data-level filter excluded instances with formatting errors or missing information.

Following this, instance construction transformed filtered data into complete scientific analysis instances, each comprising data, contexts, metadata, and gold standard analyses, refined using configurable thresholds for sample number and ground truth length. MLLM-assisted task classification, utilising both rule-based heuristics and MLLM classification, categorised AnaBench along seven dimensions encompassing data and analysis complexity.
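The two-level filtering described above can be sketched as a pair of predicates applied in sequence. The field names and the minimum ground-truth length are illustrative assumptions; the paper only states that invalid papers and malformed or incomplete instances are removed under configurable thresholds:

```python
# Sketch of AnaBench's paper-level and data-level filters.
# Field names and the length threshold are illustrative assumptions.

def paper_filter(papers: list[dict]) -> list[dict]:
    """Drop invalid papers, e.g. those missing full text or any tables/figures."""
    return [p for p in papers
            if p.get("full_text") and (p.get("tables") or p.get("figures"))]

def data_filter(instances: list[dict], min_gt_len: int = 50) -> list[dict]:
    """Drop instances with missing fields or too-short gold analyses."""
    kept = []
    for inst in instances:
        if not inst.get("data") or not inst.get("context"):
            continue  # formatting error or missing information
        if len(inst.get("gold_analysis", "")) < min_gt_len:
            continue  # ground truth below the configurable length threshold
        kept.append(inst)
    return kept
```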

Data complexity was assessed across four dimensions: type of analysis data, encompassing tables, figures, or both; domain discipline, spanning nine broad areas and 170 disciplines; format, either LaTeX or XML; and source type, distinguishing between general research papers and reviews. Analysis complexity was characterised by width, representing the reference scope as self-contained, internal, external, or mixed; depth, indicating analytical rigor as shallow or in-depth; and objective, focusing on methodology or experiment.
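The seven-dimension taxonomy above maps naturally onto a typed record. The value sets follow the paper's description; the class itself and its field names are an illustrative encoding, not part of the released benchmark:

```python
from dataclasses import dataclass
from typing import Literal

# The seven AnaBench complexity dimensions as typed fields.
# Value sets follow the paper's taxonomy; the class is illustrative.

@dataclass
class AnaBenchLabels:
    # Data complexity (four dimensions)
    data_type: Literal["table", "figure", "both"]
    domain: str  # one of 170 disciplines across nine broad areas
    fmt: Literal["latex", "xml"]
    source: Literal["research", "review"]
    # Analysis complexity (three dimensions)
    width: Literal["self-contained", "internal", "external", "mixed"]
    depth: Literal["shallow", "in-depth"]
    objective: Literal["methodology", "experiment"]
```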

Evaluation incorporated rule-based metrics including ROUGE-L, BLEU, and word overlap, alongside semantic similarity calculations to assess generated analysis quality. Preliminary studies using Qwen3-VL-8B as a backbone revealed that performance struggled to exceed 60% across these metrics, highlighting difficulties in multimodal understanding and in-depth inferential generation.
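Of the rule-based metrics mentioned, ROUGE-L is representative: it scores a generated analysis against the gold standard by the longest common subsequence of tokens. A minimal sketch, not the benchmark's exact implementation (which may tokenise and weight differently):

```python
# Minimal ROUGE-L F1, the longest-common-subsequence metric named above.
# Whitespace tokenisation is a simplifying assumption.

def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 over the LCS of prediction and reference tokens."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_len(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```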

AnaBench benchmark construction and assessment of multi-modal large language model analytical capabilities

Initial evaluations of multi-modal large language model agents revealed that performance struggled to exceed 60% across key metrics, highlighting difficulties in multimodal and multi-layout understanding. The research introduces AnaBench, a large-scale benchmark comprising instances from nine scientific domains systematically categorised along seven complexity dimensions.

AnaBench was constructed through a four-stage process encompassing source collection, data extraction, instance construction, and MLLM-assisted task classification. Data complexity within AnaBench is characterised by four dimensions: type, domain, format, and source, spanning 170 disciplines across nine domains.

Analysis complexity is defined along three dimensions: width, depth, and objective, to comprehensively assess analytical challenges. Preliminary studies utilising the Qwen3-VL-8B agent backbone demonstrated pronounced difficulties in in-depth analysis requiring inferential generation. The Anagent framework, a multi-agent system, was developed to address these challenges, comprising a Planner, Expert, Solver, and Critic, each equipped with specialised tools.

Evaluation employed rule-based metrics including ROUGE-L, BLEU, and word overlap, alongside semantic assessment utilising cosine similarity, SciBERT-Score, and METEOR scores. Furthermore, an MLLM-as-judge protocol, leveraging Gemini-2.5-Flash and GPT-4.1-mini, graded generated analyses across five dimensions: consistency, alignment, knowledge utilisation, correctness, and accuracy.
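Aggregating the judge's five dimension scores into a single grade could look like the sketch below. The equal-weight mean and the dimension keys are assumptions for illustration; the paper does not specify the aggregation scheme here:

```python
# Sketch of aggregating MLLM-as-judge scores over the five grading
# dimensions named above. Equal weighting is an assumption.

DIMENSIONS = ("consistency", "alignment", "knowledge_utilisation",
              "correctness", "accuracy")

def aggregate_judgement(scores: dict[str, float]) -> float:
    """Mean of the five dimension scores; fails loudly if one is missing."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```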

Anagent achieved substantial improvements of up to 13.43% in test-time optimisation and 42.12% with modular training, as measured by relative gains. These results underscore the importance of task-oriented reasoning and context-aware problem-solving for high-quality scientific table and figure analysis.

AnaBench evaluation reveals enhanced scientific data analysis capabilities

Researchers have developed Anagent, a multi-agent framework designed to improve the analysis of scientific tables and figures. This system addresses limitations in current artificial intelligence systems which struggle with the complexity and variability inherent in scientific data, particularly when requiring integration of information from multiple sources and domain-specific knowledge.

Anagent decomposes complex analytical tasks into manageable subtasks, retrieves relevant information using specialised tools, synthesises this information into coherent analyses, and then iteratively refines these analyses through a five-dimensional quality assessment process. The framework’s performance was evaluated using AnaBench, a new large-scale benchmark comprising instances from nine scientific domains categorised by seven complexity dimensions.

Results demonstrate substantial improvements in performance across 170 subdomains, both in training-free settings and with supervised finetuning, highlighting the importance of task-oriented reasoning and context-aware problem-solving for accurate scientific data analysis. The authors acknowledge that the system’s performance depends on the quality of the individual agents and the effectiveness of their collaboration. Future research may focus on further enhancing individual agent capabilities and optimising the interaction between agents to achieve even more robust and accurate scientific analysis.

👉 More information
🗞 Anagent For Enhancing Scientific Table & Figure Analysis
🧠 ArXiv: https://arxiv.org/abs/2602.10081

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
