Researchers present MDBench, a new dataset designed to rigorously evaluate large language models’ ability to reason across multiple documents. Generated synthetically from structured knowledge, the benchmark yields challenging question-answer pairs at low cost, reveals significant difficulties for current models even with concise document sets, and enables targeted analysis of multi-document reasoning capabilities.
The increasing sophistication of large language models (LLMs) necessitates robust evaluation benchmarks, particularly in areas demanding complex reasoning. Multi-document reasoning, where models synthesise information from multiple sources, presents a significant challenge, yet existing benchmarks struggle to provide rigorous assessment due to the cost of annotation. Researchers at the University of Michigan and Cisco Research address this gap with a novel synthetic benchmark, MDBench, designed to evaluate LLMs’ ability to reason across multiple documents. Joseph J. Peper, Wenzhao Qiu, Ali Payani, and Lu Wang detail their approach in the article “MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance”, presenting a dataset created through a controlled, LLM-assisted process that transforms structured knowledge into challenging document sets and question-answer pairs, allowing for targeted analysis of model capabilities.
MDBench establishes a rigorous benchmark for evaluating large language models (LLMs) on multi-document reasoning, a capability gaining prominence as these models process increasingly lengthy inputs. Current evaluation methods frequently lack the precision needed to thoroughly examine performance when models must synthesise information across multiple sources, and the creation of such benchmarks is traditionally resource-intensive due to the extensive annotation required for long-form texts. MDBench circumvents these limitations through a synthetic generation process, enabling controlled and efficient creation of challenging document sets and corresponding question-answer pairs.
MDBench’s construction starts from structured data in tabular form. LLM-assisted editing then introduces dependencies and complexities into this data, directly demanding multi-hop reasoning from the models under evaluation. Multi-hop reasoning requires a model to chain several inferential steps to reach a conclusion, rather than simply retrieving a single fact. The process culminates in the generation of realistic-sounding documents, yielding a dataset paired with questions designed to assess information synthesis.
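To make the pipeline concrete, the sketch below shows the general shape of knowledge-guided generation: start from a small table, apply an edit that injects a cross-row dependency, render each row as a short document, and pair the result with a question that can only be answered by combining documents. The seed table, the `llm_edit` stand-in, and the question template are illustrative assumptions, not the authors’ actual code or data.

```python
# Minimal sketch of knowledge-guided multi-document QA generation.
# The table, the llm_edit() placeholder, and the question template are
# illustrative assumptions, not the MDBench authors' implementation.

from dataclasses import dataclass


@dataclass
class QAPair:
    documents: list[str]
    question: str
    answer: str


def llm_edit(table: list[dict]) -> list[dict]:
    """Stand-in for an LLM-assisted edit that injects a cross-row dependency.

    Here the dependency is simulated deterministically: one row's launch year
    is rewritten relative to another row, so answering a question about it
    requires combining facts from two documents (a simple multi-hop case).
    """
    edited = [dict(row) for row in table]
    edited[1]["note"] = f"launched two years after {edited[0]['mission']}"
    edited[1].pop("year")
    return edited


def render_documents(table: list[dict]) -> list[str]:
    """Render each row of structured knowledge as a short standalone document."""
    docs = []
    for row in table:
        facts = ", ".join(f"{k}: {v}" for k, v in row.items() if k != "mission")
        docs.append(f"Report on {row['mission']}. {facts}.")
    return docs


def build_example(table: list[dict]) -> QAPair:
    """Produce one multi-document QA example from a seed table."""
    edited = llm_edit(table)
    docs = render_documents(edited)
    # The question is answerable only by chaining facts across both documents.
    question = f"In what year was {table[1]['mission']} launched?"
    answer = str(table[0]["year"] + 2)
    return QAPair(documents=docs, question=question, answer=answer)


if __name__ == "__main__":
    seed_table = [
        {"mission": "Orbiter A", "year": 1998, "agency": "ESA"},
        {"mission": "Orbiter B", "year": 2000, "agency": "NASA"},
    ]
    example = build_example(seed_table)
    for doc in example.documents:
        print(doc)
    print("Q:", example.question)
    print("A:", example.answer)
```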
MDBench categorises reasoning skills into five distinct types, allowing for granular assessment of LLM performance across different cognitive dimensions: knowledge aggregation, multi-hop reasoning, numeric reasoning, soft reasoning (handling nuanced or imprecise information), and temporal reasoning (understanding time-related contexts). The benchmark’s design ensures each skill receives focused evaluation, providing a detailed profile of a model’s strengths and weaknesses.
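As a rough illustration of how such skill tags support granular analysis, the snippet below buckets exact-match accuracy by skill. The record schema and demo values are invented for illustration and are not MDBench’s actual format or evaluation metric.

```python
# Hedged sketch: per-skill accuracy aggregation over tagged QA records.
# The record fields ("skill", "answer", "prediction") are assumed, not
# MDBench's real schema.

from collections import defaultdict


def accuracy_by_skill(records: list[dict]) -> dict[str, float]:
    """Bucket exact-match accuracy by the reasoning-skill tag on each record."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in records:
        total[r["skill"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["skill"]] += 1
    return {skill: correct[skill] / total[skill] for skill in total}


if __name__ == "__main__":
    # Invented demo records for illustration only.
    demo = [
        {"skill": "numeric reasoning", "answer": "2000", "prediction": "2000"},
        {"skill": "numeric reasoning", "answer": "14", "prediction": "12"},
        {"skill": "temporal reasoning", "answer": "after", "prediction": "after"},
    ]
    print(accuracy_by_skill(demo))
    # -> {'numeric reasoning': 0.5, 'temporal reasoning': 1.0}
```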
Analysis of popular LLMs using MDBench reveals significant challenges even with relatively short document sets, indicating a critical gap in their reasoning capabilities. The knowledge-guided generation technique not only facilitates targeted analysis of these capabilities but also offers a flexible framework for adapting the benchmark to future advancements and emerging challenges, an adaptability that matters given the rapid pace of LLM development.
Researchers acknowledge the limitations of current LLMs in effectively synthesising information from multiple sources, highlighting the need for continued development in this area. Future work will focus on expanding the MDBench dataset and incorporating more complex reasoning tasks, with the aim of improving the reliability and trustworthiness of these models.
MDBench represents a significant step forward in the evaluation of LLMs, providing a valuable tool for researchers and developers. By addressing the limitations of existing methods and offering targeted assessment of specific reasoning skills, the benchmark supports the development of models that can more reliably process and synthesise information from multiple sources, benefiting applications ranging from information retrieval and question answering to decision making and problem solving.
👉 More information
🗞 MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
🧠 DOI: https://doi.org/10.48550/arXiv.2506.14927
