Researchers have identified a critical need for robust evaluation of Retrieval-Augmented Generation (RAG) systems within the complex field of biomedicine. Wei Zhu from the University of Hong Kong, alongside colleagues, presents MRAG, a new benchmark designed to rigorously assess RAG performance across diverse English and Chinese tasks, utilising a comprehensive corpus built from Wikipedia and PubMed. This work is significant because, despite the rapid integration of RAG into scientific and clinical question answering, a dedicated and thorough evaluation benchmark has been notably absent. MRAG directly addresses this gap and includes the MRAG-Toolkit to enable systematic analysis of RAG components. Their experiments demonstrate RAG’s potential to improve LLM reliability, whilst also highlighting the influence of various implementation choices and revealing trade-offs between reasoning quality and response readability.
The research team constructed MRAG using a comprehensive corpus built from both Wikipedia and PubMed, encompassing tasks in both English and Chinese languages, and comprising a total of 14,816 test samples across four distinct task cohorts. Crucially, the team didn’t simply create a dataset; they also developed the MRAG-Toolkit, a dedicated resource designed to facilitate rigorous exploration of various RAG components and configurations.
Experiments conducted with the MRAG benchmark and toolkit reveal that integrating RAG consistently enhances the reliability of LLMs across all tested medical tasks. The study demonstrates that the performance of these RAG systems is significantly influenced by the chosen retrieval approach, the size of the underlying LLM, and the specific prompting strategies employed. While RAG demonstrably improves the usefulness and reasoning quality of LLM responses, the research also notes a subtle trade-off: responses to more complex, long-form questions may experience a slight reduction in readability. These findings highlight the nuanced interplay between accuracy and clarity when leveraging RAG in medical contexts.
To further empower the research community, the scientists are releasing both the MRAG-Bench dataset and the MRAG-Toolkit under a CC BY 4.0 license. This open-access approach will facilitate wider adoption and application of the benchmark by both academic researchers and industry professionals, accelerating innovation in medical question-answering systems. The MRAG-Toolkit supports three distinct retrieval approaches (sparse retrieval, semantic retrieval, and webpage search), alongside a range of retrieval algorithms, locally deployed or API-based LLMs, and diverse prompting strategies, offering a flexible platform for systematic investigation. Extensive experimentation using the MRAG-Toolkit confirms a log-linear relationship between LLM size and performance, with larger models benefiting more substantially from RAG augmentation. The work establishes that RAG not only improves factual accuracy and knowledge integration but also enhances the transparency of LLM reasoning by grounding responses in retrieved documents. Ultimately, this research provides a crucial foundation for building more reliable, informative, and trustworthy AI systems for the demanding field of biomedicine, paving the way for improved clinical decision-making and accelerated medical discovery.
Why a dedicated medical RAG benchmark and toolkit are crucial
Scientists introduced the Medical RAG (MRAG) benchmark to address the lack of comprehensive evaluation for Retrieval-Augmented Generation (RAG) systems in the medical domain. This work pioneers a new corpus constructed from both Wikipedia and PubMed, providing a robust foundation for assessing RAG performance across diverse medical tasks in English and Chinese languages. Crucially, researchers also developed the MRAG-Toolkit, a systematic implementation of RAG components designed to facilitate detailed exploration of different configurations. The toolkit enables rigorous analysis of how various elements contribute to overall system effectiveness, moving beyond simple performance metrics.
Experiments employed a multi-faceted approach, evaluating Large Language Models (LLMs) both with and without RAG on the closed-form tasks: Multiple Choice Question Answering (MCQA, including PubMedQA, BioASQ and MMLU-Med), Information Extraction (IE), and Link Prediction (LP). For these tasks, the team utilised the combined corpus as the source of referential documents, retrieving eight snippets per query using the BGE-base retriever and concatenating them to the prompt when RAG was activated. The temperature was set to 0.7, top_p to 0.8, and the repetition penalty to 1.05, ensuring consistent decoding across all LLMs tested. The COT-Refine strategy was consistently used to elicit responses, promoting detailed reasoning.
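To make that setup concrete, here is a minimal sketch (not the authors' code) of the closed-form RAG pipeline as described: eight snippets retrieved per query with a BGE-base dense retriever, concatenated to the prompt, and answers decoded with temperature 0.7, top_p 0.8, and repetition penalty 1.05. The corpus contents and the `call_llm` helper are placeholders.

```python
# Sketch of the described RAG condition; corpus and call_llm() are placeholders.
from sentence_transformers import SentenceTransformer
import numpy as np

retriever = SentenceTransformer("BAAI/bge-base-en-v1.5")   # BGE-base retriever
corpus = ["<Wikipedia or PubMed snippet>", "<another snippet>"]  # referential documents
corpus_emb = retriever.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 8) -> list[str]:
    """Return the top-k corpus snippets by cosine similarity."""
    q_emb = retriever.encode([query], normalize_embeddings=True)
    scores = corpus_emb @ q_emb[0]
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(question: str) -> str:
    """Concatenate retrieved snippets to the question, as in the RAG condition."""
    context = "\n".join(retrieve(question))
    return f"Reference documents:\n{context}\n\nQuestion: {question}\nAnswer:"

# Decoding settings reported in the text; call_llm stands in for any locally
# deployed or API-based LLM supported by the toolkit.
GEN_KWARGS = dict(temperature=0.7, top_p=0.8, repetition_penalty=1.05)
# answer = call_llm(build_prompt("Which drugs interact with warfarin?"), **GEN_KWARGS)
```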
To assess long-form question answering (LFQA), the study pioneered an arena-based evaluation system where GPT-4 and GPT-3.5 generated responses, with or without RAG, and were then judged by GPT-4 based on usefulness and reasoning quality. Each model pair engaged in at least 80 matches across evaluation axes, with medical experts also annotating a subset to validate the quality of judgements. The Elo rating system, initially assigning each LLM a score of 1000 with a K factor of 40, was then used to rank the models, revealing nuanced differences in performance. This innovative approach allows for a more granular understanding of how RAG impacts the quality of long-form responses, beyond simple accuracy scores.
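As an illustration, the standard Elo update under the stated settings (initial rating 1000, K factor of 40) looks like the sketch below; the match outcomes are invented for demonstration and do not reflect the paper's results.

```python
# Standard Elo update with K = 40; winners/losers here are hypothetical.
K = 40

def expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1 for a win, 0.5 for a tie, 0 for a loss (as judged by GPT-4)."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

ratings = {"GPT-4+RAG": 1000.0, "GPT-3.5": 1000.0}   # initial ratings of 1000
for winner, loser in [("GPT-4+RAG", "GPT-3.5")] * 3:  # made-up match results
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)
print(ratings)
```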
The results demonstrate that RAG consistently enhances LLM reliability across MRAG tasks, improving average scores on the MCQA, IE, and LP cohorts. Notably, Qwen2.5-72B outperformed MEDITRON on all three cohorts when utilising RAG, suggesting a superior ability to integrate retrieved information into its reasoning steps. These gains held across all four MRAG task types (multi-choice question answering, information extraction, link prediction, and long-form question answering), demonstrating RAG’s broad applicability within medical contexts.
Results demonstrate that LLM performance is directly affected by the referential corpus used, the chosen retrieval approaches and models, and the prompting strategies employed. Extensive experimentation using the MRAG-Toolkit showed that LLM performance exhibits a log-linear relationship with model size, with larger LLMs benefiting more significantly from RAG integration. Specifically, the study observed that while RAG improves reasoning, medical knowledge, and overall usefulness, LLM responses can become slightly less readable when addressing long-form questions. The MRAG benchmark comprises a diverse set of tasks, with the composition visualized in Figure 1, reflecting the proportional size of each task’s test set.
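The log-linear claim can be read as performance growing roughly linearly in the logarithm of parameter count. A quick sketch of how such a fit might be checked is shown below; the (size, score) pairs are made up for illustration and are not results from the paper.

```python
# Check a log-linear size/performance trend: fit score ≈ a + b * ln(parameters).
import numpy as np

params_b = np.array([7, 14, 32, 72])           # model sizes in billions (assumed)
scores   = np.array([55.0, 60.5, 66.0, 71.5])  # hypothetical benchmark scores

b, a = np.polyfit(np.log(params_b), scores, deg=1)  # slope, intercept from least squares
print(f"score ≈ {a:.1f} + {b:.1f} * ln(size_in_B)")
```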
The researchers constructed a Chinese multi-choice question answering (MCQA) dataset containing 1,200 test samples for Traditional Chinese Medicine (TCM), named MRAG-TCM, providing a robust testbed for evaluating models in this specific area. For long-form question answering (LFQA), they utilised the Multi-MedQA dataset, containing 1,066 questions, and collected 1,253 user queries from an online medical consultation platform for the Chinese LFQA dataset, MRAG-CLFQA, ensuring data safety through expert panel review. In information extraction (IE), the study incorporated the DDI, ChemProt, and CMeIE tasks, covering drug-drug interactions, disease-drug-gene relationships, and a broader range of relation types, respectively. The link prediction (LP) task, meanwhile, directly assesses a model’s ability to establish connections between entities, making it well suited to evaluating LLMs.
The MRAG-Toolkit supports three retrieval approaches (sparse retrieval, semantic retrieval, and webpage search), alongside various retrieval algorithms, locally deployed or API-based LLMs, and diverse prompting strategies. This gives researchers a systematic framework for investigating how different components of a RAG system affect performance, offering valuable insights for both academic and industrial applications. The authors will release the MRAG-Bench dataset and toolkit under a CC BY 4.0 license upon acceptance, fostering further research and development in the field.
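A hedged sketch of the kind of component sweep such a toolkit enables is shown below; the configuration names and the `evaluate` stub are illustrative placeholders, not the MRAG-Toolkit's actual API.

```python
# Illustrative component sweep: cross retrieval approach x LLM x prompting
# strategy and score each configuration on a task. All names are placeholders.
from itertools import product

retrievers = ["sparse", "semantic", "web_search"]
llms       = ["local-7B", "local-72B", "api-gpt-4"]
prompts    = ["direct", "cot", "cot_refine"]

def evaluate(retriever: str, llm: str, prompt: str, task: str = "MCQA") -> float:
    """Placeholder: run the full RAG pipeline on the task's test set, return accuracy."""
    raise NotImplementedError  # plug in a real pipeline here

results = {}
for r, m, p in product(retrievers, llms, prompts):
    try:
        results[(r, m, p)] = evaluate(r, m, p)
    except NotImplementedError:
        results[(r, m, p)] = None  # not yet wired to a real pipeline
```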
MRAG-Bench evaluates medical LLM retrieval performance on complex tasks
Scientists have introduced the Medical Retrieval-Augmented Generation benchmark (MRAG-Bench) and accompanying MRAG-Toolkit to systematically evaluate and improve Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) in the medical domain. This new benchmark spans four task cohorts in both English and Chinese, offering a robust framework for assessing LLM-based RAG systems, and the toolkit supports diverse retrieval approaches, algorithms, and LLMs for detailed performance analysis. Experiments demonstrated that RAG significantly enhances LLM reliability across all MRAG tasks, indicating its potential to improve accuracy in medical question answering. The research revealed that the selection of the reference corpus, retrieval methods, and prompting strategies substantially impacts LLM performance, and larger LLMs benefit more from RAG integration.
However, the authors noted a trade-off, as LLM responses may become slightly less readable when answering longer, more complex questions, a nuance to consider when deploying these systems. While the study provides extensive experimentation, limitations exist, including the exclusion of several powerful language models due to resource constraints and a focus on a specific RAG workflow; future work will address these points with evaluations of more complex strategies and additional models. The MRAG-Bench and MRAG-Toolkit are intended to be valuable resources for the research community, encouraging further progress in RAG and its medical applications.
👉 More information
🗞 MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine
🧠 ArXiv: https://arxiv.org/abs/2601.16503
