Evaluating Large Language Models for Reproducibility and Accuracy in Biomedical Research

On May 2, 2025, a study titled “Digital Pathway Curation (DPC): a comparative pipeline to assess the reproducibility, consensus and accuracy across Gemini, PubMed, and scientific reviewers in biomedical research” was published. The research compares the Gemini large language model with traditional PubMed searches and human expert curation, and reports high reproducibility and accuracy in analyzing complex biomedical data.

The study introduces a Digital Pathway Curation (DPC) pipeline to evaluate Gemini’s reproducibility and accuracy against PubMed search and human expert curation. Using two omics experiments, the researchers created an Ensemble dataset to assess pathway-disease associations. Results show that Gemini achieves 99% run-to-run reproducibility and 75% inter-model reproducibility. A smaller dataset was used to calculate crowdsourced consensus (CSC), against which Gemini’s multi-consensus accuracy was approximately 87%. These findings suggest that large language models like Gemini can be reliable tools for navigating complex biomedical knowledge, with high reproducibility and accuracy in pathway-disease association analysis.

The article discusses the reproducibility challenges that large language models (LLMs) such as Gemini face, which matter most in high-stakes fields such as healthcare and law. Reproducibility issues fall into two categories: run-to-run (R2R) and inter-model (IM) inconsistencies. R2R refers to variability within the same model across different runs, while IM concerns differences between models on the same task. For Gemini, the study reports R2R reproducibility of about 99% and inter-model reproducibility of about 75%.
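The two kinds of reproducibility can be illustrated with a small Python sketch. It assumes that reproducibility is measured as simple percent agreement between yes/no answers to the same questions; the study's exact formula is not given here, and the function names and sample answers are hypothetical.

```python
from itertools import combinations


def agreement(a, b):
    """Fraction of items on which two answer lists give the same label."""
    if len(a) != len(b):
        raise ValueError("answer lists must have the same length")
    if not a:
        raise ValueError("answer lists must not be empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(runs):
    """Average agreement over every pair of answer lists."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two answer lists")
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical yes/no answers to the same four pathway-disease questions.
model_a_runs = [
    ["yes", "no", "yes", "no"],
    ["yes", "no", "yes", "no"],
    ["yes", "no", "yes", "yes"],
]
model_b_run = ["yes", "yes", "yes", "no"]

# Run-to-run (R2R): repeated runs of the same model compared with each other.
r2r = mean_pairwise_agreement(model_a_runs)

# Inter-model (IM): one run of model A compared with one run of model B.
im = agreement(model_a_runs[0], model_b_run)

print(f"R2R agreement: {r2r:.2f}")
print(f"IM agreement:  {im:.2f}")
```

Percent agreement is the simplest possible choice; a chance-corrected statistic could be substituted in `agreement` without changing the rest of the sketch.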

To evaluate performance, the study used metrics including F1-score, sensitivity, specificity, accuracy, precision, and false discovery rate (FDR). Issues such as hallucination and confabulation were also assessed. Tools including gene set enrichment analysis (GSEA) and MMC for model consensus were used to gauge reliability.
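For reference, the metrics listed above all derive from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the counts are made up for illustration and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "fdr": ratio(fp, tp + fp),  # false discovery rate = 1 - precision
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


# Hypothetical counts: model answers scored against a reference consensus.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```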

Results indicated that some tasks were answered consistently across models, while others varied significantly because they were ambiguous or required subjective reasoning. Sensitivity was low (25%), meaning many true positives were missed, while specificity (60-70%) indicated better avoidance of false positives. Overall reliability remained limited.

The discussion proposes model consensus as a solution: aggregating answers from multiple models to improve reliability. The conclusion advocates improved evaluation metrics, routine reporting of reproducibility data, and the use of model ensembles for critical tasks. It also notes that older models such as Google’s Gemini 1.0-pro have been discontinued, which underlines the need for continued support if results are to remain reproducible.
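One simple way to aggregate answers from several models is a majority vote. The sketch below assumes that rule, which may differ from the consensus procedure actually used in the study; the question keys and votes are hypothetical.

```python
from collections import Counter


def consensus(answers):
    """Return the majority label, or None when the top labels tie."""
    counts = Counter(answers).most_common()
    if not counts:
        raise ValueError("need at least one answer")
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no clear majority
    return counts[0][0]


# Hypothetical answers from several models (or runs) to the same questions.
votes_by_question = {
    "pathway_1": ["yes", "yes", "no"],
    "pathway_2": ["no", "yes"],
}

consensus_by_question = {
    question: consensus(votes) for question, votes in votes_by_question.items()
}
print(consensus_by_question)
```

Returning `None` on a tie keeps unresolved questions visible, so they can be passed to a human reviewer rather than settled arbitrarily.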

In summary, LLMs offer potential but face consistency challenges. Addressing these through enhanced methods and evaluation practices is essential for building trust in professional applications.

👉 More information
🗞 Digital Pathway Curation (DPC): a comparative pipeline to assess the reproducibility, consensus and accuracy across Gemini, PubMed, and scientific reviewers in biomedical research
🧠 DOI: https://doi.org/10.48550/arXiv.2505.01259

Quantum News
