Evaluating Large Language Models for Reproducibility and Accuracy in Biomedical Research

On May 2, 2025, a study titled "Digital Pathway Curation (DPC): a comparative pipeline to assess the reproducibility, consensus and accuracy across Gemini, PubMed, and scientific reviewers in biomedical research" was published. The research evaluates the performance of the Gemini large language model against traditional PubMed searches and human expert curation, demonstrating high reproducibility and accuracy in analyzing complex biomedical data.

The study introduces a Digital Pathway Curation (DPC) pipeline to evaluate Gemini’s reproducibility and accuracy against PubMed searches and human expert curation. Using two omics experiments, the researchers created an Ensemble dataset to assess pathway-disease associations. Results show that Gemini achieves 99% run-to-run reproducibility and 75% inter-model reproducibility. A smaller dataset was used to calculate a crowdsourced consensus (CSC), revealing a multi-consensus accuracy for Gemini of approximately 87%. These findings suggest that large language models like Gemini can serve as reliable tools for navigating complex biomedical knowledge, offering high reproducibility and accuracy in pathway-disease association analysis.

The article also discusses the reproducibility challenges faced by large language models (LLMs) such as GPT-3, a property that is crucial for applications in healthcare and law. Reproducibility issues fall into two categories: run-to-run (R2R) and inter-model (IM). R2R refers to variability within the same model across repeated runs on the same task, while IM refers to differences between distinct models on the same task. The study found low R2R reproducibility (30-50%) and slightly better IM reproducibility (around 60%).
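The R2R notion above can be made concrete: pose the same set of queries to the same model several times and count how often the answers are stable across runs. Below is a minimal sketch, assuming categorical (e.g. yes/no) answers aligned by query index; the function and data are illustrative, not the study's actual code.

```python
def run_to_run_reproducibility(runs):
    """Fraction of queries for which every repeated run returned
    the same answer. `runs` is a list of runs; each run is a list
    of answers, aligned by query index."""
    n_queries = len(runs[0])
    stable = sum(
        1 for i in range(n_queries)
        if len({run[i] for run in runs}) == 1  # all runs agree on query i
    )
    return stable / n_queries

# Three repeated runs over four yes/no queries (illustrative data):
runs = [
    ["yes", "no", "yes", "yes"],
    ["yes", "no", "yes", "no"],
    ["yes", "no", "yes", "yes"],
]
print(run_to_run_reproducibility(runs))  # 0.75: the fourth query is unstable
```

A stricter variant could score pairwise agreement between runs instead of requiring unanimity; the all-runs-agree definition used here is the simplest reading of "run-to-run" stability.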

To evaluate performance, metrics such as the F1-score, sensitivity, specificity, accuracy, precision, and false discovery rate (FDR) were used. Issues such as hallucination and confabulation were also assessed. Tools including GSEA (gene set enrichment analysis) for gene-level analysis and MMC for model consensus were employed to gauge reliability.
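All of these metrics derive from the four confusion-matrix counts. A minimal sketch with illustrative counts (not data from the study):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics named above from
    true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)              # true positive rate (recall)
    specificity = tn / (tn + fp)              # true negative rate
    precision = tp / (tp + fp)                # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    fdr = fp / (tp + fp)                      # false discovery rate = 1 - precision
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "fdr": fdr}

# Illustrative counts only:
m = classification_metrics(tp=25, fp=15, tn=65, fn=75)
print(m["sensitivity"], m["fdr"])
```

Note that FDR and precision are complementary (FDR = 1 − precision), so reporting both is redundant but common in biomedical work, where FDR is the conventional framing.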

Results indicated that while some tasks showed consistency across models, others varied significantly due to ambiguity or the need for subjective reasoning. Sensitivity was low (25%), indicating poor true-positive detection, while specificity (60-70%) showed better false-positive avoidance, though overall reliability remained limited.

The discussion suggests model consensus as a remedy: aggregating answers from multiple models to enhance reliability. The conclusion advocates improved evaluation metrics, routine reporting of reproducibility data, and the use of model ensembles for critical tasks. It also notes the discontinuation of older models such as Google’s Gemini 1.0 Pro, underscoring the need for continued model support.
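The consensus idea can be sketched as a simple per-query majority vote across models. The model names and answers below are hypothetical, and the study's MMC procedure may differ in detail; this is only the basic aggregation pattern.

```python
from collections import Counter

def multi_model_consensus(answers_by_model):
    """Majority-vote consensus across models, per query.
    `answers_by_model` maps model name -> list of answers, aligned
    by query index. Returns one consensus answer per query, or
    None where the vote is tied."""
    models = list(answers_by_model.values())
    consensus = []
    for answers in zip(*models):  # iterate query by query
        top = Counter(answers).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            consensus.append(None)  # tie: no consensus reached
        else:
            consensus.append(top[0][0])
    return consensus

# Hypothetical votes from three models over three queries:
votes = {
    "model_a": ["yes", "no", "yes"],
    "model_b": ["yes", "yes", "no"],
    "model_c": ["yes", "no", "no"],
}
print(multi_model_consensus(votes))  # ['yes', 'no', 'no']
```

Using an odd number of models avoids most ties; returning None on a tie, rather than guessing, mirrors the article's point that unresolved disagreement should be surfaced, not hidden.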

In summary, LLMs offer potential but face consistency challenges. Addressing these through enhanced methods and evaluation practices is essential for building trust in professional applications.

👉 More information
🗞 Digital Pathway Curation (DPC): a comparative pipeline to assess the reproducibility, consensus and accuracy across Gemini, PubMed, and scientific reviewers in biomedical research
🧠 DOI: https://doi.org/10.48550/arXiv.2505.01259

Quantum News

There is so much happening right now in the field of technology, whether AI or the march of robots. Adrian is an expert on how technology can be transformative, especially frontier technologies. But Quantum occupies a special space. Quite literally a special space. A Hilbert space, in fact, haha! Here I try to provide some of the breaking news in the Quantum Computing and Quantum tech space.

Latest Posts by Quantum News:

Multiverse Computing Launches Quantum Inspired HyperNova 60B 2602, 50% Compressed LLM, on Hugging Face

February 24, 2026
AWS Quantum Technologies Blog: New QGCA Outperforms Simulated Annealing on Complex Optimization Problems

February 23, 2026
AWS Quantum Technologies Releases Qiskit-Braket Provider v0.11, Now Compatible with Qiskit 2.0

February 23, 2026
AWS Quantum Technologies has released version 0.11 of the Qiskit-Braket provider on February 20, 2026, significantly enhancing how users access and utilize Amazon Braket’s quantum computing services through the popular Qiskit framework. This update introduces new “BraketEstimator” and “BraketSampler” primitives, mirroring Qiskit routines for improved performance and feature integration with Amazon Braket program sets. Importantly, the provider now fully supports Qiskit 2.0 while maintaining compatibility with versions as far back as v0.34.2, allowing users to “use a richer set of tools for executing quantum programs on Amazon Braket.” The release also unlocks flexible compilation features, enabling circuits to be compiled directly for Braket devices using the to_braket function, which accepts inputs from Qiskit, Braket, and OpenQASM 3.