On May 2, 2025, the study "Digital Pathway Curation (DPC): a comparative pipeline to assess the reproducibility, consensus and accuracy across Gemini, PubMed, and scientific reviewers in biomedical research" was published on arXiv. The research evaluates the performance of the Gemini large language model against traditional PubMed searches and human expert curation, demonstrating high reproducibility and accuracy in the analysis of complex biomedical data.
The study introduces a Digital Pathway Curation (DPC) pipeline to evaluate Gemini's reproducibility and accuracy against PubMed search and human expert curation. Using two omics experiments, the researchers created an Ensemble dataset to assess pathway-disease associations. Results show Gemini achieves 99% run-to-run reproducibility and 75% inter-model reproducibility. A smaller dataset was used to calculate a crowdsourced consensus (CSC), revealing a multi-consensus accuracy for Gemini of approximately 87%. These findings suggest that large language models like Gemini can serve as reliable tools for navigating complex biomedical knowledge, offering high reproducibility and accuracy in pathway-disease association analysis.
The article also discusses the reproducibility challenges faced by large language models (LLMs) such as GPT-3, a property that is crucial for applications in healthcare and law. Reproducibility issues fall into two categories: run-to-run (R2R) variability, where the same model gives different answers across repeated runs of the same task, and inter-model (IM) variability, where different models disagree on the same task. Across models, the study found R2R reproducibility as low as 30-50% and slightly better IM reproducibility (around 60%).
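The R2R idea above can be sketched as a simple agreement score: ask the same model the same question several times and measure how often the runs agree with the modal answer. This is a minimal illustration, not the paper's exact procedure; the function name and the example questions are hypothetical.

```python
from collections import Counter

def run_to_run_reproducibility(answers_per_question):
    """Average, over questions, of the fraction of repeated runs that
    agree with that question's most common (modal) answer.
    `answers_per_question` maps a question to the list of answers
    returned across repeated runs of the same model."""
    scores = []
    for answers in answers_per_question.values():
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / len(answers))
    return sum(scores) / len(scores)

# Illustrative only: two questions, four runs each.
runs = {
    "pathway_A linked to disease_X?": ["yes", "yes", "yes", "yes"],  # fully stable
    "pathway_B linked to disease_Y?": ["yes", "no", "yes", "yes"],   # one flip
}
print(run_to_run_reproducibility(runs))  # -> 0.875
```

The same function applied to answers from *different* models on the same questions would estimate the inter-model (IM) figure instead.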
To evaluate performance, the study used metrics such as F1-score, sensitivity, specificity, accuracy, precision, and false discovery rate (FDR), and also assessed failure modes such as hallucination and confabulation. Tools including GSEA (gene set enrichment analysis) for the omics data and MMC for model consensus were employed to gauge reliability.
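The metrics listed above all derive from a binary confusion matrix. The following sketch shows the standard definitions; the counts in the example are illustrative and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from raw confusion-matrix
    counts: true/false positives (tp, fp) and true/false negatives (tn, fn)."""
    sensitivity = tp / (tp + fn)                 # true positive rate (recall)
    specificity = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)                   # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    fdr = fp / (tp + fp)                         # false discovery rate = 1 - precision
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "fdr": fdr,
    }

# Illustrative counts only (not the study's data):
m = classification_metrics(tp=25, fp=15, tn=45, fn=15)
print({k: round(v, 3) for k, v in m.items()})
```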
Results indicated that while some tasks were answered consistently across models, others varied significantly, typically where the question was ambiguous or required subjective reasoning. Sensitivity was low (25%), indicating poor detection of true positives, while specificity (60-70%) showed better avoidance of false positives; overall reliability nonetheless remained limited.
The discussion proposes model consensus as a remedy: aggregating answers from multiple models to improve reliability. The conclusion advocates improved evaluation metrics, routine reporting of reproducibility data, and the use of model ensembles for critical tasks. It also notes the discontinuation of older models such as Google's Gemini 1.0 Pro, underscoring the need for continued model support.
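A minimal form of the consensus idea described above is a majority vote over model answers, reporting both the winning answer and the agreement level. This is a sketch of the general technique, not the study's specific MMC/CSC procedure.

```python
from collections import Counter

def model_consensus(votes):
    """Majority vote across model answers.
    Returns (consensus answer, fraction of models that agree with it).
    On a tie, Counter returns answers in first-seen order."""
    counts = Counter(votes)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(votes)

# Illustrative: four models asked whether a pathway relates to a disease.
answer, agreement = model_consensus(["yes", "yes", "no", "yes"])
print(answer, agreement)  # -> yes 0.75
```

A low agreement fraction flags questions where the ensemble is unreliable and human review (or a PubMed check, as in the study's pipeline) is warranted.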
In summary, LLMs show real potential but still face consistency challenges; addressing them through better methods and evaluation practices is essential for building trust in professional applications.
👉 More information
🗞 Digital Pathway Curation (DPC): a comparative pipeline to assess the reproducibility, consensus and accuracy across Gemini, PubMed, and scientific reviewers in biomedical research
🧠 DOI: https://doi.org/10.48550/arXiv.2505.01259
