Rare diseases affect approximately 300 million people worldwide and pose significant diagnostic challenges, largely because experienced physicians are scarce. Recent breakthroughs have highlighted the potential of Large Language Models (LLMs) in the clinical diagnosis of rare diseases. To evaluate LLMs’ capabilities, the researchers introduce RareBench, a pioneering benchmark assessing accuracy, efficiency, robustness, and interpretability. The framework leverages a rare disease knowledge graph to standardize evaluations, supporting the development of more accurate diagnostic models.
Can Large Language Models Serve as Rare Diseases Specialists?
The diagnosis of rare diseases is a significant challenge, with approximately 300 million people worldwide affected by these conditions. The complexity of differentiating among many rare diseases often leads to unsatisfactory clinical diagnosis rates, primarily due to a lack of experienced physicians. Recent news has highlighted the potential of Large Language Models (LLMs) in clinically diagnosing rare diseases, as seen in the case where ChatGPT correctly diagnosed a 4-year-old’s rare disease after 17 doctors failed.
To bridge this research gap and systematically evaluate the capabilities of LLMs on four critical dimensions within the realm of rare diseases, we introduce RareBench. This pioneering benchmark is designed to assess LLMs’ performance in terms of accuracy, efficiency, robustness, and interpretability. By leveraging a comprehensive rare disease knowledge graph synthesized from multiple knowledge bases, RareBench provides a standardized framework for evaluating LLMs’ diagnostic capabilities.
The development of RareBench is accompanied by the compilation of the largest open-source dataset of rare disease patients, establishing a benchmark for future studies in this domain. This dataset enables researchers to train and evaluate LLMs on large-scale, real-world cases, facilitating the creation of more accurate and reliable diagnostic models.
Dynamic Few-Shot Prompt Methodology: Enhancing LLMs’ Diagnostic Performance
To facilitate differential diagnosis of rare diseases, we develop a dynamic few-shot prompt methodology. This approach leverages a comprehensive rare disease knowledge graph, synthesized from multiple knowledge bases, to significantly enhance LLMs’ diagnostic performance. By selecting in-context examples tailored to each patient’s phenotype profile, the methodology enables LLMs to learn from a small number of labeled cases and generalize to unseen ones.
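To make the idea concrete, here is a minimal, hypothetical sketch of dynamic few-shot selection: the most phenotypically similar labeled cases are retrieved to seed the prompt. Plain set-overlap (Jaccard) similarity over phenotype terms stands in for the paper's knowledge-graph-based similarity, and all case data, term IDs, and function names are illustrative.

```python
# Sketch: dynamic few-shot example selection for rare-disease prompts.
# Each case is a set of phenotype (HPO-style) term IDs plus a diagnosis label.

def jaccard(a, b):
    """Set-overlap similarity between two phenotype profiles."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_examples(query_terms, case_bank, k=3):
    """Pick the k labeled cases most similar to the query patient."""
    ranked = sorted(case_bank,
                    key=lambda c: jaccard(query_terms, c["terms"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_terms, examples):
    """Assemble a few-shot prompt from the retrieved cases."""
    parts = ["Diagnose the rare disease from the phenotype terms."]
    for ex in examples:
        parts.append(f"Phenotypes: {', '.join(sorted(ex['terms']))}\n"
                     f"Diagnosis: {ex['diagnosis']}")
    parts.append(f"Phenotypes: {', '.join(sorted(query_terms))}\nDiagnosis:")
    return "\n\n".join(parts)

# Illustrative case bank (diagnoses are placeholders, not real annotations).
case_bank = [
    {"terms": {"HP:0001250", "HP:0001249", "HP:0002376"}, "diagnosis": "Rett syndrome"},
    {"terms": {"HP:0001263", "HP:0000365"}, "diagnosis": "Usher syndrome"},
    {"terms": {"HP:0001250", "HP:0001263"}, "diagnosis": "Dravet syndrome"},
]
query = {"HP:0001250", "HP:0002376"}
prompt = build_prompt(query, select_examples(query, case_bank, k=2))
print(prompt)
```

Because the examples are chosen per patient rather than fixed, the prompt adapts to each presentation; the paper's methodology additionally weights phenotype terms using the knowledge graph rather than treating them uniformly.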
Our experimental findings demonstrate the promising potential of integrating LLMs into the clinical diagnostic process for rare diseases: with dynamic few-shot prompts, LLMs achieve substantially higher diagnostic accuracy than with standard prompting, suggesting they could become a valuable aid to clinicians.
Comparative Study: GPT-4’s Diagnostic Capabilities vs. Specialist Physicians
To further evaluate these diagnostic capabilities, we conduct an exhaustive comparative study between GPT-4 and specialist physicians. Our results show that GPT-4’s accuracy in diagnosing rare diseases is competitive with that of the specialists, indicating that LLMs could be a valuable tool for supporting clinicians in the diagnosis and treatment of rare diseases.
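A comparison like this is typically scored with top-k accuracy over ranked differential-diagnosis lists: a case counts as correct if the gold diagnosis appears among the top k candidates. The sketch below shows the metric; the case data are invented for illustration, not taken from the study.

```python
# Sketch: top-k accuracy over ranked differential-diagnosis lists.

def top_k_accuracy(predictions, gold, k=1):
    """Fraction of cases whose gold diagnosis appears in the top-k predictions.

    predictions: list of ranked candidate-diagnosis lists, one per case.
    gold: list of gold-standard diagnoses, one per case.
    """
    hits = sum(1 for ranked, truth in zip(predictions, gold)
               if truth in ranked[:k])
    return hits / len(gold)

# Invented example: three cases, each with a ranked differential.
gold = ["Marfan syndrome", "Fabry disease", "Pompe disease"]
model_ranked = [
    ["Marfan syndrome", "Ehlers-Danlos syndrome"],
    ["Gaucher disease", "Fabry disease"],
    ["Pompe disease", "Danon disease"],
]

print(top_k_accuracy(model_ranked, gold, k=1))  # 2 of 3 gold labels ranked first
print(top_k_accuracy(model_ranked, gold, k=3))  # all 3 appear within the top 3
```

Reporting both top-1 and top-k (e.g. top-3 or top-10) accuracy reflects clinical practice, where a short ranked differential that contains the correct diagnosis is still useful to a physician.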
The findings of our study have significant implications for the development of AI-powered diagnostic tools in the field of rare diseases. By leveraging the capabilities of LLMs, we can create more accurate and reliable diagnostic models that can support clinicians in their diagnosis and treatment of rare diseases. This paves the way for exciting future research directions, including the integration of LLMs into clinical workflows and the development of personalized medicine approaches.
Conclusion
In conclusion, our study demonstrates the potential of Large Language Models (LLMs) in diagnosing rare diseases. RareBench offers a standardized framework for systematically evaluating LLMs’ diagnostic capabilities across four critical dimensions: accuracy, efficiency, robustness, and interpretability.
Our experimental findings underscore the promising potential of integrating LLMs into the clinical diagnostic process for rare diseases. Together with the dynamic few-shot prompt methodology and the largest open-source dataset of rare disease patients, RareBench gives researchers a solid foundation for developing more accurate and reliable diagnostic models.
As we move forward, it is essential to continue exploring the capabilities of LLMs in diagnosing rare diseases. By leveraging the strengths of these AI-powered tools, we can build diagnostic approaches that more effectively support clinicians, from integrating LLMs into clinical workflows to enabling personalized medicine.
Publication details: “RareBench: Can LLMs Serve as Rare Diseases Specialists?”
Publication Date: 2024-08-24
Authors: Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Li Wang, et al.
Source:
DOI: https://doi.org/10.1145/3637528.3671576
