For over a century, the New England Journal of Medicine’s Clinicopathological Conferences have challenged medical professionals to refine their diagnostic reasoning, and now, increasingly, they are being used to assess the capabilities of artificial intelligence. Thomas A. Buckley and Riccardo Conci, both from Harvard Medical School, alongside Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, and Bita Behrouzi from Beth Israel Deaconess Medical Center, have developed a new benchmark, CPC-Bench, using a vast archive of these cases to rigorously evaluate large language models. Their work demonstrates that these AI systems now surpass physicians in complex text-based diagnosis and can convincingly mimic the presentation style of medical experts, as evidenced by their AI discussant, Dr. CaBot. While challenges remain in areas like image interpretation and literature review, this research represents a significant step towards transparently tracking progress in medical AI and promises to accelerate the development of increasingly sophisticated diagnostic tools.
Rigorous LLM Diagnostic Reasoning Evaluation
This research details the creation and validation of CPC-Bench, a new medical benchmark designed to rigorously evaluate the reasoning abilities of large language models (LLMs) in complex diagnostic scenarios. Existing medical benchmarks often focus on simple factual recall, lacking the complexity of real-world clinical cases and failing to assess how a model arrives at a diagnosis. The new benchmark addresses this gap by evaluating LLMs on tasks that require integrating information from multiple sources (patient history, physical examination, laboratory results, and imaging) and handling ambiguous data. While standard metrics such as accuracy and precision are used, the evaluation also emphasizes the quality of the reasoning process itself. Current LLMs demonstrate limitations in these complex reasoning tasks, often performing well on simpler questions but struggling with ambiguity and nuanced information. Techniques like chain-of-thought prompting, which encourages the model to explain its reasoning step by step, can improve performance.
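As a concrete illustration of the prompting technique mentioned above, the sketch below builds a chain-of-thought-style diagnostic prompt and sends it to a chat-completion model. The model name, prompt wording, and API client are placeholder assumptions for this sketch and are not taken from the paper.

```python
# Minimal sketch of chain-of-thought-style prompting for diagnosis.
# The prompt wording and model name are illustrative placeholders,
# not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_cot_prompt(case_text: str) -> str:
    """Ask the model to reason step by step before committing to a differential."""
    return (
        "You are an expert diagnostician reading a clinicopathological conference case.\n"
        "Think step by step: summarize the key findings, weigh competing explanations,\n"
        "then output a ranked differential diagnosis of up to 10 conditions.\n\n"
        f"Case:\n{case_text}"
    )

def diagnose(case_text: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_cot_prompt(case_text)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(diagnose("A 54-year-old presents with fever, weight loss, and a new murmur..."))
```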
Fine-tuning LLMs on medical data further enhances their capabilities, and open-source models are improving rapidly, offering a viable alternative for research and development. Related benchmarks such as LongHealth address question answering over long clinical documents. By encouraging community collaboration and addressing potential biases in medical data, the researchers aim to accelerate progress in medical AI. Central to this work is Dr. CaBot, an AI system designed to emulate expert medical discussants. The researchers collected 7102 CPCs published between 1923 and 2025, alongside 1021 Image Challenge cases, and enlisted ten physicians to annotate key clinical events within these cases. This detailed annotation captured distinct clinical touchpoints and categorized word-level event types, providing a granular view of each case's progression.
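To make the annotation structure more tangible, here is a minimal sketch of how an annotated case might be represented in code. The class and field names are assumptions for illustration and do not reproduce CPC-Bench's actual schema.

```python
# Illustrative data model for an annotated CPC case. Class and field names are
# assumptions for this sketch; they do not reproduce CPC-Bench's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClinicalEvent:
    """A word-level span tagged with an event type (e.g. symptom, lab, imaging)."""
    start: int          # character offset where the span begins
    end: int            # character offset where the span ends (exclusive)
    event_type: str     # annotated category, e.g. "history", "lab_result"

@dataclass
class Touchpoint:
    """One clinical touchpoint in the case narrative (e.g. presentation, workup)."""
    label: str
    text: str
    events: List[ClinicalEvent] = field(default_factory=list)

@dataclass
class CPCCase:
    case_id: str
    year: int
    touchpoints: List[Touchpoint]
    final_diagnosis: str

# Toy example: a single touchpoint with one annotated symptom span.
example = CPCCase(
    case_id="cpc-0001",
    year=2024,
    touchpoints=[
        Touchpoint(
            label="presentation",
            text="A 61-year-old woman presented with progressive dyspnea.",
            events=[ClinicalEvent(start=35, end=54, event_type="symptom")],
        )
    ],
    final_diagnosis="pulmonary arterial hypertension",
)
```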
Evaluations using CPC-Bench revealed that one large language model achieved a first-ranked final diagnosis in 60% of 377 contemporary cases and placed it within the top ten in 84%, surpassing a panel of twenty physicians. The system also selected an appropriate next test with 98% accuracy, highlighting its diagnostic reasoning capabilities. While the model excelled at text-based differential diagnosis, its performance was lower on image interpretation and literature retrieval; it and another leading system reached 67% accuracy on image challenges. In blinded comparisons, physicians misidentified CaBot-generated differential diagnoses as human-written in 74% of trials and consistently scored CaBot’s presentations favorably across quality dimensions.
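The top-1 (60%) and top-10 (84%) figures are top-k accuracies over ranked differentials. The sketch below shows one way such scores could be computed, using normalized exact-string matching as a simplifying assumption; in practice, deciding whether a listed diagnosis matches the CPC’s final diagnosis generally requires expert or model-based adjudication.

```python
# Sketch of top-k scoring for ranked differential diagnoses. Exact string
# matching is a simplification of how matches would be adjudicated in practice.
from typing import List

def _normalize(dx: str) -> str:
    return " ".join(dx.lower().split())

def top_k_accuracy(ranked_ddx: List[List[str]], truths: List[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the top k of the ranked list."""
    hits = 0
    for ddx, truth in zip(ranked_ddx, truths):
        candidates = {_normalize(d) for d in ddx[:k]}
        if _normalize(truth) in candidates:
            hits += 1
    return hits / len(truths)

# Toy example with two cases: top-1 = 0.5, top-3 = 1.0.
predictions = [
    ["giant cell arteritis", "polymyalgia rheumatica", "takayasu arteritis"],
    ["sarcoidosis", "tuberculosis", "lymphoma"],
]
answers = ["giant cell arteritis", "lymphoma"]
print(top_k_accuracy(predictions, answers, k=1))   # 0.5
print(top_k_accuracy(predictions, answers, k=3))   # 1.0
```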
These results demonstrate that LLMs can convincingly emulate expert medical presentations and exceed physician performance in complex text-based reasoning, while still presenting challenges in visual and information-gathering tasks. The release of both CPC-Bench and Dr. CaBot aims to facilitate transparent tracking of progress in medical AI and catalyze further development in this rapidly evolving field.
LLM Reasoning Surpasses Clinicians in Diagnosis
Large language models now demonstrate remarkable capabilities in complex medical reasoning, exceeding physician performance on text-based differential diagnosis and convincingly emulating the presentation style of expert clinicians. When challenged with contemporary cases, one model achieved a first-ranked final diagnosis in 60% of instances and appeared within the top ten in 84%, surpassing a panel of physicians. The system also selected an appropriate next diagnostic test in 98% of cases.
However, the research also identifies areas for improvement, notably image interpretation and literature retrieval, where performance remains lower than on text-based tasks. The model achieved 67% accuracy on image challenges, and further development is needed to fully integrate visual information into the diagnostic process. In blinded comparisons, physicians frequently misclassified text generated by the AI as having been authored by a human expert, and often rated the AI’s presentations more favorably, demonstrating the system’s ability to convincingly emulate expert communication. The researchers acknowledge that evaluations based solely on final diagnosis may overstate real-world capabilities, and they emphasize the importance of assessing reasoning at each step of the diagnostic process. To support continued progress in medical AI, the team has released both the CPC-Bench benchmark and the Dr. CaBot AI system for wider research use, enabling transparent tracking of future advances.
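For readers curious how such a blinded comparison might be tallied, here is a minimal sketch under assumed record names; it is not the paper’s evaluation code, and the trial layout and rating scale are illustrative assumptions.

```python
# Sketch of scoring a blinded comparison in which physicians guess whether a
# differential was written by the human discussant or by the AI.
from dataclasses import dataclass
from typing import List

@dataclass
class BlindedTrial:
    true_author: str     # "ai" or "human"
    judged_author: str   # reviewer's guess: "ai" or "human"
    quality_score: int   # illustrative 1-5 rating of the presentation

def misclassification_rate(trials: List[BlindedTrial]) -> float:
    """Fraction of AI-authored items that reviewers attributed to a human."""
    ai_trials = [t for t in trials if t.true_author == "ai"]
    fooled = sum(1 for t in ai_trials if t.judged_author == "human")
    return fooled / len(ai_trials) if ai_trials else 0.0

# Toy sample: two of three AI-authored items were judged human-written.
trials = [
    BlindedTrial("ai", "human", 5),
    BlindedTrial("ai", "human", 4),
    BlindedTrial("ai", "ai", 4),
    BlindedTrial("human", "human", 3),
]
print(misclassification_rate(trials))  # ~0.67 for this toy sample
```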
👉 More information
🗞 Advancing Medical Artificial Intelligence Using a Century of Cases
🧠 ArXiv: https://arxiv.org/abs/2509.12194
