Large Language Models hold considerable potential to assist medical professionals, but their effectiveness in complex clinical scenarios requires careful evaluation. Mengdi Chai from Harvard School of Public Health and Massachusetts General Hospital, along with Ali R. Zomorrodi from Massachusetts General Hospital, Harvard Medical School, and the Broad Institute of MIT and Harvard, and colleagues investigated whether refining the instructions given to these models, a technique known as prompt engineering, consistently improves their performance in clinical decision-making. The team assessed three leading LLMs, ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B, across a complete clinical workflow, from initial diagnosis to treatment recommendations, using realistic patient cases. Their research reveals that prompt engineering is not a universal solution: while it can enhance performance in specific areas, it may actually hinder accuracy in others, demonstrating that successful integration of LLMs into healthcare demands nuanced, task-specific strategies.
LLMs Evaluated Across the Full Clinical Reasoning Workflow
Large language models (LLMs) are being explored for their potential in medical decision support, but their utility in real-world clinical settings remains largely unknown. The researchers evaluated three leading models, ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B, across the entire clinical reasoning workflow of a typical patient encounter, using 36 clinical case studies. Performance was assessed on five sequential tasks: differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation, employing baseline prompting alongside a more sophisticated “MedPrompt” strategy. The team also investigated how incorporating retrieval-augmented generation (RAG) with a local medical knowledge base could address knowledge gaps, utilizing a vector database for efficient information retrieval. MedPrompt combines Chain-of-Thought reasoning, encouraging the model to explain its reasoning step by step, with dynamic few-shot learning, supplying worked examples to guide the model. Two variations of few-shot example selection were tested: KNN, which retrieves the examples most similar to the current case, and Random, which selects examples at random. Performance was measured as the percentage of overlap between each model’s answers and established medical knowledge, with accuracy assessed for the final diagnosis.
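To make the mechanics above concrete, here is a minimal, illustrative Python sketch of MedPrompt-style dynamic few-shot selection (KNN versus random) and of an overlap-based score. It is not the authors’ code: the function names, the prompt template, and the assumption that cases arrive as precomputed embeddings are all made up for illustration.

```python
import random
import numpy as np


def knn_examples(case_embedding, example_embeddings, examples, k=3):
    """Pick the k solved cases whose embeddings are most similar
    (by cosine similarity) to the current case."""
    sims = example_embeddings @ case_embedding / (
        np.linalg.norm(example_embeddings, axis=1)
        * np.linalg.norm(case_embedding)
    )
    top = np.argsort(sims)[::-1][:k]
    return [examples[i] for i in top]


def random_examples(examples, k=3, seed=0):
    """Baseline: pick k solved cases uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(examples, k)


def build_medprompt(case_text, shots):
    """Assemble a Chain-of-Thought prompt with the selected few-shot
    examples prepended to the new case."""
    parts = []
    for ex in shots:
        parts.append(
            f"Case: {ex['case']}\nReasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    parts.append(
        f"Case: {case_text}\n"
        "Think step by step, explain your reasoning, then give your answer."
    )
    return "\n".join(parts)


def overlap_score(model_items, reference_items):
    """Percentage of reference items (e.g. differential diagnoses or
    recommended tests) that also appear in the model's answer."""
    model_set = {x.strip().lower() for x in model_items}
    ref_set = {x.strip().lower() for x in reference_items}
    if not ref_set:
        return 0.0
    return 100.0 * len(model_set & ref_set) / len(ref_set)
```

In this framing, the KNN variant differs from the random baseline only in how the worked examples are chosen; the prompt template and the overlap scoring are held fixed across conditions.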
The models demonstrated near-perfect accuracy in establishing a final diagnosis but exhibited considerable variability elsewhere: performance was moderate for differential diagnosis, essential immediate steps, and treatment recommendations, highlighting the complexity of these early-stage clinical reasoning processes, and the ability to identify relevant diagnostic testing proved weakest of all. Performance also depended on the sampling temperature, with ChatGPT-4o performing optimally at a temperature of zero and Llama 3.3 70B achieving stronger results at its default temperature setting.
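For concreteness, the temperature comparison could be run as in the hypothetical sketch below, which assumes an OpenAI-compatible chat-completions client; the model name, the example case text, and the helper function are illustrative assumptions, not taken from the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, model: str, temperature: float | None = None) -> str:
    """Send one clinical-case prompt; temperature=None falls back to the
    provider's default sampling temperature."""
    kwargs = {"temperature": temperature} if temperature is not None else {}
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    return resp.choices[0].message.content


# Illustrative placeholder case, not one of the 36 study cases.
case_prompt = "A 54-year-old presents with chest pain. List the differential diagnoses."

deterministic = ask(case_prompt, model="gpt-4o", temperature=0.0)  # "zero temperature" run
default_run = ask(case_prompt, model="gpt-4o")                     # provider-default run
```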
Further investigation explored whether carefully crafted prompts, using a modified MedPrompt framework, could improve performance, particularly in areas where the models initially struggled. The results indicated that prompt engineering is not universally effective, and simply selecting few-shot examples that closely match the clinical case did not guarantee better results than random selection. This highlights the complexity of integrating these models into healthcare settings, suggesting that a tailored approach, matching the right model and prompting strategy to the specific clinical task, is crucial for success.
👉 More information
🗞 Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks
🧠 ArXiv: https://arxiv.org/abs/2512.22966
