Medical LLMs Show Bias with Varied Clinical Input, Study Reveals

MedPerturb, a dataset of 800 clinical cases with varied inputs, reveals differences in responses between large language models and human clinicians. Language models demonstrate greater sensitivity to alterations in gender and phrasing, while humans are more affected by changes in format, such as clinical summaries. This highlights the need for robust evaluation of medical language models under realistic conditions.

The reliability of artificial intelligence in healthcare hinges on its capacity to maintain consistent diagnostic and treatment recommendations despite subtle variations in patient presentation or in how information is conveyed. Researchers are now examining how large language models (LLMs), increasingly employed as clinical decision support tools, respond to non-content perturbations (alterations to phrasing, style, or format) compared with human clinicians. A new study, detailed in the article ‘The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making’, introduces a dataset designed to systematically assess this sensitivity. Abinitha Gourabathina, Yuexing Hao, Walter Gerych, and Marzyeh Ghassemi, all affiliated with the Massachusetts Institute of Technology and with additional contributions from Cornell University, present MedPerturb, a collection of 800 clinical scenarios deliberately modified along axes of gender representation, stylistic nuance, and presentational format. Their analysis reveals a divergence in response patterns: LLMs exhibit greater sensitivity to alterations in gender and style, while human experts are more sensitive to changes in format, such as clinical summaries generated by the models themselves.

Robust evaluation of LLMs in healthcare must reflect the complexities of real-world clinical scenarios, yet static benchmarks often fail to capture the variability of medical data. MedPerturb addresses this limitation: each of its 800 clinical vignettes is systematically perturbed across three dimensions: gender presentation, stylistic phrasing, and input format. This simulates the inconsistencies commonly encountered in clinical practice and provides a more realistic assessment of how artificial intelligence and human clinicians process the same information.

The construction of MedPerturb actively introduces gender modifications, including gender-swapping and removal, to assess potential biases within LLMs and ensure equitable healthcare recommendations. Simultaneously, stylistic variations, such as uncertain phrasing and colloquial language, mimic the nuances of patient communication, challenging LLMs to interpret information accurately despite variations in expression. Crucially, the dataset incorporates format changes, specifically utilising LLM-generated multi-turn conversations and summaries, to evaluate performance across different input modalities and assess the ability of LLMs to synthesise information effectively. Researchers supplement this comprehensive dataset with outputs from four distinct LLMs and readings from three human expert clinicians for each clinical context, enabling direct comparison between artificial and human reasoning processes. A multi-turn conversation, in this context, refers to an exchange of dialogue between a patient and a clinician, allowing for clarification and elaboration of information.
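
To make the three perturbation axes concrete, the following Python sketch illustrates how such a pipeline might be organised. It is a minimal illustration, not the authors' implementation: the `Vignette` structure, the function names, the prompt wording, and the word-level swaps are all hypothetical simplifications.

```python
# Illustrative sketch of perturbations in the spirit of MedPerturb.
# All names, prompts, and word lists are hypothetical simplifications;
# the study's actual pipeline may differ substantially.
from dataclasses import dataclass, replace

@dataclass
class Vignette:
    case_id: str
    text: str            # clinical vignette text
    patient_gender: str  # "male" or "female" in this toy example

def swap_gender(v: Vignette) -> Vignette:
    """Gender perturbation: naive word-level swap of gendered terms."""
    swaps = {"he": "she", "she": "he", "him": "her", "his": "her",
             "her": "his", "man": "woman", "woman": "man",
             "male": "female", "female": "male"}
    words = [swaps.get(w.lower(), w) for w in v.text.split()]
    flipped = "female" if v.patient_gender == "male" else "male"
    return replace(v, text=" ".join(words), patient_gender=flipped)

def add_uncertain_phrasing(v: Vignette) -> Vignette:
    """Style perturbation: hedge the patient's complaint (toy substitution)."""
    return replace(v, text=v.text.replace("reports", "thinks they might have"))

def to_summary(v: Vignette, llm) -> Vignette:
    """Format perturbation: ask an LLM to condense the vignette to a summary."""
    prompt = f"Summarise this clinical case in three sentences:\n{v.text}"
    return replace(v, text=llm(prompt))
```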

Analysis utilising MedPerturb reveals significant discrepancies in how LLMs and human clinicians respond to these perturbations. LLMs exhibit greater sensitivity to alterations in gender and stylistic cues than human experts, while human annotators are more sensitive to changes in format, particularly when presented with LLM-generated summaries or multi-turn conversations. This suggests that LLMs may be swayed by superficial variations in language or presentation, whereas human clinicians are better equipped to discern the underlying clinical meaning even when information arrives in a less structured or conventional form.
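
One simple way to quantify this kind of sensitivity is a flip rate: the fraction of paired cases in which a reader's treatment recommendation changes when only a non-content attribute of the input changes. The sketch below is a hypothetical illustration of that idea, not the metric used in the paper.

```python
# Hypothetical flip-rate metric: how often does a recommendation change
# between an original vignette and its perturbed counterpart?
def flip_rate(original: dict, perturbed: dict) -> float:
    """original/perturbed map case_id -> recommendation (e.g. 'refer', 'manage')."""
    shared = original.keys() & perturbed.keys()
    if not shared:
        return 0.0
    flips = sum(original[c] != perturbed[c] for c in shared)
    return flips / len(shared)

# Example: a reader that changes its answer on 2 of 4 gender-swapped cases.
llm_orig = {"c1": "refer", "c2": "manage", "c3": "refer", "c4": "manage"}
llm_pert = {"c1": "manage", "c2": "manage", "c3": "manage", "c4": "manage"}
print(flip_rate(llm_orig, llm_pert))  # 0.5 -> high sensitivity to this axis
```

Computed separately for each perturbation axis and each reader, whether LLM or clinician, such a measure would express the study's headline finding as a direct comparison of flip rates.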

Researchers advocate for more comprehensive and realistic assessment frameworks: evaluation methodologies should move beyond performance on fixed datasets and instead model responses to dynamic, perturbed inputs. MedPerturb represents a valuable resource for the research community, and the authors support reproducibility by releasing the dataset alongside the LLM outputs and human clinician readings, enabling validation and extension of their findings.

Future work should expand the scope of perturbations within the MedPerturb framework, investigating variations in patient demographics, socioeconomic status, and medical history to build a more comprehensive picture of potential biases and vulnerabilities in LLM performance. The researchers also intend to integrate MedPerturb with other existing clinical datasets, improving its generalisability across diverse medical specialties and its relevance to a wider range of clinical scenarios.

An important avenue for future research involves developing methods to mitigate the observed sensitivities of LLMs to gender and stylistic cues, exploring techniques such as adversarial training or data augmentation to potentially improve the robustness of these models and reduce the risk of biased or inequitable healthcare recommendations. Adversarial training involves exposing the model to intentionally misleading inputs to improve its resilience, while data augmentation involves creating synthetic data to increase the diversity of the training set. Simultaneously, researchers will focus on enhancing the ability of LLMs to accurately interpret and process complex clinical summaries and multi-turn conversations, thereby bridging the gap in performance observed between AI and human clinicians and ensuring that AI-generated recommendations are aligned with human clinical reasoning.
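
As a concrete illustration of the data-augmentation idea, one could pair each training case with a gender-swapped copy that keeps the clinical label unchanged, discouraging a model from conditioning its recommendation on gender cues alone. The sketch below is a hypothetical example of this technique; the swap function and the `(text, label)` training format are assumptions, not details from the study.

```python
# Hypothetical gender-swap augmentation: duplicate each training example with
# gendered terms flipped while keeping the clinical label unchanged, so the
# model is discouraged from basing its recommendation on gender cues alone.
def augment_with_gender_swaps(examples, swap_fn):
    """examples: list of (text, label); swap_fn: str -> gender-swapped str."""
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        augmented.append((swap_fn(text), label))  # same label, swapped gender
    return augmented

train = [("He reports chest pain radiating to his left arm.", "refer")]
swap = lambda t: t.replace("He", "She").replace("his", "her")
print(augment_with_gender_swaps(train, swap))
```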

Researchers plan to extend the evaluation framework to encompass a wider range of clinical decision-making tasks, including diagnosis, treatment planning, and prognosis, to provide a more holistic assessment of LLM capabilities. This will require the development of novel metrics that capture the clinical relevance and safety of AI-generated recommendations, ensuring that these models are deployed responsibly and ethically within healthcare settings and ultimately improving patient outcomes. The ongoing development and refinement of evaluation methodologies will be crucial for realising the full potential of LLMs in healthcare.

👉 More information
🗞 The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making
🧠 DOI: https://doi.org/10.48550/arXiv.2506.17163

Quantum News

As the Official Quantum Dog (or hound), my role is to dig out the latest nuggets of quantum goodness. There is so much happening right now in the field of technology, whether AI or the march of robots. But Quantum occupies a special space. Quite literally a special space. A Hilbert space in fact, haha! Here I try to provide some of the news that might be considered breaking news in the Quantum Computing space.

Latest Posts by Quantum News:

IBM Remembers Lou Gerstner, CEO Who Reshaped Company in the 1990s

December 29, 2025
Optical Tweezers Scale to 6,100 Qubits with 99.99% Imaging Survival

December 28, 2025
Rosatom & Moscow State University Develop 72-Qubit Quantum Computer Prototype

December 27, 2025