Researchers at the Georgia Institute of Technology have evaluated how accurately nine large language models (LLMs), including GPT-4o and Llama-3.1, detect and respond to reports of adverse drug reactions to psychiatric medications, a growing concern as people with limited access to mental healthcare increasingly turn to chatbots for guidance. Led by Munmun De Choudhury and Mohit Chandra, the study assessed LLM performance against established clinical benchmarks, using a dataset compiled from Reddit posts about medication side effects. Findings presented at the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics show that while LLMs can mimic empathetic communication, they struggle to identify nuanced adverse reactions and to provide actionable, clinically aligned advice, highlighting a critical gap in the safety and efficacy of these tools for vulnerable users. The research was funded by the National Science Foundation, the American Foundation for Suicide Prevention, and Microsoft.
AI Chatbots and Psychiatric Medication Reactions
Researchers at the Georgia Institute of Technology have developed a framework to evaluate how well large language models (LLMs) detect adverse drug reactions discussed in online conversations and how closely their advice aligns with that of human clinicians. The study focused on individuals self-reporting potential side effects from psychiatric medications, a context requiring a high degree of accuracy. Nine LLMs were evaluated, including general-purpose models such as GPT-4o and Llama-3.1 as well as models specifically trained on medical data. The evaluation drew on data from Reddit, a platform where users frequently discuss medication experiences, and the researchers collaborated with psychiatrists and psychiatry students to establish clinically accurate reference responses.
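The detection step can be illustrated with a minimal sketch. The prompt wording, the category list, the JSON output format, and the choice of GPT-4o below are assumptions made for illustration; they are not the study's actual prompts or labelling scheme.

```python
# Minimal sketch: asking a general-purpose LLM to flag and categorise a
# possible adverse drug reaction in a Reddit-style post. The prompt text,
# category list, and output format are illustrative assumptions only.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POST = (
    "Started a new antidepressant two weeks ago and now I can't sleep and my "
    "heart races at night. Is this normal or should I stop taking it?"
)

PROMPT = (
    "You are assisting with pharmacovigilance research.\n"
    "Read the post below and answer in JSON with two fields:\n"
    '  "adverse_reaction_detected": true or false\n'
    '  "reaction_category": one of ["sleep", "cardiac", "mood", "other", "none"]\n\n'
    f"Post: {POST}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
    response_format={"type": "json_object"},
)

print(json.loads(response.choices[0].message.content))
```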
LLM outputs were then compared against these expert responses across four criteria: expressed emotion and tone, readability, proposed harm-reduction strategies, and the actionability of those strategies. The researchers also measured how precisely each model detected adverse reactions and categorised the type of reaction caused by psychiatric medications. While the LLMs successfully mimicked the empathetic tone and polite language characteristic of human psychiatrists, they consistently fell short of providing concrete, actionable advice aligned with clinical best practice.
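Once expert reference labels exist, the quantitative side of this comparison reduces to standard classification metrics. The sketch below computes detection precision and category agreement from a handful of hypothetical expert and model labels; the records are invented for illustration and do not reflect the study's annotations.

```python
# Minimal sketch of the quantitative comparison: detection precision and
# category agreement between model outputs and expert reference labels.
# Each record: (expert_says_adverse, expert_category, model_says_adverse, model_category)
records = [
    (True,  "sleep",   True,  "sleep"),
    (True,  "cardiac", True,  "mood"),
    (False, "none",    True,  "other"),
    (True,  "mood",    False, "none"),
]

true_positives = sum(1 for e, _, m, _ in records if e and m)
predicted_positives = sum(1 for _, _, m, _ in records if m)
detection_precision = true_positives / predicted_positives if predicted_positives else 0.0

# Category agreement, measured only on posts where both sides flag a reaction.
flagged = [(ec, mc) for e, ec, m, mc in records if e and m]
category_accuracy = sum(1 for ec, mc in flagged if ec == mc) / len(flagged) if flagged else 0.0

print(f"Detection precision: {detection_precision:.2f}")  # 0.67 on this toy data
print(f"Category agreement:  {category_accuracy:.2f}")    # 0.50 on this toy data
```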
Findings indicate that LLMs struggle to comprehend the nuances of adverse drug reactions and to differentiate between side effect profiles. The models had difficulty translating their understanding of a reported reaction into effective harm-reduction strategies, highlighting a gap between conversational fluency and clinically sound guidance. Improving the actionability and personalisation of LLM advice is therefore crucial, particularly for communities with limited access to mental healthcare, where these tools may be a primary source of information.
Further development should focus on enhancing the models’ ability to accurately interpret subjective experiences and translate that understanding into appropriate, evidence-based recommendations. This will ensure these tools provide genuinely helpful support and do not inadvertently offer misleading or harmful advice to vulnerable individuals.
