Researchers are increasingly exploring how to make artificial intelligence more interpretable and useful for real-world health interventions. Shovito Barua Soumma, Asiful Arefeen, and Stephanie M. Carpenter of Arizona State University, together with Melanie Hingle (University of Arizona) and Hassan Ghasemzadeh, demonstrate a novel approach that uses large language models (LLMs) to generate ‘counterfactual explanations’: the smallest changes to a model’s inputs needed to produce a different predicted outcome. Their work, detailed in a new paper, assesses models such as GPT-4, BioMistral-7B, and LLaMA-3.1-8B, in both pretrained and fine-tuned configurations, on a clinical dataset, and finds that fine-tuned LLMs, especially LLaMA-3.1-8B, can produce highly plausible and clinically relevant interventions. Significantly, these LLM-generated counterfactuals not only offer interpretable insights but also substantially improve model performance when training data is limited, offering a flexible, model-agnostic pathway towards more robust and effective digital health technologies.
The study evaluates each LLM’s ability to identify minimal, actionable changes needed to alter a machine learning model’s prediction. Fine-tuned LLMs, notably LLaMA-3.1-8B, consistently produced counterfactuals (CFs) exhibiting up to 99% plausibility and 0.99 validity, alongside realistic and behaviourally modifiable feature adjustments. The work unveils a method that not only provides human-centric interpretability but also augments training data to enhance model performance, particularly in scenarios with limited labelled data, addressing limitations of traditional counterfactual methods, which often struggle with categorical coherence and clinically plausible modifications.
Specifically, the SenseCF framework fine-tunes an LLM to generate valid, representative counterfactual explanations and to supplement minority classes in imbalanced datasets, thereby improving model training and boosting predictive performance. As illustrated in the accompanying figures, classifiers experience marked declines in F1-score as training data is reduced, highlighting the vulnerability of standard models and motivating the need for synthetic augmentation via LLM-generated counterfactuals. This research represents a significant step towards AI systems capable of providing both accurate predictions and actionable insights in critical healthcare applications. Furthermore, the study offers a rigorous, quantitative comparison of GPT-4 against open-source LLMs in multimodal clinical settings. By addressing gaps in the current literature, including the lack of comprehensive evaluation on large clinical datasets and of standardized evaluation metrics, the work makes a valuable contribution to explainable AI and its application to digital health. The findings suggest that LLM-driven counterfactuals hold immense promise for creating more transparent, robust, and effective healthcare solutions.
LLM counterfactuals offer promise for clinical data evaluation
Experiments employed a rigorous methodology, beginning with training several classifiers (Support Vector Machines, Random Forests, XGBoost, and Neural Networks) on the AI-READI dataset to establish baseline performance under varying levels of data reduction. The team then generated counterfactual explanations with each LLM, prompting it to identify the minimal changes to input features that would alter the model’s prediction. To quantify intervention quality, they assessed plausibility and validity, achieving up to 99% plausibility and 0.99 validity with fine-tuned LLMs, particularly LLaMA-3.1-8B. Feature diversity was measured by analysing the range of adjusted features within the generated counterfactuals, ensuring realistic and behaviourally modifiable alterations.
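As a rough illustration of the prompting step, a counterfactual request to an instruction-tuned LLM might be assembled as sketched below; the feature names, values, and response format are illustrative assumptions rather than the authors’ exact prompt or the AI-READI schema.

```python
def build_cf_prompt(patient: dict, predicted_label: str, target_label: str) -> str:
    """Assemble a counterfactual-generation prompt for an instruction-tuned LLM.

    The feature set and wording are illustrative; the paper's actual prompt
    may differ.
    """
    feature_lines = "\n".join(f"- {name}: {value}" for name, value in patient.items())
    return (
        "A classifier predicts a patient's stress level from wearable-sensor features.\n"
        f"Current prediction: {predicted_label}.\n"
        f"Patient features:\n{feature_lines}\n\n"
        f"Propose the smallest, clinically plausible changes to these features that would "
        f"change the prediction to {target_label}. "
        "Only modify features the patient could realistically act on. "
        "Respond as JSON mapping feature names to new values."
    )

# Example usage with hypothetical feature values.
patient = {
    "deep_sleep_pct": 30.1,
    "rem_sleep_pct": 15.4,
    "glucose_mg_dl": 210.8,
    "avg_steps": 5950,
}
print(build_cf_prompt(patient, predicted_label="high stress", target_label="low stress"))
```

In practice the returned JSON would be parsed and checked against the trained classifier before the suggested changes are treated as a counterfactual.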
The research also pioneered a data augmentation technique, introducing LLM-generated CFs as synthetic training samples under controlled label-scarcity settings. Specifically, the team reduced the training data by 10%, 20%, 30%, 40%, 50%, 60%, and 70% to simulate realistic clinical scenarios where labelled data is limited. They then retrained the classifiers on the reduced data augmented with CFs, measuring how much of the lost F1-score performance was recovered. These findings highlight the vulnerability of standard models to label scarcity and motivate the need for principled synthetic augmentation via LLM-generated counterfactuals.
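The label-scarcity protocol and the recovery measurement can be sketched roughly as follows; this uses synthetic data, a single random-forest classifier, and placeholder counterfactual arrays as stand-ins, so it illustrates the evaluation loop rather than reproducing the authors’ pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: in the study these would be AI-READI features and labels.
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Placeholder counterfactuals: nudge some positive-class samples across the
# decision boundary and relabel them; in the study these come from the fine-tuned LLM.
pos = X_train[y_train == 1][:200].copy()
pos[:, 0] -= 2.0                       # illustrative edit only
X_cf, y_cf = pos, np.zeros(len(pos), dtype=int)

for reduction in (0.1, 0.3, 0.5, 0.7):
    keep = rng.random(len(X_train)) > reduction      # simulate label scarcity
    X_red, y_red = X_train[keep], y_train[keep]

    base = RandomForestClassifier(random_state=0).fit(X_red, y_red)
    f1_base = f1_score(y_test, base.predict(X_test))

    # Retrain on the reduced data augmented with counterfactual samples.
    X_aug = np.vstack([X_red, X_cf])
    y_aug = np.concatenate([y_red, y_cf])
    aug = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
    f1_aug = f1_score(y_test, aug.predict(X_test))

    print(f"reduction {int(reduction * 100)}%: F1 {f1_base:.3f} -> {f1_aug:.3f}")
```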
LLMs generate valid, plausible clinical counterfactuals
Fine-tuned LLMs, notably LLaMA-3.1-8B, produced CFs exhibiting high plausibility, reaching up to 99%, and strong validity, peaking at 0.99, alongside realistic and behaviourally modifiable feature adjustments. Specifically, in Scenario A (positive-class undersampling), fine-tuned LLaMA* achieved a remarkable 21.00% increase in accuracy, 20.00% in precision, 24.56% in recall, 22.41% in F1 score, and 25.37% in AUC relative to the reduced dataset. These gains demonstrate the power of CFs to mitigate performance drops caused by imbalanced data. The team measured sparsity as Sparsity = (1/‖CF‖) ∑_{X*_T ∈ CF} ∑_{i=1}^{d} 1(x*_{i,T} = x_{i,T}), the number of features left unchanged per counterfactual averaged over the counterfactual set, so that interventions remain compact and easy for users to understand.
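Assuming originals and their counterfactuals are stored as aligned numeric arrays, this sparsity definition translates directly into code; the snippet below is a sketch, not the authors’ implementation.

```python
import numpy as np

def sparsity(originals: np.ndarray, counterfactuals: np.ndarray, atol: float = 1e-8) -> float:
    """Average number of features left unchanged per counterfactual.

    Both arrays have shape (n_counterfactuals, n_features), with row i of
    `counterfactuals` generated from row i of `originals`. Higher values
    mean sparser (fewer-feature) interventions.
    """
    unchanged = np.isclose(originals, counterfactuals, atol=atol)  # 1(x*_i == x_i)
    return float(unchanged.sum(axis=1).mean())

# Toy example: each counterfactual alters two of four features, so sparsity is 2.0.
X = np.array([[30.1, 15.4, 210.8, 5.95],
              [28.0, 14.0, 190.0, 6.20]])
X_cf = np.array([[35.0, 20.0, 210.8, 5.95],
                 [28.0, 14.0, 180.0, 8.00]])
print(sparsity(X, X_cf))
```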
Results demonstrate that fine-tuned BioMistral-7B and LLaMA-3.1-8B significantly improved validity, sparsity, and distance compared with their pretrained counterparts, with gains of 20–40 percentage points in validity and reductions exceeding 50% in feature distance. A counterfactual intervention example illustrated how LLMs can propose clinically meaningful modifications for a high-stress patient, identifying low deep sleep (30.1%), moderate REM sleep (15.4%), elevated glucose (210.8 mg/dL), and low activity (5.95 steps) as key contributors to stress. The LLM suggested increasing deep sleep to 35% and REM sleep to 20%, alongside lowering blood glucose to 180 mg/dL, reflecting clinically actionable strategies. Table III shows that LLaMA* achieved near-perfect validity with minimal, clinically realistic modifications, while traditional methods often proposed unrealistic feature shifts. Feature diversity analysis, visualized via radar plots, highlighted that fine-tuned LLMs concentrated on highly actionable variables (average steps, glucose levels, and hyperglycemia frequency), factors readily modifiable through lifestyle or treatment adjustments.
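Validity and feature distance can be computed in the same spirit; the sketch below assumes a fitted scikit-learn-style classifier and paired original/counterfactual arrays, and should be read as an illustration rather than the paper’s exact metric definitions.

```python
import numpy as np

def validity(model, counterfactuals: np.ndarray, target_labels: np.ndarray) -> float:
    """Fraction of counterfactuals that actually flip the classifier to the intended label."""
    return float(np.mean(model.predict(counterfactuals) == target_labels))

def mean_feature_distance(originals: np.ndarray, counterfactuals: np.ndarray) -> float:
    """Average L1 distance between each original instance and its counterfactual."""
    return float(np.abs(originals - counterfactuals).sum(axis=1).mean())
```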
LLMs boost data robustness via counterfactuals, improving generalization
This research establishes that LLM-generated counterfactuals exhibit semantic coherence and clinical plausibility, and that, when used for data augmentation, they enhance downstream robustness, restoring an average of 20% in F1 score under severe label scarcity. Specifically, fine-tuned LLaMA and BioMistral models produced compact, actionable CFs that surpassed their pretrained counterparts and proved competitive with existing optimization methods. To the best of the authors’ knowledge, this represents the first systematic investigation of LLM-based CFs applied to sensor-driven data in both zero- and few-shot settings, opening a promising avenue for integrating generative AI into trustworthy, intervention-focused healthcare machine learning pipelines. The authors acknowledge limitations, including the potential for unrealistic feature changes, and suggest that future work could incorporate clinical knowledge graphs or causal structures into the fine-tuning process. Further research directions include extending the approach to multimodal data, such as raw sensor traces or clinical notes, and assessing the long-term impact of CF-based guidance on early intervention and patient outcomes.
👉 More information
🗞 Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation
🧠 ArXiv: https://arxiv.org/abs/2601.14590
