Chronic rhinosinusitis (CRS) presents a significant challenge for clinicians determining appropriate surgical candidates, and new research investigates whether artificial intelligence can improve patient selection. Sayeed Shafayet Chowdhury, Snehasis Mukhopadhyay, and Shiaofen Fang from Purdue University, alongside Vijay R. Ramakrishnan from Indiana University School of Medicine and colleagues, have compared the predictive power of supervised machine learning with that of generative AI models, including ChatGPT, Claude, and Gemini, to forecast positive surgical outcomes based solely on pre-operative clinical data. This study is particularly noteworthy because it moves beyond image analysis to focus on prospective decision support, asking whether AI could have identified patients likely to experience poor outcomes and so helped them avoid unnecessary surgery. Their findings demonstrate superior performance from a calibrated machine learning model, suggesting an ‘ML-first, GenAI-augmented’ workflow could optimise surgical candidacy triage and enhance patient-clinician shared decision-making.
Predicting Sinus Surgery Success with Machine Learning
The research team focused on identifying, pre-operatively, those patients unlikely to experience a clinically meaningful improvement, defined as a reduction of at least 8.9 points in their SNOT-22 score at six months post-surgery, who might therefore benefit from avoiding surgery altogether. The core of this work involved constructing and analysing a prospectively collected cohort of CRS patients who all underwent surgery, allowing researchers to assess whether pre-operative clinical data alone could accurately predict poor outcomes. Each model (logistic regression, tree ensembles, an in-house multilayer perceptron (MLP), and the GenAI systems) received the same structured inputs and was constrained to provide binary recommendations with associated confidence levels. This shared setup allows a like-for-like comparison of predictive capability for surgical candidacy.
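The outcome definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating a reduction of exactly the threshold as a success is an assumption based on the "at least 8.9 points" wording.

```python
MCID_POINTS = 8.9  # SNOT-22 threshold quoted in the article


def achieved_mcid(pre_snot22, post_snot22_6mo):
    """True if the SNOT-22 score fell by at least the MCID.

    Lower SNOT-22 scores mean fewer symptoms, so improvement is
    measured as pre-operative minus post-operative score.
    """
    return (pre_snot22 - post_snot22_6mo) >= MCID_POINTS


def outcome_label(pre_snot22, post_snot22_6mo):
    """Binary target: 1 = clinically meaningful improvement, 0 = not."""
    return 1 if achieved_mcid(pre_snot22, post_snot22_6mo) else 0
```

The post-operative score is used only to build the label; it never appears among the model inputs.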
The researchers developed a reproducible tabular-to-GenAI evaluation protocol and conducted detailed subgroup analyses to check the robustness of their findings, providing a framework for responsible and transparent AI implementation in clinical practice. The work establishes a clinically grounded, reproducible comparison of ML and GenAI for pre-operative CRS decision support, emphasising not only accuracy but also calibration, net benefit, and responsible use. Ultimately, this research opens the door to more personalised and effective treatment strategies for CRS, potentially improving patient outcomes and reducing unnecessary surgical interventions.
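Calibration and net benefit, named above as evaluation criteria, can be made concrete with two standard formulas. The sketch below uses the Brier score as one common calibration-sensitive measure and the usual decision-curve definition of net benefit; the article does not say which exact metrics or thresholds the authors used, so the function names and example numbers are illustrative assumptions.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def net_benefit(y_true, probs, threshold):
    """Decision-curve net benefit of operating when p >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    Here the positive class is "benefits from surgery" (an assumption).
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy example with made-up numbers
y = [1, 1, 0, 0]
p = [0.9, 0.6, 0.4, 0.1]
model_nb = net_benefit(y, p, 0.5)
treat_all_nb = net_benefit(y, [1.0] * 4, 0.5)  # operate on everyone
```

Comparing a model's net benefit with the operate-on-everyone baseline is the usual way to ask whether its predictions would change decisions for the better at a given risk threshold.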
Data Harmonisation and Pre-processing for ESS Prediction
The researchers set out to predict surgical outcomes for patients with chronic rhinosinusitis (CRS), focusing on identifying those unlikely to experience clinically meaningful improvement following endoscopic sinus surgery (ESS). The study used a combined cohort of 524 surgical cases, formed by merging two previously collected datasets, one comprising 791 patients and another with 355, and restricting analysis to individuals who underwent ESS. The researchers pre-processed the data by removing post-operative variables to prevent leakage and harmonising variable names across both cohorts, giving a clean and consistent dataset for analysis. Categorical variables were deterministically encoded using fixed dictionaries, while continuous measures such as baseline SNOT-22 scores, CT Lund-Mackay scores, and endoscopy results were retained in their numeric form.
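A minimal sketch of the pre-processing steps described above follows. Every column name, rename rule, and code book here is hypothetical, since the article does not list the real variables; only the three ideas (harmonise names, drop post-operative fields, encode categories from fixed dictionaries) come from the text.

```python
# Hypothetical names: the article does not publish the actual schema.
RENAME = {
    "SNOT22_baseline": "snot22_pre",   # cohort A spelling (assumed)
    "snot_pre": "snot22_pre",          # cohort B spelling (assumed)
    "LundMackay": "ct_lund_mackay",
    "lm_ct": "ct_lund_mackay",
}
POST_OP_COLUMNS = {"snot22_6mo", "postop_endoscopy"}
ENCODINGS = {
    "sex": {"female": 0, "male": 1},
    "polyps": {"no": 0, "yes": 1},
}


def preprocess(record):
    """Harmonise names, drop post-operative fields, encode categoricals."""
    clean = {}
    for name, value in record.items():
        name = RENAME.get(name, name)
        if name in POST_OP_COLUMNS:
            continue  # leakage guard: outcome-side data never reaches the model
        if name in ENCODINGS:
            value = ENCODINGS[name][value]  # KeyError flags an unseen category
        clean[name] = value  # continuous measures pass through unchanged
    return clean
```

Because the dictionaries are fixed rather than learned from the data, the same raw value always maps to the same code, regardless of which cohort or fold a record falls in.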
To establish a robust methodology, the team implemented strict leakage checks, confirming the absence of post-operative variables in the feature set and ensuring all preprocessing steps were confined within cross-validation folds. A stratified 80/20 train-test split was then executed, preserving class prevalence, and class imbalance was addressed through class weighting for logistic regression and tree ensembles and through focal or weighted loss functions for the multilayer perceptron (MLP). All random seeds and preprocessing artifacts were versioned to guarantee reproducibility. The researchers then evaluated five supervised machine learning classifiers (logistic regression, support vector machines, naïve Bayes, random forests, and an in-house MLP) using identical pre-processing and class weighting schemes.
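The split and weighting steps can be illustrated with the standard library alone. This is a sketch under assumptions: the class counts are invented, the seed value is arbitrary, and the weighting formula is the common "balanced" heuristic (n_samples / (n_classes * class_count)), which the article does not confirm as the authors' choice.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class prevalence preserved."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}


# Made-up prevalence: 80 improvers, 20 non-improvers
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels)
weights = balanced_class_weights([labels[i] for i in train_idx])
```

Computing the weights from the training indices only, as here, keeps the test set from influencing any fitted quantity, which is the same principle as confining preprocessing to cross-validation folds.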
Notably, the MLP demonstrated superior performance, achieving 85% accuracy with well-calibrated probabilities and the best minority-class recall among the models tested, and was therefore selected as the primary supervised baseline for subsequent comparisons. The MLP architecture comprised a single hidden layer with 400 neurons. The study also introduced a tabular-to-GenAI evaluation protocol, benchmarking five large language models against the MLP: ChatGPT (GPT-5 Thinking), MedGPT, Gemini 2.5 Pro, Perplexity Sonar, and Claude Sonnet 4.5. Each model received the same structured pre-operative clinical inputs, with outputs constrained to binary recommendations regarding surgical candidacy, based on a minimal clinically important difference (MCID) threshold of 8.9 points on the SNOT-22 questionnaire at 6 months. This approach enabled a direct comparison of supervised machine learning and generative AI models in predicting surgical benefit, providing insight into their respective strengths and limitations. The team logged vendor details, model identifiers, and access dates for each LLM run to keep the GenAI inference process transparent and reproducible.
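One way to picture a tabular-to-GenAI protocol is as three small steps: serialise a patient row into a fixed prompt, enforce a strict output contract on the reply, and log run metadata. The sketch below is an assumption-laden illustration; the prompt wording, the JSON contract, and all function and field names are hypothetical, and no vendor API is called.

```python
import json


def build_prompt(features):
    """Serialise one patient's pre-operative features in a fixed order."""
    rows = "\n".join(f"- {name}: {features[name]}" for name in sorted(features))
    return (
        "Pre-operative data for a chronic rhinosinusitis patient:\n"
        f"{rows}\n"
        "Will SNOT-22 improve by at least 8.9 points 6 months after surgery?\n"
        'Reply with JSON only: {"recommend_surgery": true or false, "confidence": 0 to 1}'
    )


def parse_reply(reply):
    """Validate a model reply against the binary-plus-confidence contract."""
    data = json.loads(reply)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    recommend = data.get("recommend_surgery")
    confidence = data.get("confidence")
    if not isinstance(recommend, bool):
        raise ValueError("recommend_surgery must be true or false")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must lie in [0, 1]")
    return recommend, float(confidence)


def log_run(vendor, model_id, access_date, prompt, reply):
    """Bundle the metadata the article says was recorded for each run."""
    recommend, confidence = parse_reply(reply)
    return {"vendor": vendor, "model_id": model_id, "access_date": access_date,
            "prompt": prompt, "recommend_surgery": recommend,
            "confidence": confidence}
```

Sorting the feature names and rejecting any reply that breaks the contract means every model sees identical inputs and is scored on the same kind of output, which is what makes the comparison with the MLP direct.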
MLP Model Predicts CRS Surgery Success Accurately
Working from pre-operative clinical data alone, the models aimed to identify patients likely to experience a meaningful improvement, defined as more than an 8.9-point reduction in SNOT-22 scores at 6 months, the minimal clinically important difference (MCID). The team measured performance on a prospectively collected cohort of 524 patients who all underwent surgery. The results show that the MLP model was best able to triage surgical candidates, potentially identifying those who may not benefit from the procedure and thus avoiding unnecessary interventions. Notably, the GenAI models consistently underperformed in zero-shot settings, highlighting the importance of calibrated ML for primary clinical decision support.
The study also delivers a reproducible tabular-to-GenAI evaluation protocol, enabling standardised comparison of these technologies. The results suggest that a calibrated ML classifier for initial triage, augmented by GenAI as an explainer, offers a promising workflow for enhancing transparency and shared decision-making in CRS treatment. The data came from an NIH-funded, multicentre cohort, narrowed to 524 surgical cases after filtering for patients who underwent endoscopic sinus surgery. As described above, post-operative variables were removed to prevent leakage and features were harmonised across the two source cohorts, a 791-patient set and a 355-patient set.
The team retained routinely available pre-operative factors, including demographics, comorbidities, imaging scores, and baseline SNOT-22 scores, to build a robust predictive model. Analysis of the SNOT-22, a validated instrument for measuring quality of life in CRS patients, revealed its consistent importance as a predictor of post-operative gains.
MLP Outperforms GenAI in CRS Prediction
Researchers analysed prospectively collected data from patients undergoing surgery, investigating whether pre-operative clinical information could identify individuals unlikely to benefit, potentially avoiding unnecessary procedures. With the calibrated MLP outperforming the zero-shot GenAI models, the findings suggest GenAI’s strength lies in explaining decisions using clinically familiar language, rather than serving as the primary predictive tool. The authors acknowledge limitations including the retrospective nature of the study, its focus on a single clinical domain, and the use of a specific minimal clinically important difference (MCID) threshold, which introduces potential label noise. Future research should focus on prospective, multi-site validation, exploring selective prediction with abstention options, and developing hybrid models combining the strengths of calibrated tabular ML with LLM explainability.
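Selective prediction with abstention, mentioned above as a direction for future work, can be sketched as a simple rule on a calibrated probability. The thresholds, function name, and output strings below are purely illustrative assumptions and do not come from the study.

```python
def triage(p_benefit, lower=0.3, upper=0.7):
    """Map a calibrated probability of benefit to a triage suggestion.

    Predictions in the uncertain middle band are not forced into a
    yes/no answer; they are deferred to the clinician instead.
    """
    if not 0.0 <= p_benefit <= 1.0:
        raise ValueError("p_benefit must be a probability")
    if p_benefit >= upper:
        return "likely to benefit"
    if p_benefit <= lower:
        return "unlikely to benefit"
    return "abstain: refer for clinician review"
```

A rule like this only makes sense when the probabilities are well calibrated, which is one reason the article's emphasis on calibration matters for any abstention scheme.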
👉 More information
🗞 Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction
🧠 ArXiv: https://arxiv.org/abs/2601.13710
