Machine Learning Achieves Top Performance on 4 Medical Datasets, Study Finds

Medical classification benefits greatly from advances in artificial intelligence, but determining the best approach remains challenging. Researchers Meet Raval from the University of Southern California, Tejul Pandit from Palo Alto Networks, and Dhvani Upadhyay from Dhirubhai Ambani University systematically evaluated whether contemporary foundation models truly outperform traditional machine learning techniques across both text- and image-based medical datasets. Their rigorous benchmark, utilising four public datasets, reveals a surprising finding: classical machine learning models consistently achieved the best overall performance, particularly on structured text, exceeding that of both zero-shot Large Language Models (LLMs) and Parameter-Efficient Fine-Tuned (PEFT) models. This work demonstrates that foundation models are not a universal solution and highlights the critical importance of adaptation strategies when fine-tuning these powerful, yet potentially unreliable, tools for vital medical applications.

Medical Classification Benchmarking of ML and Transformers reveals a clear winner

Scientists have demonstrated a rigorous, unified benchmark for medical classification, contrasting traditional machine learning with contemporary transformer-based techniques. This work addresses a critical gap in current research by systematically evaluating model performance across both text and image modalities, utilising four publicly available datasets of varying complexity, spanning binary and multiclass categorisation, to provide empirical clarity. Researchers evaluated three distinct model classes for each task: classical ML models, including Logistic Regression, LightGBM, and ResNet-50; prompt-based LLMs/VLMs leveraging Gemini 2.5; and Parameter-Efficient Fine-Tuned (PEFT) models based on LoRA-adapted Gemma3 variants. All experiments adhered to consistent data splits and aligned metrics, ensuring a fair and comparable assessment of each approach.
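For readers who want a concrete picture of the classical baseline, the sketch below shows what a structured-text pipeline of this kind typically looks like, assuming a TF-IDF representation, a held-out stratified split, and accuracy/macro-F1 scoring; the file name, column names, and hyperparameters are illustrative rather than taken from the paper.

```python
# Minimal sketch of a classical text-classification baseline (illustrative only;
# the dataset file, columns, split ratio, and seed are assumptions, not the
# paper's exact setup).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from lightgbm import LGBMClassifier

df = pd.read_csv("medical_text.csv")          # hypothetical file with "text" and "label" columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

for name, clf in [("LogisticRegression", LogisticRegression(max_iter=1000)),
                  ("LightGBM", LGBMClassifier())]:
    clf.fit(Xtr, y_train)
    pred = clf.predict(Xte)
    print(name,
          "acc =", round(accuracy_score(y_test, pred), 3),
          "macro-F1 =", round(f1_score(y_test, pred, average="macro"), 3))
```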
The study reveals that traditional machine learning models consistently achieved the best overall performance across most medical categorisation tasks, establishing a high standard for accuracy and reliability. This was particularly evident with structured text-based datasets, where classical models excelled due to their efficiency and robustness. In contrast, the LoRA-tuned Gemma variants consistently exhibited the worst performance across all experiments, indicating that minimal fine-tuning was detrimental to generalisation and adaptation. This suggests that effective Parameter-Efficient Fine-Tuning requires a more nuanced strategy than simply applying minimal adjustments to foundation models.

However, zero-shot LLM/VLM pipelines utilising Gemini 2.5 yielded mixed results, performing poorly on text-based tasks but demonstrating competitive performance on multiclass image categorisation, matching the baseline established by the classical ResNet-50 model. These findings demonstrate that established machine learning models remain a reliable option in many medical categorisation scenarios, challenging the assumption that foundation models are universally superior. The research establishes that the effectiveness of PEFT is highly dependent on the adaptation strategy, highlighting the need for careful consideration of fine-tuning parameters and techniques. This experiment offers crucial insights for medical AI practitioners, providing a robust, apples-to-apples comparison to aid in selecting the most appropriate modelling approach for practical healthcare deployment. By rigorously harmonising pre-processing, data splitting, and evaluation criteria, the team achieved a comprehensive assessment of both classical ML and contemporary foundation models. The work opens avenues for future research focused on optimising PEFT strategies and exploring the potential of multimodal AI in complex medical classification tasks, ultimately aiming to improve diagnostic accuracy and patient care.

Medical Image and Text Classification Benchmarking is crucial

Scientists undertook a comprehensive benchmarking study contrasting traditional machine learning with contemporary large language and vision-language models for medical classification tasks. The research employed four publicly available datasets, each encompassing both text and image modalities and varying in complexity between binary and multiclass problems. To ensure rigorous comparison, the team meticulously aligned data splits and evaluation metrics across all experiments. Classical machine learning models, specifically Logistic Regression (LR), LightGBM, and ResNet-50, were implemented as baselines, representing established techniques in medical categorization.
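As an illustration, the snippet below sketches how such a ResNet-50 image baseline is commonly set up with torchvision, replacing the ImageNet classification head; the number of classes and optimiser settings are assumptions, not the study's reported configuration.

```python
# Sketch of a ResNet-50 image baseline (illustrative; num_classes and the
# training hyper-parameters are assumptions, not the paper's configuration).
import torch
import torch.nn as nn
from torchvision import models

num_classes = 4                                           # e.g. a multiclass imaging task
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # replace the ImageNet head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One supervised training step on a batch of images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```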

Researchers then engineered three distinct model classes for evaluation: classical ML, prompt-based LLMs/VLMs utilising Gemini 2.5, and fine-tuned PEFT models leveraging LoRA-adapted Gemma3 variants. The study pioneered a consistent experimental setup, meticulously controlling for pre-processing steps and data handling procedures across all pipelines. Specifically, the team implemented zero-shot prompting with Gemini 2.5 for initial LLM/VLM evaluation, assessing their ability to perform classification without task-specific training. Subsequently, they fine-tuned Gemma3 variants using LoRA, a parameter-efficient fine-tuning technique, to investigate the impact of adaptation on performance.
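The sketch below illustrates how LoRA adapters are typically attached to a Gemma-family model with the Hugging Face peft library; the checkpoint name, adapter rank, and target modules are assumptions, and the paper's exact adaptation recipe may differ.

```python
# Minimal sketch of attaching LoRA adapters to a Gemma-family model with `peft`.
# The checkpoint name, rank, and target modules are assumptions; the paper's
# exact adaptation recipe is not reproduced here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

checkpoint = "google/gemma-3-1b-it"            # hypothetical choice of Gemma 3 variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
base_model = AutoModelForCausalLM.from_pretrained(checkpoint)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                       # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],       # attention projections to adapt
)
model = get_peft_model(base_model, lora_cfg)
model.print_trainable_parameters()             # only the adapter weights are trainable

# Classification can then be framed generatively: the model is trained to emit
# the label token(s) for each input while only the LoRA adapters are updated.
```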

Experiments harnessed consistent data splits, ensuring each model was evaluated on the same training, validation, and test sets, and employed aligned metrics to facilitate direct performance comparisons. The team meticulously recorded performance across each dataset and model class, focusing on key metrics relevant to medical classification accuracy. Notably, the study innovated by evaluating models on both structured text and unstructured clinical data, including medical images and integrated reports, to assess generalizability across diverse data types. This approach enabled a nuanced understanding of model strengths and weaknesses in different medical categorization scenarios.
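To keep such comparisons apples-to-apples, every model class can be scored by one shared helper on identical test labels; the specific metrics below (accuracy and macro-F1) are an assumption about what "aligned metrics" entails, chosen because they suit both binary and multiclass tasks.

```python
# Shared evaluation helper applied identically to every model class, so that
# classical ML, zero-shot LLM/VLM, and PEFT predictions are scored the same way.
# The metric choice (accuracy, macro-F1) is an assumption, not the paper's list.
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# Usage: the same held-out labels are reused for every pipeline's predictions.
# results = {name: evaluate(y_test, preds) for name, preds in all_predictions.items()}
```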

The study’s design deliberately addressed limitations in existing benchmarking practices, notably the lack of cross-modality alignment and inconsistent evaluation rigor. By simultaneously comparing models across text and image modalities, using the same metrics, the research provides a robust, apples-to-apples assessment of performance. Furthermore, the inclusion of both binary and multiclass tasks, particularly in the context of medical imaging with VLMs, expands the scope of evaluation beyond simpler classification problems. This detailed methodology facilitated the discovery that traditional machine learning models often outperform contemporary foundation models in many medical categorization tasks, particularly with structured text data.

Traditional ML outperforms Transformers on medical data

Scientists achieved a rigorous, unified benchmark for medical classification utilising four publicly available datasets encompassing both text and image modalities with varying complexity. The research contrasted traditional Machine Learning (ML) techniques with contemporary transformer-based approaches, evaluating three distinct model classes for each task: classical ML models, including Logistic Regression (LR), LightGBM, and ResNet-50; prompt-based LLMs/VLMs, specifically Gemini 2.5; and PEFT models built on LoRA-adapted Gemma3 variants. All experiments employed consistent data splits and aligned metrics to ensure a fair comparison of performance.

Results demonstrate that traditional ML models consistently achieved the best overall performance across the majority of medical categorisation tasks. These models excelled particularly on structured text-based datasets, where they delivered exceptionally high accuracy. Conversely, the LoRA-tuned Gemma variants exhibited the worst performance across all text and image experiments, indicating a failure to generalise effectively from the limited fine-tuning provided. The team measured a significant performance gap between the classical ML models and the LoRA-tuned Gemma variants, highlighting the importance of effective adaptation strategies.

However, zero-shot LLM/VLM pipelines utilising Gemini 2.5 yielded mixed results: they performed poorly on text-based tasks but were competitive on the multiclass image task, matching the classical ResNet-50 baseline and suggesting potential for image-based applications. This finding indicates that foundation models are not universally superior and that the effectiveness of Parameter-Efficient Fine-Tuning (PEFT) is highly dependent on the adaptation strategy; minimal fine-tuning proved detrimental here, as the LoRA-tuned Gemma variants consistently underperformed and showed the poorest generalisation across all four text and image datasets. Taken together, the measurements confirm that established machine learning models, which achieved the highest overall performance, particularly with structured text data, remain the most reliable option in many medical categorisation scenarios, and that foundation models require substantial adaptation to compete with established methods.

Importantly, extending the training duration for the PEFT models to 5 or 10 epochs significantly improved performance, suggesting that effective adaptation demands more extensive training than commonly assumed for “few-shot” learning. Researchers acknowledged that near-perfect performance on text-based tasks by classical models might indicate easy separability or potential data leakage, a limitation that prevents attributing success solely to model architecture. The authors also noted the increased computational cost and potential safety risks (hallucination, format non-compliance) associated with LLM-based inference compared to classical models. Future work could explore more sophisticated PEFT strategies and investigate the impact of different data augmentation techniques to enhance the adaptability of foundation models. Overall, this research empirically demonstrates that, for many medical classification tasks, LLMs do not represent a complete solution and that traditional methods remain dependable and efficient alternatives.
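To make the format-compliance point concrete, the sketch below shows a defensive zero-shot classification call of the kind such pipelines require, using the public google-generativeai SDK; the model name, label set, and prompt wording are assumptions rather than the paper's setup, and the fallback illustrates why non-compliant outputs are a practical concern.

```python
# Sketch of a zero-shot LLM classification call with defensive output parsing.
# The SDK usage follows the public google-generativeai package; the model name,
# labels, and prompt are assumptions, not the paper's exact pipeline.
import google.generativeai as genai

# genai.configure(api_key=...) must be called once before use.
LABELS = ["benign", "malignant"]              # illustrative label set

def zero_shot_classify(text: str) -> str:
    prompt = (
        "Classify the following clinical note. "
        f"Answer with exactly one word from {LABELS}.\n\n{text}"
    )
    model = genai.GenerativeModel("gemini-2.5-flash")
    reply = model.generate_content(prompt).text.strip().lower()
    # Guard against hallucinated or malformed answers: fall back to a default label.
    return reply if reply in LABELS else LABELS[0]
```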

👉 More information
🗞 LLM is Not All You Need: A Systematic Evaluation of ML vs. Foundation Models for text and image based Medical Classification
🧠 ArXiv: https://arxiv.org/abs/2601.16549

Rohail T.

A quantum scientist exploring the frontiers of physics and technology, I focus on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Entanglement Hyperlinks Achieve Exact Representation of Multipartite Entanglement Entropy for Pure States
January 28, 2026

Quantum Gravity Landscape Achieves Finite Complexity Bounds with Tame Geometry
January 28, 2026

Quantum Computers Distinguish Synthetic Unravelings, Revealing Dynamics Beyond Ensemble Averages
January 28, 2026