Tabular foundation models (TFMs) are rapidly emerging as powerful tools for analyzing structured data, with zero-shot performance that rivals traditional machine learning techniques. Aditya Tanna, Pratinav Seth, and Mohamed Bouadi, all from Lexsi Labs, together with Vinay Kumar Sankarapu, investigate the often-assumed benefits of further refining these models through fine-tuning. Their study, the first of its kind across the TALENT, OpenML-CC18, and TabZilla benchmarks, finds that zero-shot TFMs already perform strongly and that the gains from fine-tuning are surprisingly variable. The work challenges conventional wisdom by showing that full supervised fine-tuning can sometimes reduce accuracy, and it offers practical guidance on when and how to fine-tune on tabular data, taking into account data imbalance, dataset size, and dimensionality, as well as effects on calibration and fairness.
Methodology
The evaluation comprised binary and multi-class classification tasks conducted on a shared subset of datasets drawn from three major benchmarks: TALENT, TabZilla, and OpenML-CC18. The study assessed a range of models, including XGBoost, LightGBM, CatBoost, TabR, TabPFN, TabDPT, Mitra, OrionMSP, TabICL, and OrionBiX. Performance was compared under three learning regimes: zero-shot inference, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT).
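To make the zero-shot regime concrete, the sketch below runs a TFM as an off-the-shelf classifier with no gradient updates. It assumes the scikit-learn-style interface exposed by the open-source `tabpfn` package; the dataset, split, and metrics here are illustrative, not the authors' exact evaluation harness.

```python
# Minimal zero-shot sketch (assumes the open-source `tabpfn` package;
# not the authors' exact harness or model versions).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = TabPFNClassifier()   # pretrained weights, no task-specific training
clf.fit(X_tr, y_tr)        # "fit" only provides the in-context examples
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print("macro F1:", f1_score(y_te, pred, average="macro"))
```

Supervised and parameter-efficient fine-tuning differ from this baseline only in whether, and how many, model parameters are updated on the downstream training split.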
Key Findings
Tabular foundation models demonstrated strong zero-shot performance across diverse datasets, often outperforming fine-tuned variants in both accuracy and F1 score. In contrast, full supervised fine-tuning frequently degraded performance and calibration, particularly on small datasets; it proved beneficial only for models that support full parameter updates, such as TabPFN.
Parameter-efficient fine-tuning, implemented via low-rank adaptation methods such as LoRA, successfully recovered performance losses while mitigating overfitting. PEFT was especially effective for models like TabDPT and TabPFN, offering stable improvements with minimal computational overhead.
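To illustrate the low-rank adaptation idea, the sketch below wraps a frozen linear layer with trainable rank-r factors. It is a generic PyTorch rendering of LoRA, not the specific PEFT configuration or adapter placement used in the study.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus low-rank correction; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical usage: adapt one projection inside a pretrained tabular model.
layer = nn.Linear(256, 256)
adapted = LoRALinear(layer, r=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # 4,096 vs. 65,792 in the frozen base
```

Because only the small A and B matrices are optimized, the pretrained weights stay intact, which is consistent with the reported tendency of PEFT to avoid the overfitting seen under full supervised fine-tuning.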
Meta-learning yielded modest but consistent gains, particularly for imbalanced datasets. Practitioner-oriented guidelines emerging from the study recommend prioritizing zero-shot inference for small datasets, applying fine-tuning selectively for medium-sized or wide-feature datasets, and considering meta-learning when class imbalance is a concern. Caution is advised when applying full supervised fine-tuning, as it often compromises both predictive accuracy and calibration.
Model and Dataset-Specific Insights
OrionMSP and TabPFN consistently achieved the strongest zero-shot performance across benchmarks. While meta-learning produced moderate architecture-dependent gains, full supervised fine-tuning significantly reduced accuracy and F1 scores for several models, most notably TabICL and OrionBiX. PEFT variants partially mitigated these losses, with TabDPT exhibiting the most robust and stable improvements.
Performance trends varied with dataset size. Fine-tuning was most beneficial for medium-sized datasets (1,000–10,000 samples), where TabPFN and OrionMSP showed small but consistent gains. For smaller datasets (<1,000 samples), zero-shot inference remained superior due to overfitting during fine-tuning. On large datasets (>10,000 samples), adaptation offered limited advantages over zero-shot predictions.
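The size-dependent pattern above can be condensed into a rough decision heuristic. The sketch below encodes the reported thresholds as a plain Python function; the sample-size cutoffs follow the findings, while the specific imbalance threshold is an illustrative assumption, not a value from the paper.

```python
def choose_adaptation(n_samples: int, minority_fraction: float,
                      supports_peft: bool) -> str:
    """Rough heuristic distilled from the reported findings (illustrative only)."""
    if n_samples < 1_000:
        # Small datasets: fine-tuning tended to overfit; stay zero-shot.
        return "zero-shot"
    if minority_fraction < 0.2:               # assumed imbalance threshold
        # Imbalanced data: meta-learning gave modest but consistent gains.
        return "meta-learning"
    if n_samples <= 10_000:
        # Medium-sized data: adapt selectively, preferring PEFT over full SFT.
        return "peft" if supports_peft else "selective fine-tuning"
    # Large datasets: adaptation offered limited advantage over zero-shot.
    return "zero-shot"

print(choose_adaptation(5_000, minority_fraction=0.35, supports_peft=True))  # -> peft
```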
Calibration and Fairness
Calibration analysis, measured using Expected Calibration Error (ECE), indicated that zero-shot TFMs delivered the most reliable predictions overall, with OrionMSP and TabPFN achieving the lowest ECE values. Meta-learning largely preserved calibration quality, whereas supervised fine-tuning substantially worsened calibration across most models.
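For reference, Expected Calibration Error bins predictions by confidence and averages the gap between confidence and accuracy within each bin, weighted by bin size. The implementation below is a standard equal-width-bin formulation, not code from the paper.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 15) -> float:
    """ECE with equal-width confidence bins (standard formulation)."""
    confidences = probs.max(axis=1)          # predicted-class confidence
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # |avg confidence - avg accuracy|, weighted by the bin's share of samples.
            ece += mask.mean() * abs(confidences[mask].mean() - accuracies[mask].mean())
    return ece

# Toy example: three predictions over two classes.
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 1, 1])
print(round(expected_calibration_error(probs, labels), 3))
```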
Fairness evaluation revealed a trade-off between predictive performance and equity. Mitra achieved the lowest disparity metrics but at the cost of significantly reduced accuracy. Zero-shot and meta-learning approaches generally provided the best balance between performance and fairness, while supervised fine-tuning exhibited the largest variability in fairness outcomes.
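The paper's exact disparity metrics are not restated here; as one common example of how such a metric is computed, demographic parity difference compares positive-prediction rates across groups defined by a sensitive attribute. The sketch below is illustrative only and should not be read as the study's fairness protocol.

```python
import numpy as np

def demographic_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Absolute gap in positive-prediction rates between two groups
    (a common disparity metric; not necessarily the one used in the study)."""
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return abs(rate_a - rate_b)

y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # hypothetical sensitive attribute
print(demographic_parity_difference(y_pred, group))   # 0.75 - 0.25 = 0.5
```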
Limitations
The study focused exclusively on binary and multi-class classification tasks. PEFT was not supported for all evaluated models, and sensitive attributes used for fairness evaluation were manually defined and varied across datasets.
Conclusion
TFMs exhibit strong zero-shot performance across a wide range of tabular benchmarks. While fine-tuning can be beneficial in specific scenarios—particularly for medium-sized datasets—full supervised fine-tuning often degrades performance and calibration. Meta-learning and PEFT offer moderate, reliable gains without severe trade-offs. Overall, zero-shot inference remains the most robust choice in low-data regimes or when calibration and fairness are critical considerations.
Zero-Shot and Fine-Tuning Performance of Tabular Models
This work establishes that tabular foundation models demonstrate considerable capability through zero-shot inference on structured data, often rivaling traditional machine learning techniques. While fine-tuning these models can offer benefits, the extent of improvement depends strongly on both the specific model architecture and the characteristics of the dataset itself. The research shows that supervised fine-tuning does not consistently improve performance and can, in some instances, harm both accuracy and calibration quality. The study provides a comprehensive comparison of learning regimes, including zero-shot inference, parameter-efficient fine-tuning, and full supervised fine-tuning, across several established benchmarks.
Findings highlight the importance of considering factors such as data imbalance, dataset size, and dimensionality when deciding whether to adapt a tabular foundation model or rely on its inherent zero-shot abilities. The authors acknowledge limitations including a focus on binary and multi-class classification tasks, and the manual definition of sensitive attributes used for fairness evaluation. Furthermore, the evaluation was conducted on a common subset of data from each benchmark, reducing the total data available for analysis. Future work could explore the application of these models to a wider range of tabular data tasks and investigate methods for automatically defining fairness-related attributes. These insights offer practical guidance for practitioners seeking to leverage tabular foundation models effectively, emphasizing a nuanced understanding of the trade-offs between adaptation and zero-shot performance.
👉 More information
🗞 Exploring Fine-Tuning for Tabular Foundation Models
🧠 ArXiv: https://arxiv.org/abs/2601.09654
