Zero-shot learning promises foundation models that can make predictions without task-specific training, but the computational cost of these models often remains unclear. Aayam Bansal and Ishaan Gangwani, both from IEEE, alongside their colleagues, now present a detailed analysis of the hardware demands of leading tabular foundation models. Their work introduces a reproducible benchmark that reveals a significant trade-off between prediction accuracy and resource consumption, demonstrating that traditional tree-based methods often match or exceed the performance of foundation models while requiring dramatically less processing time and memory. This research quantifies the hidden hardware costs of current tabular foundation models, establishing a crucial baseline for developing more efficient and sustainable machine learning techniques.
TabPFN-1.0 and TabICL-base were evaluated against carefully tuned XGBoost, LightGBM, and Random Forest models on a single NVIDIA T4 GPU. The tree ensembles equal or surpass foundation-model accuracy on three datasets while completing a full test batch in under 0.40 seconds and using less than 150 MB of RAM with no VRAM. TabICL gains a modest 0.8 percentage points on the Higgs dataset but incurs roughly 40,000× more latency (960 seconds) and requires 9 GB of VRAM; TabPFN matches tree accuracy on Wine and Housing yet peaks at 4 GB of VRAM and cannot process the full 100,000-row Higgs table. These findings quantify a substantial trade-off between hardware requirements and accuracy, delivering an open baseline for future efficiency-oriented research on tabular foundation models.
Foundation Model Benchmarking on Tabular Datasets
The study benchmarks the performance of zero-shot foundation models against established tree-based methods on tabular data, quantifying both predictive accuracy and hardware demands. Researchers employed four publicly available datasets, Adult-Income, Higgs-100k, Wine-Quality, and California-Housing, for a comprehensive evaluation. Experiments were conducted on a single NVIDIA T4 GPU with 2 vCPUs and 13 GB of RAM, hosted on the Kaggle platform, to ensure a controlled comparison. Five models were evaluated: XGBoost 1.7, LightGBM 4.3, scikit-learn Random Forest, TabPFN-1.0, and TabICL-base. Tree-based models were tuned via a 15-trial randomized search with stratified 3-fold cross-validation, optimizing their performance on each dataset. Foundation models, by contrast, were assessed in a zero-shot manner, without any gradient updates or fine-tuning. A key constraint on TabPFN-1.0 is its 10,000-row processing limit, which necessitated random subsampling for training on the Adult, Higgs, and Housing datasets. The study rigorously measured test accuracy, wall-clock latency per test batch, peak RAM usage, and peak VRAM allocated, using the psutil and torch.cuda libraries. Statistical analysis, employing the Friedman test and the Nemenyi post-hoc test, was performed on accuracy ranks to determine significant differences between models. Hardware cost ratios were then calculated relative to XGBoost, establishing a baseline for comparison. This detailed methodology allows a nuanced understanding of the trade-offs between accuracy and resource consumption in current tabular foundation models.
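The per-batch measurement loop described above can be sketched with standard-library tools. The paper itself reads process RSS through psutil and GPU peaks through torch.cuda, so the tracemalloc-based harness below is only an illustrative stand-in for CPU-side memory, and `predict_fn` is a hypothetical placeholder for any trained model's predict method:

```python
import time
import tracemalloc

def measure_inference(predict_fn, X):
    """Time a full-batch predict call and track peak allocation.

    The paper records process RSS via psutil and GPU peaks via
    torch.cuda.max_memory_allocated(); tracemalloc is a stdlib
    stand-in that captures the same idea for Python-heap usage.
    """
    tracemalloc.start()
    t0 = time.perf_counter()
    preds = predict_fn(X)                      # full test batch at once
    latency = time.perf_counter() - t0         # wall-clock seconds
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return preds, latency, peak_bytes / 2**20  # peak in MiB

# Toy predictor standing in for a trained model's predict method.
preds, latency, peak_mib = measure_inference(
    lambda X: [sum(row) for row in X],
    [[1.0, 2.0]] * 1000,
)
```

In the actual benchmark the same three numbers (latency, peak RAM, peak VRAM) would be logged once per model-dataset pair and fed into the cost-ratio table.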
Foundation Models Versus Tuned Tree Performance
This work presents a comprehensive benchmark evaluating zero-shot foundation models on tabular data, directly comparing them to tuned gradient-boosted decision trees. Researchers measured test accuracy, wall-clock latency, peak RAM usage, and peak GPU VRAM consumption across four public datasets, Adult-Income, Higgs-100k, Wine-Quality, and California-Housing, using a single NVIDIA T4 GPU. The study aimed to quantify the hardware demands of these emerging models and establish a baseline for future efficiency-focused research. Experiments demonstrate that the tree-based ensembles, specifically XGBoost and LightGBM, achieve consistently high accuracy, reaching 87.45% on the Adult-Income dataset and exceeding 91% on the California-Housing dataset. Random Forest also delivers competitive results, maintaining strong performance across all datasets. Among the foundation models, TabICL achieves a 0.8-percentage-point gain on the Higgs dataset, reaching 73.29% accuracy, and performs well on Wine at 90.00%. However, TabICL requires substantially more resources to achieve these gains. Crucially, the results reveal significant trade-offs between accuracy and hardware consumption. While TabICL achieves competitive accuracy on certain datasets, it demands roughly 40,000× more latency (960 seconds) and 9 GB of VRAM, whereas the tree-based models complete full test batches in under 0.40 seconds and use minimal RAM. TabPFN, limited to processing 10,000 rows by architectural constraints, matches the accuracy of tree models on Wine and Housing but peaks at 4 GB of VRAM and cannot process the full 100,000-row Higgs table. These findings demonstrate that current tabular foundation models impose a substantial hardware tax, suggesting their primary value lies in rapid prototyping on small tables rather than large-scale production inference.
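The hardware tax described above reduces to simple ratios against a baseline. A minimal sketch using only figures stated in the text; note that the paper normalizes against XGBoost's own measured latency (well below the 0.40-second tree-ensemble bound), so the ratio computed here against that bound is a conservative lower estimate of the reported ~40,000× gap:

```python
def cost_ratio(model_cost: float, baseline_cost: float) -> float:
    """Resource cost of a model expressed as a multiple of a baseline."""
    return model_cost / baseline_cost

# Figures reported in the text: TabICL needs ~960 s for the Higgs
# test batch, while the tree ensembles finish in under 0.40 s.
ratio = cost_ratio(960.0, 0.40)
print(round(ratio))  # 2400
```

The same ratio applies to memory: 9 GB of VRAM against the trees' zero VRAM footprint is effectively unbounded, which is why the paper reports VRAM as absolute peaks rather than ratios for the GPU-free baselines.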
Tabular Models Offer No Accuracy Advantage
This study presents the first controlled comparison of zero-shot tabular foundation models with tuned decision-tree ensembles, evaluating both accuracy and hardware costs. Researchers found no statistically significant difference in overall accuracy between the foundation models and tree-based methods like XGBoost and LightGBM, with performance varying within a narrow margin across four public datasets. However, substantial differences emerged in computational efficiency: tree-based methods completed full-batch inference in under 0.4 seconds using minimal memory, while one foundation model required 960 seconds and significant video RAM to achieve a marginal accuracy gain on a single dataset.
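The significance claim above rests on the Friedman test over per-dataset accuracy ranks (the paper pairs it with a Nemenyi post-hoc test; in practice these are commonly computed via scipy.stats.friedmanchisquare and the scikit-posthocs package). A self-contained sketch of the rank-sum statistic, with illustrative scores rather than the paper's actual results:

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic over accuracy ranks.

    `scores` is a list of per-dataset rows, one column per model.
    Higher score -> better rank (rank 1 is best); ties get average ranks.
    """
    n = len(scores)      # number of datasets
    k = len(scores[0])   # number of models
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # extend over a block of tied scores
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for the tied block
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # chi^2_F = 12 / (n k (k+1)) * sum R_j^2  -  3 n (k+1)
    return (12 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

# Illustrative scores: 2 datasets x 3 models (not the paper's numbers).
print(friedman_statistic([[0.9, 0.8, 0.7], [0.95, 0.85, 0.75]]))  # 4.0
```

Under the null hypothesis of equal model performance the statistic follows a chi-square distribution with k−1 degrees of freedom; only if it is significant does the Nemenyi post-hoc test identify which model pairs differ.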
The findings demonstrate that current zero-shot tabular foundation models offer no accuracy advantage over tuned gradient-boosted decision trees on medium-scale tabular tasks, while demanding far more hardware. Foundation models may be useful for rapid prototyping or exploration on smaller datasets, but their current hardware demands preclude deployment in real-time or resource-constrained environments. The authors suggest future research should focus on developing lightweight variants of these models through techniques like quantisation or distillation, or on hybrid pipelines that combine foundation-model-generated features with the efficient inference of tree-based learners. The researchers have made code and data available to facilitate reproducible, hardware-aware evaluation of structured-data foundation models.
👉 More information
🗞 Light-Weight Benchmarks Reveal the Hidden Hardware Cost of Zero-Shot Tabular Foundation Models
🧠 ArXiv: https://arxiv.org/abs/2512.00888
