BatchEnsemble Fails to Achieve Ensemble Performance on CIFAR-10/10C/SVHN Datasets

Researchers are increasingly focused on obtaining reliable uncertainty estimates for machine learning models operating with limited computational resources. Anton Zamyatin, Patrick Indri, and Sagar Malhotra, together with Thomas Gärtner, all of TU Wien, investigated whether BatchEnsemble, a technique designed to mimic the benefits of full ensembles at reduced computational cost, truly functions as an ensemble of models. Their work reveals that BatchEnsemble surprisingly fails to deliver ensemble-like performance on standard benchmarks such as CIFAR-10, SVHN, and MNIST, instead behaving more like a single model in terms of accuracy, calibration, and ability to detect out-of-distribution data. This finding is significant because it challenges the assumed efficiency gains of BatchEnsemble and highlights the importance of genuine diversity among ensemble members for robust uncertainty quantification.

The research, published as a preprint and slated for presentation at the 2025 Workshop on Epistemic Intelligence in Machine Learning, reveals that BatchEnsemble not only underperforms traditional Deep Ensembles but also closely mirrors the performance of a single model in terms of accuracy, calibration, and out-of-distribution (OOD) detection on benchmark datasets including CIFAR-10, CIFAR-10C, and SVHN. The team reached this conclusion by rigorously comparing BatchEnsemble to Deep Ensembles and MC Dropout across various image classification tasks, evaluating predictive performance, calibration, and OOD detection capabilities. The study uncovers a critical limitation of BatchEnsemble: its inability to explore the parameter space as effectively as a true ensemble.

The researchers show theoretically that BatchEnsemble, which applies learned rank-1 perturbations to a shared weight matrix, can access only a minuscule portion of the parameter space reachable by independently trained Deep Ensemble members. A controlled experiment on the MNIST dataset confirms this: BatchEnsemble members are remarkably similar in both their functional behaviour and their parameter values, indicating a limited capacity to generate diverse predictive modes. Consequently, the work establishes that BatchEnsemble behaves more like a single model than a genuine ensemble, failing to capture the robust epistemic uncertainty that ensembles are known for.

BatchEnsemble for Efficient Uncertainty Quantification in Neural Networks

Scientists investigated the efficacy of BatchEnsemble, a technique designed to reduce the computational cost of ensemble methods for uncertainty estimation in neural networks. The research directly addressed the challenge of obtaining robust epistemic uncertainty (EU) in resource-constrained and low-latency environments, where training multiple full-size models is often impractical. To rigorously assess BatchEnsemble, the team compared its performance against both Deep Ensembles and MC Dropout across a range of benchmark datasets. Experiments employed CIFAR-10, CIFAR-10C, and SVHN datasets to evaluate predictive performance, calibration, and out-of-distribution (OOD) detection capabilities.

The study pioneered a controlled analysis on the MNIST dataset to examine the functional and parameter-space diversity of BatchEnsemble members. Researchers implemented BatchEnsemble by applying learned rank-1 perturbations to a shared weight matrix, effectively embedding an ensemble within a single network and thereby reducing parameter and memory costs. Specifically, for a layer with weight matrix W ∈ R^(m×n), member-specific weights were computed as W_i = W ∘ r_i s_i^⊤, where ∘ denotes the Hadamard product and r_i s_i^⊤ is the rank-1 matrix formed from vectors r_i ∈ R^m and s_i ∈ R^n. This reduces the per-layer parameter count from O(kmn), the cost of k independent members in a Deep Ensemble, to O(mn + k(m+n)), a substantial saving when k is much smaller than min(m, n).
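The rank-1 parameterization described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation; the dimensions and variable names are assumptions chosen for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy BatchEnsemble linear layer: shared weight W (m x n) plus
# per-member rank-1 "fast" factors r_i (length m) and s_i (length n).
m, n, k = 4, 3, 2
W = rng.normal(size=(m, n))   # shared slow weights: m*n parameters
r = rng.normal(size=(k, m))   # fast weights: k*m parameters
s = rng.normal(size=(k, n))   # fast weights: k*n parameters

# Member-specific weight W_i = W ∘ (r_i s_i^T) (Hadamard product),
# built for all k members at once via broadcasting -> shape (k, m, n)
W_members = W[None, :, :] * (r[:, :, None] * s[:, None, :])

# Per-layer parameter counts: shared-plus-rank-1 vs. k independent copies
batchensemble_params = m * n + k * (m + n)   # O(mn + k(m+n))
deep_ensemble_params = k * m * n             # O(kmn)
```

With realistic layer sizes (say m = n = 512 and k = 4), the rank-1 scheme stores roughly 262k + 4k parameters per layer instead of about 1M for four independent copies.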

The team used a vectorized implementation of the forward pass, enabling all k members to be computed in a single batched operation and thereby facilitating within-device parallelism and high-throughput inference. This contrasts with MC Dropout, which requires multiple stochastic forward passes to generate an ensemble of predictions. The research revealed that BatchEnsemble members on MNIST exhibited near-identical function and parameter characteristics, suggesting a limited capacity to realize distinct predictive modes, a critical finding that challenges its effectiveness as a true ensemble. Furthermore, empirical results showed that BatchEnsemble consistently underperformed both Deep Ensembles and MC Dropout across all evaluated predictive-performance metrics.
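The vectorized forward pass relies on the identity (W ∘ r_i s_i^⊤)x = r_i ∘ (W(s_i ∘ x)), so no member-specific weight matrix is ever materialized. A minimal sketch, with shapes and names chosen for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, B = 5, 4, 3, 2            # out dim, in dim, members, sub-batch size

W = rng.normal(size=(m, n))        # shared slow weights
r = rng.normal(size=(k, m))        # fast weights r_i
s = rng.normal(size=(k, n))        # fast weights s_i
x = rng.normal(size=(k, B, n))     # inputs, one sub-batch per member

# Batched forward for all k members in one pass:
# scale inputs by s_i, apply the shared W, then scale outputs by r_i
y = (x * s[:, None, :]) @ W.T * r[:, None, :]   # shape (k, B, m)

# Reference computation that materializes each member's weight matrix
y_ref = np.stack([(W * np.outer(r[i], s[i])) @ x[i].T for i in range(k)])
y_ref = y_ref.transpose(0, 2, 1)                # also shape (k, B, m)
```

Because the batched version is a single matrix multiply plus two elementwise scalings, all members share one pass through the hardware, which is the source of BatchEnsemble's inference-cost advantage over MC Dropout's repeated stochastic passes.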

CIFAR-10 Performance of Single Models and Ensembles

On the CIFAR-10 in-distribution benchmark, a single model achieved an accuracy of 0.941 ±0.004, alongside a Negative Log Likelihood (NLL) of 0.237 ±0.003 and an Expected Calibration Error (ECE) of 0.034 ±0.003. On the CIFAR-10C distribution-shift benchmark (corruption intensity 5), the single model achieved an accuracy of 0.558 ±0.008, an NLL of 2.545 ±0.142, and an ECE of 0.323 ±0.014. MC Dropout scored 0.578 ±0.002 accuracy, 2.054 ±0.084 NLL, and 0.251 ±0.009 ECE, while Deep Ensembles reached 0.575 ±0.006 accuracy, 1.682 ±0.045 NLL, and 0.206 ±0.009 ECE. The team measured BatchEnsemble’s performance as 0.547 ±0.015 accuracy, 2.357 ±0.164 NLL, and 0.308 ±0.012 ECE.
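The ECE figures above follow the standard binned estimator: predictions are grouped by confidence, and the gap between average accuracy and average confidence is weighted by bin size. A minimal sketch (the bin count and binning scheme are assumptions, since conventions vary across papers):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - confidence|."""
    conf = probs.max(axis=1)            # predicted confidence per example
    pred = probs.argmax(axis=1)         # predicted class per example
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Toy case: fully confident and always correct -> perfectly calibrated
probs = np.eye(3)[[0, 1, 2, 0]]
labels = np.array([0, 1, 2, 0])
```

On this toy input the ECE is zero; flipping every label while keeping full confidence would instead drive it to its maximum of 1, which is the miscalibration pattern the CIFAR-10C numbers above quantify at smaller scale.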

On the SVHN out-of-distribution benchmark, Deep Ensembles achieved the highest Area Under the Precision-Recall curve (AUPR) of 0.953 ±0.006 and Area Under the Receiver Operating Characteristic curve (AUROC) of 0.924 ±0.011, with a False Positive Rate at 95% True Positive Rate (FPR95) of 0.199 ±0.024. MC Dropout scored 0.932 ±0.008 AUPR, 0.884 ±0.009 AUROC, and 0.324 ±0.019 FPR95, while BatchEnsemble yielded 0.932 ±0.015 AUPR, 0.888 ±0.020 AUROC, and 0.297 ±0.061 FPR95. BatchEnsemble’s Jensen-Shannon divergence (JSD) between member predictions remained near zero at 0.010 ±0.006, indicating limited epistemic uncertainty, whereas Deep Ensembles exhibited a JSD of 0.277 ±0.032.

A controlled study on MNIST with a 3-layer perceptron and k=4 members revealed that BatchEnsemble members exhibited near-identical function and parameter characteristics. Pairwise prediction disagreement was near zero across in-distribution, distribution-shifted, and out-of-distribution test sets, and the cosine similarity of member weights approached 1. This demonstrates a lack of functional and parametric diversity within BatchEnsemble, suggesting it behaves more like a single model than a true ensemble of distinct models.
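The two diversity diagnostics used in the MNIST study, pairwise prediction disagreement and cosine similarity of member weights, are straightforward to compute. A hedged sketch with hypothetical toy data (the function names and inputs are illustrative, not from the paper):

```python
import numpy as np

def pairwise_disagreement(preds):
    """Average over member pairs of the fraction of inputs where their
    argmax predictions differ. preds has shape (k_members, n_examples)."""
    k = preds.shape[0]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    return float(np.mean([(preds[i] != preds[j]).mean() for i, j in pairs]))

def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy case mirroring the paper's finding: two members that predict
# identically disagree on 0% of inputs, and proportional weight
# vectors have cosine similarity 1.
preds = np.array([[0, 1, 2, 1],
                  [0, 1, 2, 1]])
w1 = np.arange(1.0, 7.0).reshape(2, 3)
```

For a genuine ensemble one would expect nonzero disagreement (especially off-distribution) and weight similarity well below 1; the paper reports BatchEnsemble sitting at the degenerate end of both diagnostics.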

BatchEnsemble fails to emulate ensemble diversity effectively

Scientists have demonstrated that BatchEnsemble, a method designed to mimic the uncertainty estimation of full ensembles with reduced computational cost, largely behaves like a single neural network. Their research, conducted on datasets including CIFAR-10, CIFAR-10C, SVHN, and MNIST, reveals that BatchEnsemble fails to replicate the predictive disagreement and diversity characteristic of true ensembles. Across these benchmarks, the method’s accuracy, calibration, and out-of-distribution detection capabilities closely aligned with those of a single model, indicating limited epistemic uncertainty. A controlled study on the MNIST dataset further substantiated these findings, showing near-identical functional and parameter behaviour among BatchEnsemble members.

This suggests a limited capacity to generate genuinely distinct predictive modes, despite the intention of achieving ensemble-like uncertainty with fewer parameters. The authors acknowledge a discrepancy between their results and previously reported stronger performance of BatchEnsemble, attributing this to differences in model scale. They recommend future work to investigate BatchEnsemble’s behaviour with increased model size, depth, and training time, as well as across diverse architectures and tasks, such as Transformers and regression problems. Additionally, developing a theoretical understanding of the limitations imposed by BatchEnsemble’s rank-1 parameterization could clarify the method’s fundamental constraints.

👉 More information
🗞 Is BatchEnsemble a Single Model? On Calibration and Diversity of Efficient Ensembles
🧠 ArXiv: https://arxiv.org/abs/2601.16936

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
