Bayesian Transformers Achieve Diverse Intelligence with Sampling from a Single Model

The pursuit of artificial intelligence often focuses on creating single, definitive models, but recent work suggests intelligence can arise from the collective of many minds. Diji Yang and Yi Zhang of the University of California, Santa Cruz address this idea by introducing Population Bayesian Transformers (B-Trans), a novel approach that generates diverse yet coherent behaviours from a single set of pre-trained weights. Rather than treating the transformer as one fixed function, the method treats key parameters as probabilistic variables, effectively creating a ‘population’ of intelligent agents within one system. The team demonstrates that sampling from this population enhances both exploration and performance across a range of tasks, including zero-shot generation and reinforcement learning, a significant step towards more robust and adaptable artificial intelligence.

B-Trans introduces a Bayesian-motivated posterior proxy, treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation. This induces a distribution over model behaviour without the cost of training full Bayesian neural networks, and preserves coherence within each generation by freezing the sampled noise at the sequence level.
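The mechanism can be sketched in a few lines. The module below is a minimal numpy illustration, not the paper's implementation: it normalizes its input as a standard layer norm would, but reparameterizes the bias as beta + sigma·eps, drawing eps once per sequence via `resample()` and reusing it for every token so the generation stays coherent. All names and hyperparameters (e.g. the initial value of `log_sigma`) are illustrative assumptions.

```python
import numpy as np

class StochasticLayerNorm:
    """Layer norm whose bias carries a Gaussian perturbation, sampled once
    per sequence and then frozen (a toy sketch of the B-Trans idea; the
    paper's actual parameterization may differ)."""

    def __init__(self, dim, rng=None):
        self.gamma = np.ones(dim)            # scale (deterministic)
        self.beta = np.zeros(dim)            # bias mean (learned)
        self.log_sigma = np.full(dim, -2.0)  # bias std-dev, log-parameterized
        self.rng = rng or np.random.default_rng(0)
        self.eps_frozen = None               # per-sequence noise sample

    def resample(self):
        """Draw one model 'instance': sample eps ~ N(0, I) and freeze it."""
        self.eps_frozen = self.rng.standard_normal(self.beta.shape)

    def __call__(self, x):
        # Standard layer normalization over the feature dimension.
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + 1e-5)
        # Reparameterized stochastic bias: beta + sigma * eps.
        bias = self.beta + np.exp(self.log_sigma) * self.eps_frozen
        return self.gamma * x_hat + bias
```

Calling `resample()` before each generation yields a new instance; within a generation the frozen eps makes the layer deterministic, which is what keeps each sampled model's output coherent.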

Uncertainty via Noisy Bias Parameters

This paper proposes a novel approach to improve the reasoning capabilities of Large Language Models (LLMs) by introducing a simple, computationally efficient method for representing uncertainty. Instead of full Bayesian inference, the authors focus on adding noise to the bias terms of normalization layers within the LLM, creating a local proxy for model uncertainty. By sampling different noise configurations, the model effectively behaves as an ensemble, leading to more robust and reliable reasoning. This method significantly improves performance in reinforcement learning scenarios with sparse rewards, suggesting the uncertainty representation helps the model explore more effectively.

The authors add noise to the bias terms of normalization layers during both training and inference, introducing controlled variation in the model’s activations and inducing a distribution over possible behaviours. Sampling a noise configuration yields a model instance; different instances follow different reasoning paths while remaining individually competent, effectively creating a “wisdom of crowds” within one model. The approach is computationally inexpensive and easy to implement, making it practical for large-scale LLMs, and controlled experiments confirm that even this simplified uncertainty representation pays off. Aggregating predictions across sampled instances significantly enhances exploration, particularly in challenging environments with sparse rewards, where a single deterministic model tends to commit to one path too early.
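The ensemble effect can be illustrated with a toy simulation. Here, Gaussian noise added to a base logit vector stands in for resampling the stochastic bias terms (an assumption made purely for illustration, not the paper's setup), and the population's answers are aggregated by majority vote:

```python
import numpy as np

def sample_population_predictions(logits, k, sigma=0.5, rng=None):
    """Simulate k sampled model instances by perturbing a base logit
    vector with Gaussian noise (a toy stand-in for resampling the
    stochastic bias terms), then aggregate by majority vote."""
    rng = rng or np.random.default_rng(0)
    votes = []
    for _ in range(k):
        noisy = logits + sigma * rng.standard_normal(logits.shape)
        votes.append(int(np.argmax(noisy)))          # each instance answers
    # Wisdom of crowds: the modal answer across instances.
    values, counts = np.unique(votes, return_counts=True)
    return int(values[np.argmax(counts)]), votes
```

Individual instances occasionally deviate from the base model's answer, which is exactly what drives exploration, yet the aggregated vote remains stable.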

Measurements confirm that B-Trans traverses sparse-reward landscapes effectively, achieving deeper exploration than traditional action-space baselines. In a label-free Test-Time Reinforcement Learning setting, the implicit population leverages the wisdom of crowds, outperforming deterministic baselines even without ground-truth supervision. The framework thus treats the model not as a single entity but as a population of diverse instances drawn from one set of pre-trained weights, simulating the benefits of collective intelligence without the substantial computational cost of fully Bayesian neural networks. Future research could explore adaptive inference, where the level of injected uncertainty is dynamically adjusted to the complexity of the input, and ultimately a shift towards delivering language models as probabilistic distributions capable of adapting to the specific demands of each query.
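One common way to derive a label-free training signal of this kind is to use the population's majority answer as a pseudo-label and reward the instances that agree with it. The helper below is an illustrative sketch under that assumption; the paper's exact reward scheme in its Test-Time RL setting may differ.

```python
import numpy as np

def pseudo_rewards(answers):
    """Label-free reward in the spirit of Test-Time RL: with no ground
    truth available, take the majority answer across sampled instances
    as a pseudo-label and reward agreement with it.
    (Illustrative sketch; not the paper's exact reward scheme.)"""
    values, counts = np.unique(answers, return_counts=True)
    pseudo_label = values[np.argmax(counts)]         # the crowd's answer
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return rewards, pseudo_label
```

The signal is only as good as the crowd: it helps precisely when diverse instances are individually competent enough that the majority answer is usually right, which is the property the B-Trans sampling scheme is designed to preserve.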

👉 More information
🗞 Many Minds from One Model: Bayesian Transformers for Population Intelligence
🧠 ArXiv: https://arxiv.org/abs/2512.25063

Rohail T.

A quantum scientist exploring the frontiers of physics and technology, my work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

AI Achieves 99% Accuracy in Hierarchical Classification of Benign Laryngeal Voice Disorders

January 8, 2026
Diffusion Language Models Achieve Optimal Parallel Sampling with Polynomial-Length Chains

January 8, 2026
Multi-bandit Best Arm Identification Achieves Efficient Partner Selection for Sequential Support Network Learning

January 8, 2026