The challenge of enabling artificial intelligence to rapidly adapt to new environments without extensive retraining remains a significant hurdle in the field of reinforcement learning. Researchers Anaïs Berkes from the University of Cambridge, Vincent Taboga and Donna Vakalis from Mila – Quebec AI Institute, alongside David Rolnick from McGill University and Yoshua Bengio from Université de Montréal, present a novel approach to address this problem. Their work introduces SPICE, a Bayesian in-context reinforcement learning method which learns a prior over potential outcomes using deep ensembles and refines this understanding with new information. This innovative technique allows agents to achieve near-optimal decisions in unfamiliar tasks, demonstrating substantial improvements over existing methods and offering resilience even when initial training data is imperfect. The team mathematically proves SPICE’s ability to learn effectively in complex scenarios, paving the way for more adaptable and robust AI systems.
In-context reinforcement learning (ICRL) promises fast adaptation to unseen environments without parameter updates, but current methods either cannot improve beyond the training distribution or require near-optimal data. This work introduces SPICE, a novel approach to ICRL through Bayesian fusion of context and a value prior, enabling robust learning from limited and suboptimal data. The method represents the value prior with a deep ensemble and refines it with in-context information at test time, allowing for uncertainty quantification and effective knowledge transfer.
Bayesian Reinforcement Learning with Uncertainty and Exploration
Researchers developed SPICE, a Bayesian In-Context Reinforcement Learning (ICRL) method designed to rapidly adapt to new environments without updating model parameters. SPICE establishes a prior over Q-values using a deep ensemble, effectively modelling uncertainty, and refines this prior during testing using in-context information via Bayesian updates. To overcome challenges posed by suboptimal training data, the team implemented an Upper-Confidence Bound (UCB) rule within the online inference process, prioritising exploration and improving decision-making even with inaccurate priors. The core innovation lies in fusing contextual data with a learned value prior, creating a robust and adaptable system capable of operating effectively in unseen scenarios.
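To make the prior concrete, the sketch below (an illustration, not the authors' code) builds a deep-ensemble prior over Q-values: several independently initialised Q-networks are trained on the offline data, and their mean and disagreement at a state serve as the prior mean and variance that later fusion steps consume. The class names and sizes (QNet, EnsembleQPrior, hidden=64, k=5) are hypothetical choices.

```python
# Minimal sketch (not the authors' code): a deep-ensemble prior over Q-values.
# K independently initialised Q-networks are trained on offline data; their
# disagreement provides the prior variance used for Bayesian fusion at test time.
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

class EnsembleQPrior:
    """Prior over Q(s, ·): mean and variance across K ensemble members."""
    def __init__(self, state_dim: int, n_actions: int, k: int = 5):
        self.members = [QNet(state_dim, n_actions) for _ in range(k)]

    @torch.no_grad()
    def prior(self, s: torch.Tensor):
        qs = torch.stack([m(s) for m in self.members])  # (K, n_actions)
        return qs.mean(dim=0), qs.var(dim=0)             # prior mean, variance

# Example: prior over 4 actions for an 8-dimensional state
prior = EnsembleQPrior(state_dim=8, n_actions=4)
mu0, var0 = prior.prior(torch.randn(8))
```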
Mathematical proofs demonstrate that SPICE achieves regret-optimal behaviour in both stochastic bandit problems and finite-horizon Markov Decision Processes, even when initially trained on suboptimal trajectories. The researchers constructed a system that delivers actionable posteriors over Q-values, enabling principled exploration strategies such as UCB and Thompson Sampling, unlike many existing ICRL methods that only output logits. Experiments on both bandit and control benchmarks validate the theoretical findings, demonstrating that SPICE consistently achieves near-optimal decisions on unseen tasks and significantly reduces regret compared to previous ICRL and meta-RL algorithms. The study highlights SPICE’s ability to rapidly adapt to new tasks and maintain robustness under distribution shift, paving the way for practical applications in robotics and autonomous systems.
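Because the posterior over Q-values is explicit, both optimistic and posterior-sampling action rules follow directly. The snippet below is a minimal sketch assuming a diagonal Gaussian posterior N(mu, var) per action; the bonus scale beta is an illustrative parameter, not a value from the paper.

```python
# Minimal sketch: action selection from a per-action Gaussian posterior N(mu, var).
import torch

def ucb_action(mu: torch.Tensor, var: torch.Tensor, beta: float = 2.0) -> int:
    """Optimism: pick the action maximising posterior mean + beta * std."""
    return int(torch.argmax(mu + beta * var.sqrt()))

def thompson_action(mu: torch.Tensor, var: torch.Tensor) -> int:
    """Posterior sampling: draw one plausible Q-vector and act greedily on it."""
    sample = mu + var.sqrt() * torch.randn_like(mu)
    return int(torch.argmax(sample))

# Example: the most uncertain arm gets an optimism bonus even if its mean is low
mu = torch.tensor([0.2, 0.5, 0.1])
var = torch.tensor([0.30, 0.05, 0.40])
a_ucb, a_ts = ucb_action(mu, var), thompson_action(mu, var)
```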
Bayesian Reinforcement Learning Adapts Without Parameter Updates
Scientists have developed SPICE, a novel Bayesian in-context reinforcement learning (ICRL) method that achieves fast adaptation to new environments without updating model parameters. The research introduces a prior over Q-values, implemented via a deep ensemble, which is refined at test-time using in-context information through Bayesian updates. Experiments demonstrate that SPICE effectively recovers from suboptimal priors, even when initial training data is imperfect, by employing an Upper-Confidence Bound rule that encourages exploration and adaptation. Theoretical work proves that SPICE achieves regret-optimal behaviour in both stochastic bandit and finite-horizon Markov Decision Process (MDP) scenarios, even when pretrained on suboptimal trajectories.
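The test-time refinement can be pictured as a conjugate Gaussian update that mixes the ensemble prior with in-context rewards. The sketch below assumes a Gaussian reward model and is not necessarily the exact update used by SPICE; note that no network parameters are touched.

```python
# Sketch (assumed update form): fusing the ensemble prior N(mu0, var0) for one
# arm with n in-context reward observations of mean r_bar and known noise
# variance sigma2, via the standard conjugate Gaussian update.
import torch

def gaussian_fusion(mu0, var0, r_bar, n, sigma2=1.0):
    """Posterior mean/variance for Q after observing n rewards in context."""
    prior_prec = 1.0 / var0
    data_prec = n / sigma2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mu = post_var * (prior_prec * mu0 + data_prec * r_bar)
    return post_mu, post_var

# Example: a confident prior (var0 = 0.05) is moved only slightly by two noisy
# observations with mean 0.2; an uncertain prior would move much further.
mu, var = gaussian_fusion(mu0=torch.tensor(0.8), var0=torch.tensor(0.05),
                          r_bar=torch.tensor(0.2), n=2)
```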
Specifically, the cumulative regret of SPICE grows only logarithmically in stochastic bandits, and in finite-horizon MDPs SPICE attains the minimax-optimal regret rate, with any miscalibration in the ensemble prior contributing only a constant warm-start term. In particular, a well-calibrated prior eliminates the warm-start term entirely, while an uninformative prior reduces SPICE to classical UCB. Empirical validation across bandit and control benchmarks reveals that SPICE achieves near-optimal decisions on unseen tasks, substantially reducing regret compared to existing ICRL and reinforcement learning approaches. Tests demonstrate rapid adaptation to new tasks and robustness under distribution shift, with significant improvements in offline selection quality and online cumulative regret compared to baseline methods.
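Schematically, and writing the bandit bound in its standard gap-dependent form (an assumption about the exact shape; constants and the precise warm-start term are as stated in the paper, with C_prior a placeholder name here), the guarantees read:

```latex
% Schematic guarantees (constants suppressed; C_prior denotes the constant
% warm-start term contributed by prior miscalibration).
% Stochastic bandits with suboptimality gaps \Delta_a:
R_T \le O\!\Big(\sum_{a:\,\Delta_a>0} \frac{\log T}{\Delta_a}\Big) + C_{\mathrm{prior}}
% Finite-horizon MDPs (horizon H, S states, A actions, K episodes):
R_K \le O\big(H\sqrt{SAK}\big) + C_{\mathrm{prior}}
```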
Bayesian Adaptation via Gradient-Free Context Fusion
SPICE, a novel Bayesian in-context reinforcement learning method, demonstrates effective adaptation to new environments without updating model parameters. The approach learns a value ensemble prior from potentially suboptimal data using temporal-difference learning and Bayesian shrinkage, then performs Bayesian context fusion at test time to guide action selection. This is achieved by attaching lightweight value heads to a Transformer network, enabling entirely gradient-free adaptation. Theoretical analysis confirms SPICE achieves optimal logarithmic regret in stochastic bandit problems and O(H√(SAK)) regret in finite-horizon Markov Decision Processes, even when initial training data is imperfect. Empirical validation across benchmark tasks reveals near-optimal performance on unseen tasks and substantial reductions in regret compared to existing in-context and reinforcement learning methods, alongside robustness to distribution shifts. The authors acknowledge limitations relating to kernel selection for state proximity estimation in non-stationary environments and the requirement for reasonably calibrated priors from the ensemble, suggesting future work could explore methods to mitigate these challenges.
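The architecture can be sketched as follows; shapes, layer counts and the RBF kernel are illustrative assumptions rather than the authors' exact design. A frozen Transformer encodes the in-context transitions, lightweight linear value heads read out an ensemble of Q-estimates, and a kernel over states weights how strongly each context transition informs the query state, with no gradient steps at test time.

```python
# Illustrative sketch (assumed shapes, not the paper's architecture): a frozen
# Transformer encodes context tokens, lightweight linear value heads read out
# Q-values, and an RBF kernel over states scores context relevance.
import torch
import torch.nn as nn

class ContextValueModel(nn.Module):
    def __init__(self, token_dim: int, n_actions: int, n_value_heads: int = 5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # "lightweight value heads": one linear readout per ensemble member
        self.value_heads = nn.ModuleList(
            [nn.Linear(token_dim, n_actions) for _ in range(n_value_heads)]
        )

    @torch.no_grad()  # adaptation is entirely gradient-free
    def q_prior(self, context_tokens: torch.Tensor):
        h = self.encoder(context_tokens)                     # (B, T, token_dim)
        qs = torch.stack([head(h[:, -1]) for head in self.value_heads])
        return qs.mean(0), qs.var(0)                          # mean/var at query token

def rbf_weights(query_state: torch.Tensor, context_states: torch.Tensor, bandwidth: float = 1.0):
    """Kernel over state proximity: how relevant each context transition is."""
    d2 = ((context_states - query_state) ** 2).sum(-1)
    return torch.softmax(-d2 / (2 * bandwidth ** 2), dim=0)

# Example: prior over 4 actions given 10 context tokens of dimension 16
model = ContextValueModel(token_dim=16, n_actions=4)
mu, var = model.q_prior(torch.randn(1, 10, 16))
```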
👉 More information
🗞 In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior
🧠 ArXiv: https://arxiv.org/abs/2601.03015
