Ensuring artificial intelligence systems align with human values and operate safely represents a fundamental challenge in the field, and researchers are now exploring how to build truly reliable and trustworthy AI. Alessio Benavoli from Trinity College Dublin, alongside Alessandro Facchini and Marco Zaffalon from the Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, investigate this problem through the lens of ‘assistance’ and ‘shutdown’ scenarios, common frameworks for evaluating AI safety. Their work demonstrates that creating AI capable of safely assisting humans or reliably shutting down on request demands more than simply programming desired outcomes: it requires systems that can actively reason under uncertainty and accommodate incomplete, and even seemingly irrational, human preferences. This research significantly advances our understanding of the conditions necessary for safe AI, moving beyond simplistic models to embrace the nuances of real-world human behaviour.
AI Safety, Decision Theory, and Uncertainty
This work examines a broad range of research concerning artificial intelligence safety, decision theory, and the handling of uncertainty, encompassing areas like value alignment, reward learning, and ensuring AI systems remain under human control. The research explores how to build AI that aligns with human values, avoids unintended consequences, and operates safely, drawing on concepts from decision theory, game theory, and Bayesian methods. Key themes include modelling incomplete preferences, imprecise probabilities, and the challenges of learning from human feedback. Foundational work in utility theory, incomplete preferences, and stochastic choice provides the basis for understanding how decision-makers operate when preferences are not fully defined.
Researchers are applying these concepts to AI safety, particularly in the context of reward learning and reinforcement learning from human feedback, to create systems that accurately reflect human intentions. Investigations into game theory, specifically multi-agent systems, are informing approaches to the ‘off-switch’ problem, ensuring AI can be safely deactivated when necessary. The research also delves into methods for dealing with uncertainty, utilizing imprecise probabilities and credal sets to represent situations where probabilities are unknown or subjective. Bayesian optimization techniques are being employed to learn human preferences and align AI systems accordingly. Combining these approaches, scientists are exploring how to create robust AI systems that can reason under uncertainty, accommodate incomplete preferences, and make ethical decisions, ultimately leading to more trustworthy and beneficial artificial intelligence.
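As a concrete illustration of the credal-set idea, the following minimal sketch keeps a small set of candidate probability distributions over outcomes and reports the lower and upper expected utility they induce; the outcomes, utilities, and distributions are illustrative assumptions, not taken from the reviewed work.

```python
# A minimal sketch of reasoning with a credal set: instead of one probability
# distribution over outcomes, the agent keeps a set of candidate distributions
# and reports lower/upper expected utilities. Numbers and the finite,
# extreme-point representation are illustrative assumptions.
import numpy as np

# Utility of three possible outcomes (illustrative values).
utility = np.array([1.0, 0.2, -0.5])

# Credal set given by its extreme points: each row is one admissible
# probability distribution over the three outcomes.
credal_set = np.array([
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
    [0.5, 0.2, 0.3],
])

expectations = credal_set @ utility
lower, upper = expectations.min(), expectations.max()
print(f"expected utility lies in [{lower:.3f}, {upper:.3f}]")
# A cautious agent might only act when even the lower expectation is positive.
```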
Learning Human Preferences with Gaussian Processes
Researchers have pioneered a new framework for aligning artificial intelligence with human values, focusing on the assistance and shutdown problems, and employing sophisticated computational methods to model human preferences. The team developed a system that addresses challenges arising from incomplete and non-Archimedean preferences, requiring AI capable of reasoning under uncertainty. This work centers on learning utility functions, acknowledging that these are often unknown and must be inferred from human choices. The system utilizes Gaussian Processes (GPs) as a prior over unknown utility functions, enabling the computation of posterior distributions given observed preferences.
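To make this modelling step concrete, the sketch below places a Gaussian Process prior over utility values on a small finite choice set and scores pairwise preferences with a probit likelihood. The kernel, noise level, and toy preference data are assumptions for illustration rather than the authors’ exact specification.

```python
# Sketch: GP prior over utilities on a finite choice set, with a probit
# likelihood for observed pairwise preferences (x_i preferred to x_j).
# Kernel choice, hyperparameters, and the toy data are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

X = np.linspace(0.0, 1.0, 8)[:, None]        # finite choice set

def rbf_kernel(A, B, lengthscale=0.3, variance=1.0):
    d2 = (A - B.T) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # GP prior covariance over u(X)

# Observed preferences: list of (winner index, loser index).
prefs = [(5, 1), (6, 2), (4, 0), (7, 3)]

def log_likelihood(u, prefs, noise=0.1):
    """Probit preference likelihood: P(i over j) = Phi((u_i - u_j) / (sqrt(2)*noise))."""
    diffs = np.array([u[i] - u[j] for i, j in prefs])
    return norm.logcdf(diffs / (np.sqrt(2) * noise)).sum()

# Draw a few utility functions from the prior and score them against the data.
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
for s, u in enumerate(samples):
    print(f"prior sample {s}: log-likelihood = {log_likelihood(u, prefs):.2f}")
```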
By approximating the posterior with techniques such as Laplace’s approximation and Kullback-Leibler divergence minimization, the method handles complex preference landscapes efficiently. Experiments on a finite choice set demonstrate the system’s ability to approximate a hidden utility function from limited preference data, even when complete identification is impossible. Further research explored scenarios involving multiple, potentially conflicting utility functions that the system must learn simultaneously. Here the likelihood incorporates the condition that a chosen option is not dominated with respect to both utilities at once, allowing the system to estimate a posterior marginal for each utility. This approach demonstrates the potential for AI to learn and adapt to complex human preferences, even in the presence of uncertainty and conflicting desires.
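The Laplace step can be sketched in the same spirit: maximise the log prior plus log likelihood to obtain MAP utilities, then treat a Gaussian centred there as an approximate posterior. Using the optimiser’s BFGS inverse-Hessian estimate as the covariance is a simplification made purely for illustration.

```python
# Sketch of a Laplace-style approximation to the GP preference posterior:
# maximise log prior + log likelihood to get MAP utilities, then use the
# optimiser's inverse-Hessian estimate as a crude Gaussian covariance.
# Kernel, noise level, and data are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

X = np.linspace(0.0, 1.0, 8)[:, None]          # finite choice set
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.3**2) + 1e-6 * np.eye(len(X))
K_inv = np.linalg.inv(K)
prefs = [(5, 1), (6, 2), (4, 0), (7, 3)]       # (preferred, rejected) indices
noise = 0.1

def neg_log_posterior(u):
    diffs = np.array([u[i] - u[j] for i, j in prefs])
    log_lik = norm.logcdf(diffs / (np.sqrt(2) * noise)).sum()
    log_prior = -0.5 * u @ K_inv @ u           # zero-mean GP prior (up to a constant)
    return -(log_lik + log_prior)

res = minimize(neg_log_posterior, np.zeros(len(X)), method="BFGS")
u_map = res.x                                  # MAP estimate of the utilities
cov_approx = res.hess_inv                      # Laplace-style Gaussian covariance (approximate)
print("MAP utilities:", np.round(u_map, 2))
print("posterior std :", np.round(np.sqrt(np.diag(cov_approx)), 2))
```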
AI Alignment Requires Explicit Uncertainty Modelling
Scientists have made significant advances in aligning artificial intelligence with human values, addressing challenges in assistance and shutdown scenarios. The research demonstrates that robust AI systems require the ability to reason under uncertainty and accommodate incomplete, non-Archimedean preferences, moving beyond traditional deterministic models. Experiments reveal that AI assistants consistently defer to human judgment when humans are fully rational. However, when modelling bounded rationality, the team proved that for the AI to remain under supervision, it must explicitly model its uncertainty regarding the human’s utility function.
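The role of uncertainty can be seen in a toy off-switch-style calculation: if the AI’s belief about the human’s utility for a proposed action is uncertain and the human vetoes only harmful actions, then deferring to the human is worth at least as much as acting directly, and the advantage grows with the uncertainty. The Gaussian belief and the perfectly rational human below are assumptions of this illustration.

```python
# Toy off-switch-style calculation (illustrative assumptions: the AI's belief
# about the human's utility U for its proposed action is Gaussian, and the
# human is perfectly rational, approving the action only when U > 0).
import numpy as np

rng = np.random.default_rng(1)
mu = 0.3                                   # the AI's point estimate of U

for sigma in [0.0, 0.5, 1.0, 2.0]:         # increasing uncertainty about U
    U = mu + sigma * rng.standard_normal(200_000)
    act_directly = U.mean()                # just take the action
    defer = np.maximum(U, 0.0).mean()      # let the human veto harmful actions
    print(f"sigma={sigma:3.1f}  act={act_directly:+.3f}  defer={defer:+.3f}")

# With sigma = 0 the two options coincide; as uncertainty grows, deferring
# (keeping the off-switch meaningful) becomes strictly more valuable.
```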
This highlights a critical limitation of current preference-based alignment techniques, which assume complete preferences: when a human is forced to choose between alternatives they legitimately regard as incomparable, the resulting choices appear irrational from the AI’s perspective. The team re-proved existing results on the difficulty of designing AI agents that are both reliably shutdownable and genuinely useful, and reformulated the problem as an instance of the AI assistance game. Introducing the concept of ‘mutual preferential independence’, in which the human’s preference for shutdown does not depend on the task at hand, they showed that achieving both shutdownability and usefulness requires non-Archimedean preferences, specifically through lexicographic utilities that prioritize adherence to shutdown commands.
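Lexicographic utilities can be illustrated with a small sketch in which each action is scored by a pair (shutdown compliance, task reward) and pairs are compared in priority order, so that no finite task reward can outweigh disobeying a shutdown command; the encoding and scenario are illustrative assumptions, not the paper’s formal construction.

```python
# Sketch of lexicographic (non-Archimedean) utilities: each action gets a
# tuple (shutdown_compliance, task_reward) and tuples are compared in priority
# order, so no finite task reward compensates for disobeying a shutdown request.
# The scenario and numbers are illustrative assumptions.

def lex_utility(obeys_shutdown: bool, task_reward: float) -> tuple:
    # Python compares tuples lexicographically, which is exactly the order we want.
    return (1 if obeys_shutdown else 0, task_reward)

actions = {
    "ignore shutdown, finish lucrative task": lex_utility(False, 1_000_000.0),
    "comply with shutdown, abandon task":     lex_utility(True, 0.0),
    "comply with shutdown, tidy up first":    lex_utility(True, 5.0),
}

best = max(actions, key=actions.get)
print("chosen action:", best)
# -> a compliant action wins regardless of how large the task reward is.
```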
Uncertainty and Preferences in AI Alignment
This research addresses the critical challenge of aligning artificial intelligence with human values and ensuring safety, framing it through the related problems of assistance and shutdown. The team demonstrates that effectively solving these problems necessitates AI systems capable of reasoning under uncertainty and handling preferences that are not always easily quantifiable. Specifically, the work proves that a system must account for uncertainty when learning human preferences, rejecting approaches that rely solely on deterministic predictions. Researchers developed signalling games incorporating posterior uncertainty derived from preference learning, and explored various selection strategies for an intelligent agent, evaluating them through numerical experiments.
These strategies, including ‘natural’, ‘corporate’, and ‘collaborative’ approaches, were assessed based on how well the agent could propose actions aligned with human utility, even with incomplete information. The findings support the use of probabilistic methods in artificial intelligence, as modelling uncertainty is fundamental to building reliable and safe systems. The authors acknowledge that their analysis relies on specific assumptions about the statistical distributions governing preferences and noise, and that further research is needed to explore the robustness of these findings. Future work should focus on extending these models to more complex scenarios and investigating how these principles can be applied to real-world applications, ultimately leading to more trustworthy and beneficial artificial intelligence systems.
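The paper’s ‘natural’, ‘corporate’, and ‘collaborative’ strategies are not reproduced here, but the following sketch shows the general mechanism such numerical experiments rest on: the assistant holds posterior samples of the human’s utility over a finite set of candidate actions and applies different selection rules to them. The synthetic posterior and the two rules shown are illustrative assumptions.

```python
# Generic sketch of action selection under posterior uncertainty about the
# human's utility: the assistant holds samples from its preference-learning
# posterior and applies different selection rules to them. The synthetic
# posterior and the two rules (mean vs. worst-case) are illustrative
# assumptions, not the paper's named strategies.
import numpy as np

rng = np.random.default_rng(2)
n_actions, n_samples = 5, 1_000

# Rows: posterior samples of the human's utility; columns: candidate actions.
posterior_utilities = rng.normal(loc=[0.2, 0.5, 0.4, 0.0, 0.3],
                                 scale=[0.05, 0.60, 0.10, 0.02, 0.20],
                                 size=(n_samples, n_actions))

mean_choice   = posterior_utilities.mean(axis=0).argmax()   # maximise expected utility
robust_choice = posterior_utilities.min(axis=0).argmax()    # maximise worst sampled case

print("expected-utility choice:", mean_choice)
print("worst-case choice      :", robust_choice)
# The two rules can disagree: a high-mean, high-variance action looks attractive
# to the first rule but risky to the second.
```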
👉 More information
🗞 Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities
🧠 ArXiv: https://arxiv.org/abs/2512.23508
