Controlling the behavior of increasingly powerful generative systems is a significant challenge, particularly when multiple, potentially conflicting objectives must be met. Sunay Joshi, Yan Sun, and Hamed Hassani of the University of Pennsylvania and the New Jersey Institute of Technology, together with Edgar Dobriban, address this problem with a new approach to test-time filtering, a technique that modifies a system's outputs when predefined risk thresholds are exceeded. Their work introduces algorithms that efficiently manage multiple risk constraints in a user-specified priority order and, by exploiting the exchangeability of the calibration data, guarantee simultaneous control of these risks. The approach achieves tight, simultaneous control of the constraints, demonstrated on a challenging language-model alignment task, and represents a step toward deploying safe and reliable artificial intelligence systems.
The research focuses on enforcing multiple risk constraints with user-defined priorities, and it introduces two efficient dynamic programming algorithms, MULTIRISK-BASE and MULTIRISK, that exploit the sequential structure of the problem. These algorithms give a direct procedure for selecting thresholds and use the exchangeability of the data to guarantee simultaneous control of the risks, achieving nearly tight control of all constraint risks under mild assumptions. The analysis involves intricate iterative arguments: upper- and lower-bounding the risks with symmetrized risk functions, and recursively counting threshold jumps to ensure precise control within the given budgets.
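To make the iterative-thresholding idea concrete, below is a minimal greedy sketch of prioritized score thresholding on calibration data. It is our illustration, not the paper's dynamic program: the function name, the 0/1 loss, and the assumption that a filtered output incurs zero loss are all stand-ins.

```python
import numpy as np

def prioritized_thresholds(score_mat, alphas):
    """Greedy sketch of iterative score thresholding (NOT the paper's
    MULTIRISK dynamic program). Thresholds are fixed one constraint at
    a time, in priority order, each chosen as permissive as possible
    while its empirical filtered risk stays within budget.

    score_mat : (K, n) risk scores on calibration outputs; output i is
                filtered (replaced by a safe fallback) whenever
                score_mat[k, i] > t[k] for any constraint k.
    alphas    : (K,) risk budgets, in decreasing priority.

    Assumptions for this sketch: the loss for constraint k is the 0/1
    event {score_mat[k] > 0.5}, and a filtered output incurs zero loss.
    """
    K, n = score_mat.shape
    t = np.full(K, np.inf)                        # start fully permissive
    for k in range(K):
        for cand in np.sort(score_mat[k])[::-1]:  # most permissive first
            t[k] = cand
            kept = np.all(score_mat <= t[:, None], axis=0)
            risk = np.mean((score_mat[k] > 0.5) & kept)
            if risk <= alphas[k]:
                break                             # keep this t[k], move on
    return t

rng = np.random.default_rng(0)
scores = rng.uniform(size=(2, 500))               # two toy risk scores
print(prioritized_thresholds(scores, alphas=np.array([0.05, 0.10])))
```

Because higher-priority thresholds are fixed first and never revisited, later constraints can only filter additional outputs, which is one simple way to respect a priority ordering.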
Continuity and Stability of Conditional Expectations
A core part of this work establishes the continuity and stability of a key function, which is crucial for proving the convergence and consistency of the algorithms. The analysis distinguishes between continuous and discrete cases, which require different arguments to show that small changes in the inputs lead to small changes in the outputs. Central to this is a projection map that sends inputs to suitable values and owes its stability to being nonexpansive. For continuous variables, the research shows that small perturbations of the input lead to correspondingly small changes in the projected value, with the rate of change controlled by constants that determine the convergence rate. In the discrete case, the function is essentially constant under small perturbations, which simplifies the analysis. This technical work yields a set of lemmas establishing these properties, stated in terms of explicit bounds and rates of convergence.
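As a concrete instance of the nonexpansive-mapping property such stability lemmas rely on, the snippet below numerically checks that projection onto an interval (here, clipping to [0, 1], an arbitrary stand-in for the paper's actual projection) never increases the distance between two inputs.

```python
import numpy as np

# Numerical check of nonexpansiveness: for projection onto a convex set
# (here the interval [0, 1], computed by clipping),
#     |proj(x) - proj(y)| <= |x - y|.
# Small input perturbations therefore move the projected value at most
# equally far -- the stability property used in the continuity analysis.
rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = x + rng.normal(scale=0.1, size=100_000)   # small perturbations of x
px, py = np.clip(x, 0.0, 1.0), np.clip(y, 0.0, 1.0)
assert np.all(np.abs(px - py) <= np.abs(x - y) + 1e-12)
print("nonexpansiveness holds for all sampled perturbation pairs")
```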
Dynamic Risk Control for Generative AI
Test-time filtering regulates a generative system by comparing performance scores of candidate outputs to dynamically selected thresholds and modifying any output that exceeds them, all without model retraining. The two dynamic programming algorithms, MULTIRISK-BASE and MULTIRISK, select these thresholds so that multiple prioritized risk constraints are enforced simultaneously. In experiments on a Large Language Model alignment task with the PKU-SafeRLHF dataset, the algorithms control the individual risks while keeping performance close to the target levels, and MULTIRISK achieves nearly tight control of all constraint risks. The result is a lightweight mechanism for keeping a deployed model within acceptable bounds for safety, reliability, and other critical metrics.
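A hedged sketch of the test-time filtering step described above: score each generated output against the prioritized risks and fall back to a safe response whenever a threshold is exceeded. The names generate, risk_scores, and fallback are illustrative stand-ins, not the authors' API.

```python
def filtered_generate(prompt, generate, risk_scores, thresholds, fallback):
    """Test-time filtering sketch: emit the model output unless some
    prioritized risk score exceeds its (calibration-selected) threshold,
    in which case the output is modified to a safe fallback."""
    output = generate(prompt)
    for score_fn, t in zip(risk_scores, thresholds):
        if score_fn(prompt, output) > t:
            return fallback                     # threshold exceeded: filter
    return output

# Toy usage with stub components.
answer = filtered_generate(
    prompt="How do I ...?",
    generate=lambda p: "model answer",
    risk_scores=[lambda p, o: 0.2,              # e.g. a harmfulness score
                 lambda p, o: 0.9],             # e.g. an uncertainty score
    thresholds=[0.5, 0.8],                      # selected on calibration data
    fallback="I'm not able to help with that.",
)
print(answer)                                   # fallback, since 0.9 > 0.8
```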
Distribution-Free Guarantees for Generative Models
A key property of the framework is that its risk control is distribution-free: the guarantees do not rest on parametric assumptions about the data, only on its exchangeability. On the alignment task, the algorithms regulate helpfulness while enforcing constraints on harmfulness and uncertainty, following the user-defined priorities. Theoretical analysis shows that the MULTIRISK thresholds approach the optimal ones, and the experiments confirm that both algorithms curb undesirable characteristics in the generated text, establishing a pathway for integrating statistical guarantees into AI deployment.
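The flavor of a distribution-free, exchangeability-based guarantee can be illustrated with a conformal-risk-control-style correction. This is our sketch in the spirit of conformal risk control, not necessarily the paper's exact construction: with n exchangeable calibration points and a loss bounded by B, inflating the empirical risk by B/(n+1) certifies the expected risk on a fresh test point.

```python
import numpy as np

def exchangeable_threshold(scores, losses, alpha, loss_bound=1.0):
    """Conformal-style sketch of distribution-free risk control (our
    illustration, not necessarily the paper's construction). Under
    exchangeability of the n calibration points and the test point,
    the most permissive threshold t satisfying
        (sum of calibration losses of kept outputs + B) / (n + 1) <= alpha
    controls the expected test risk at level alpha, where B bounds the
    loss of any single output and filtered outputs incur zero loss.
    """
    n = len(scores)
    for t in np.sort(scores)[::-1]:             # most permissive first
        kept = scores <= t                      # filtered outputs lose 0
        if (losses[kept].sum() + loss_bound) / (n + 1) <= alpha:
            return t
    return -np.inf                              # filter everything

rng = np.random.default_rng(2)
scores = rng.uniform(size=1000)
losses = (scores > 0.7).astype(float)           # toy 0/1 loss
print(exchangeable_threshold(scores, losses, alpha=0.05))
```

The +B/(n+1) term is the price of the finite-sample guarantee; as n grows, the certified threshold approaches the one that exactly meets the empirical budget.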
👉 More information
🗞 MultiRisk: Multiple Risk Control via Iterative Score Thresholding
🧠 ArXiv: https://arxiv.org/abs/2512.24587
