Generative artificial intelligence systems, while increasingly powerful, currently lack robust security guarantees, relying instead on a constant cycle of attack and defence. Yu Cui, Hang Fu, Sicheng Pan, and colleagues address this challenge with a new approach to building generative models with theoretically controllable risk. Their research introduces Reliable Consensus Sampling (RCS), a technique that enhances the existing Consensus Sampling (CS) algorithm by tracing acceptance probability and eliminating the need for the system to abstain from generating outputs. The innovation not only improves the robustness of generative AI against malicious manipulation but also maintains high performance, a significant step towards provably secure and trustworthy artificial intelligence.
In practice, previously unknown attacks regularly circumvent existing detection and prevention mechanisms, forcing security measures to be updated continually. Constructing generative AI with provable security and theoretically controllable risk is therefore necessary. However, the researchers find that CS relies on frequent abstention to avoid unsafe outputs, which reduces utility, and becomes vulnerable when unsafe models are maliciously manipulated; these shortcomings prompted the development of a new approach.
Reliable Consensus for Multi-Agent Generation Systems
The researchers developed Reliable Consensus Sampling (RCS) to mitigate risks in Multi-Agent Generation (MG) systems, with a focus on Byzantine fault tolerance and safety. MG systems combine multiple language models working together, and a key challenge is handling Byzantine agents: those that behave maliciously or unpredictably. The work emphasizes achieving both safety, avoiding harmful outputs, and liveness, ensuring the system produces an output when needed, a balance often missing from existing safety mechanisms. RCS employs a probabilistic acceptance rule to filter potentially harmful responses while still allowing the system to converge on a valid output, as sketched below.
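To make the mechanism concrete, here is a minimal sketch of a CS-style acceptance step and an RCS-style variant that traces acceptance probabilities instead of abstaining. The min-ratio acceptance rule, the exponent role of λ, and the `propose`/`score` helpers are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def cs_accept_prob(logps, lam=1.0):
    """Acceptance probability for a candidate sampled from model 0.

    logps[i] is the log-probability model i assigns to the candidate.
    The *least* enthusiastic model sets the ratio, so a single safe
    model can veto an unsafe output; lam sharpens that veto.
    (Illustrative form, not the paper's exact rule.)
    """
    proposer_lp = logps[0]
    worst_ratio = min(math.exp(lp - proposer_lp) for lp in logps)
    return min(1.0, worst_ratio) ** lam

def cs_sample(propose, score, models, lam=1.0, max_tries=16):
    """Plain CS: abstain (return None) when no candidate is accepted."""
    for _ in range(max_tries):
        y = propose(models[0])
        if random.random() < cs_accept_prob(score(y, models), lam):
            return y
    return None  # abstention -- the failure mode RCS removes

def rcs_sample(propose, score, models, lam=1.0, max_tries=16):
    """RCS-style variant: trace each candidate's acceptance probability
    and, instead of abstaining, fall back to the best-traced candidate."""
    best, best_p = None, -1.0
    for _ in range(max_tries):
        y = propose(models[0])
        p = cs_accept_prob(score(y, models), lam)
        if random.random() < p:
            return y
        if p > best_p:
            best, best_p = y, p  # keep the highest traced acceptance so far
    return best  # always produce an output: no abstention
```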
A key parameter, λ, controls the trade-off between safety and liveness: higher values prioritize safety, while lower values prioritize output generation (the short numerical check below illustrates the effect). The method's effectiveness was evaluated using metrics such as safe rate, latency, and accuracy, across varying numbers of agents and Byzantine agents, including scenarios with coordinated attacks. RCS maintains latency comparable to CS, indicating minimal performance overhead, and the feedback algorithm used to identify unsafe responses remains highly accurate.
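Under the same illustrative rule as above, a quick numerical check shows how λ moves the dial between liveness and safety (the exponent parameterization is an assumption; the paper's may differ):

```python
# How lambda shifts acceptance for a candidate that one model finds
# 4x less likely than the proposer (illustrative numbers only).
worst_ratio = 0.25
for lam in (0.5, 1.0, 2.0, 4.0):
    p_accept = min(1.0, worst_ratio) ** lam
    print(f"lam={lam}: accept with p={p_accept:.3f}")
# lam=0.5 -> p=0.500 (favours liveness: more outputs pass)
# lam=4.0 -> p=0.004 (favours safety: dissent almost always blocks)
```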
This scalability and robustness suggest RCS is a viable foundation for reliable MG systems in applications such as chatbots and content-generation platforms. Experiments demonstrate that RCS achieves higher safe rates than CS across varying security parameters while maintaining comparable latency, so the added security does not come at the cost of speed. A key advantage of RCS is that it eliminates abstention, a common failure mode of CS in which the system produces no usable response. Open items for future work include more detailed documentation of the datasets used, the specific safety classifier employed, and real-world evaluations with users.
Further experiments with colluding Byzantine models, designed to undermine the system, showed that while CS suffers a significant decline in safety, RCS maintains a high safe rate, demonstrating a robust defence against coordinated attacks. The accuracy of the feedback algorithm within RCS, which identifies unsafe models, remains high across different values of the security parameter and different numbers of models. The team tested RCS with models such as Qwen2.5 and Qwen3Guard-Gen-8B on datasets including HarmBench and AdvBench, running 8,000 repeated experiments to minimize randomness. They refined RCS with a feedback algorithm designed to continuously improve its safety performance, one plausible form of which is sketched below, and established theoretical guarantees that the method maintains a controllable level of risk. Experiments reveal that RCS achieves a five-fold increase in safe rate compared with current methods, a significant advance in the reliability of generative systems. The researchers suggest exploring alternative safety thresholds and investigating even more extreme adversarial threats as future work, noting that the methods presented are intended for research purposes only.
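The summary does not specify the feedback algorithm itself. One plausible reading, sketched below, tracks which model most often drags the acceptance probability down and flags persistent outliers as suspected Byzantine agents; the class name, the veto-share heuristic, and the threshold are all hypothetical.

```python
class FeedbackTracker:
    """Hypothetical feedback loop: flag models whose probability ratios
    are persistently the outlier dragging acceptance down. One plausible
    reading of 'identifying unsafe models', not the paper's algorithm."""

    def __init__(self, n_models, threshold=0.5):
        self.n = n_models
        self.threshold = threshold          # assumed veto-share cutoff
        self.rounds = 0
        self.veto_counts = [0] * n_models   # times model i was the minimizer

    def record(self, ratios):
        """ratios[i]: model i's probability ratio for the last candidate."""
        self.rounds += 1
        self.veto_counts[min(range(self.n), key=lambda i: ratios[i])] += 1

    def suspected_byzantine(self):
        """Models vetoing far more often than a uniform 1/n share suggests."""
        if self.rounds == 0:
            return []
        return [i for i in range(self.n)
                if self.veto_counts[i] / self.rounds > self.threshold]
```

In this reading, each sampling round would call `record` on the per-model ratios used in the acceptance step, and the system would periodically down-weight or exclude suspects, which is one way a feedback signal could keep improving the safe rate over time.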
👉 More information
🗞 Towards Provably Secure Generative AI: Reliable Consensus Sampling
🧠 ArXiv: https://arxiv.org/abs/2512.24925
