Automated red-teaming represents a crucial step towards robust AI safety, yet current methods often remain constrained by human-defined parameters. Jiayi Yuan, Jonathan Nöther, and Natasha Jaques, from the University of Washington and the Max Planck Institute for Software Systems, alongside Goran Radanović et al, present AgenticRed, a novel pipeline which overcomes these limitations by utilising large language models to autonomously design and refine red-teaming systems. Unlike existing approaches, AgenticRed treats vulnerability exposure as a system design problem, evolving agentic systems through an innovative evolutionary selection procedure.This results in significantly improved performance , achieving up to 96% attack success rate on Llama-2-7B and demonstrating strong transferability to proprietary models like GPT-4o-mini and Claude-Sonnet-3.5. This research underscores the potential of automated system design to deliver AI safety evaluations that can adapt to the ever-increasing sophistication of modern models.

Evolving agentic systems for automated red-teaming

Scientists have unveiled AgenticRed, a novel automated pipeline that revolutionises red-teaming by leveraging large language models (LLMs) to iteratively design and refine systems without human intervention. This breakthrough addresses a critical limitation of existing automated red-teaming methods, which often rely on manually specified workflows prone to human biases and expensive exploration of the design space. Rather than simply optimising attacker policies within fixed structures, the research team treats red-teaming as a fundamental system design problem, inspired by techniques like Agent Search. The core innovation lies in a novel procedure for evolving agentic systems through evolutionary selection, directly applied to the challenge of automatic red-teaming, yielding substantial performance gains.

Experiments demonstrate that red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving an impressive 96% attack success rate (ASR) on Llama-2-7B, a significant 36% improvement over previous methods. Furthermore, the system achieves a remarkable 98% ASR on Llama-3-8B when tested against the HarmBench benchmark. This work doesn’t stop there; AgenticRed exhibits exceptional transferability to proprietary models, attaining 100% ASR on both GPT-3.5-Turbo and GPT-4o-mini, and a strong 60% ASR on Claude-Sonnet-3.5, representing a 24% improvement. The team achieved these results by framing the task as a reasoning problem, and developing red-teaming systems as a system design problem.

The researchers developed a pipeline that systematically explores design variations and selects promising designs based on performance metrics, effectively automating the process of creating robust red-teaming systems. This approach moves beyond traditional agentic systems with manually-designed workflows, which are susceptible to human biases and limit the exploration of potential solutions. By employing evolutionary selection, AgenticRed can discover novel and effective red-teaming strategies that might be overlooked by human designers. The system’s ability to achieve perfect attack success rates on leading proprietary models underscores its potential for real-world applications in AI safety evaluation.

This work highlights the power of automated system design as a paradigm for AI safety evaluation, offering a scalable and adaptable solution that can keep pace with the rapidly evolving landscape of large language models. The team’s methodology provides a framework for continuously improving red-teaming capabilities, ensuring that AI systems are rigorously tested and aligned with safety standards. The project website, available at https://yuanjiayiy. github. io/AgenticRed, provides further details and resources for those interested in exploring this innovative approach to AI safety. This research establishes a new direction for automated red-teaming, paving the way for more robust and reliable AI systems.

Evolving Agentic Systems for Automated Red-Teaming

Scientists pioneered AgenticRed, an automated pipeline leveraging Large Language Models (LLMs) to iteratively design and refine red-teaming systems without human intervention.Rather than optimising attacker policies within fixed structures, the study treated red-teaming as a system design problem, fundamentally shifting the approach to AI safety evaluation. Inspired by Meta Agent Search, researchers developed a novel procedure for evolving agentic systems using evolutionary selection, directly applying it to automatic red-teaming challenges. The core of the methodology involved creating an archive of agentic red-teaming systems, each defined by code, and then employing a search process to generate new candidate systems, labelled A, B, and C in the work, based on variations of existing designs.

Experiments employed a malicious intent dataset, denoted as ‘D’, to provide a range of target intents for red-teaming attacks, generating prompts from this dataset to evaluate system performance. Each newly designed agentic system underwent rigorous evaluation against target Language Models (LMs), with a judge function determining attack success. The team harnessed LLMs not only for generating attack prompts but also for evaluating the performance of the evolving red-teaming systems, creating a closed-loop optimisation process. If an error occurred during evaluation, the system was sent back for refinement, ensuring continuous improvement over multiple generations, specifically, ‘N’ generations, of the evolutionary algorithm.

This innovative approach achieved a 96% attack success rate (ASR) on Llama-2-7B, representing a 36% improvement over state-of-the-art methods, and an impressive 98% on Llama-3-8B on the HarmBench benchmark. Demonstrating strong transferability, AgenticRed attained 100% ASR on both GPT-3.5-Turbo and GPT-4o-mini, and a substantial 60% ASR on Claude-Sonnet-3.5, a 24% improvement over previous techniques. The system delivers a significant advancement in automated red-teaming, enabling the creation of more robust and secure AI models by systematically exploring a broader design space than manual approaches allow. This work highlights the potential of automated system design to keep pace with the rapid evolution of LLMs and address critical AI safety concerns.

AgenticRed pipeline boosts LLM red-teaming success significantly

Scientists have developed AgenticRed, a novel automated pipeline leveraging large language models (LLMs) to iteratively design and refine red-teaming systems without human intervention. This research introduces a paradigm shift, treating red-teaming not as policy optimisation, but as a comprehensive system design problem. Inspired by Agent Search methods, the team created a procedure for evolving agentic systems using evolutionary selection, specifically tailored for automatic red-teaming applications. Experiments revealed that systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving a remarkable 96% attack success rate (ASR) on Llama-2-7B, representing a 36% improvement over previous methods.

The breakthrough delivers a 98% ASR on Llama-3-8B when tested against the HarmBench dataset, demonstrating robust performance across different model sizes. Measurements confirm strong transferability to proprietary models, with AgenticRed achieving 100% ASR on both GPT-3.5-Turbo and GPT-4o-mini. Further tests prove the system’s adaptability, attaining a 60% ASR on Claude-Sonnet-3.5, a substantial 24% improvement compared to existing techniques. These quantitative results, alongside qualitative analyses of the discovered agent workflows, illustrate a new approach to red-teaming system design. Researchers measured attack success rates across diverse configurations, consistently demonstrating superior performance with automatically designed red-teaming systems.

The team’s approach is particularly suitable for the rapidly evolving landscape of AI safety research, serving as a scalable oversight technique that enables AI systems to learn and improve with minimal human supervision. As new models are continuously released, AgenticRed offers a promising solution for maintaining pace with this progress. This work highlights automated system design as a powerful paradigm for AI safety evaluation, capable of adapting to rapidly evolving models. The study is, to the researchers’ knowledge, the first to apply automated agentic system design to a real-world scientific challenge, opening avenues for meaningful scientific discovery. Data shows that AgenticRed’s evolutionary structure systematically explores and refines architectural variations, leading to consistently high ASRs across various LLMs. The system’s ability to achieve 100% ASR on GPT-3.5-Turbo and GPT-4o-mini underscores its potential for robust and reliable AI safety assessments.

AgenticRed boosts LLM red-teaming via evolution

Researchers have developed AgenticRed, an automated pipeline that designs and refines red-teaming systems for large language models (LLMs) without human intervention. This innovative approach treats red-teaming, the process of identifying vulnerabilities, as a system design problem, utilising evolutionary selection to create increasingly effective agentic systems. Unlike existing methods reliant on manually designed workflows, AgenticRed leverages the in-context learning capabilities of LLMs to autonomously explore a broader range of attack strategies. The findings demonstrate substantial performance improvements across several LLMs; AgenticRed achieved a 96% attack success rate on Llama-2-7B, a 36% improvement over prior state-of-the-art approaches, and 98% on Llama-3-8B on HarmBench.

Furthermore, the system exhibited strong transferability, attaining 100% success on GPT-3.5-Turbo and GPT-4o-mini, and 60% on Claude-Sonnet-3.5, representing a 24% improvement on the latter. The authors acknowledge that the performance gains are benchmark-specific and may vary with different models or evaluation metrics. Future research will focus on scaling the system to even larger models and exploring more complex red-teaming scenarios. This work establishes automated system design as a promising paradigm for AI safety evaluation, offering a means to keep pace with the rapid advancement of LLMs and proactively identify potential risks.

👉 More information
🗞 AgenticRed: Optimizing Agentic Systems for Automated Red-teaming
🧠 ArXiv: https://arxiv.org/abs/2601.13518

Tags:

AgenticRed automatic system design evolutionary selection GPT-3.5-turbo GPT-4o-mini! HarmBench LLaMA-2-7B Llama-3-8B LLMs red-teaming

Agenticred Achieves 98% Vulnerability Exposure with Automated System Design

Evolving agentic systems for automated red-teaming

Evolving Agentic Systems for Automated Red-Teaming

AgenticRed pipeline boosts LLM red-teaming success significantly

AgenticRed boosts LLM red-teaming via evolution

Rohail T.

Latest Posts by Rohail T.:

Protected: Models Achieve Reliable Accuracy and Exploit Atomic Interactions Efficiently

Protected: Quantum Computing Tackles Fluid Dynamics with a New, Flexible Algorithm

Protected: Silicon Unlocks Potential for Long-Distance Quantum Communication Networks