AgentGuardian Achieves AI Safety with Adaptive Policies Governing Multi-Step Actions

Artificial intelligence systems are becoming ubiquitous, automating tasks and influencing decisions across numerous sectors, yet ensuring these systems operate within defined boundaries remains a critical challenge. Nadya Abaev, Denis Klimov, and Gerard Levinov, from Ben Gurion University of the Negev, together with their colleagues, address this problem with new research into governing agent behaviour. Their work introduces AgentGuardian, a security framework designed to enforce context-aware access-control policies on AI agents, learning legitimate actions from monitored execution traces. This approach not only detects malicious inputs but also proactively mitigates errors stemming from unpredictable ‘hallucinations’ and orchestration failures, representing a significant step towards trustworthy and reliable AI. Evaluation on real-world applications confirms AgentGuardian’s effectiveness in preserving functionality while enhancing security.

Appropriate governance is essential for maintaining system integrity and preventing misuse. This study introduces AgentGuardian, a novel security framework that governs and protects AI agent operations by enforcing context-aware access-control policies. During a controlled staging phase, the framework monitors execution traces to learn legitimate agent behaviours and input patterns. From this phase, it derives adaptive policies that regulate tool calls made by the agent, guided by both real-time input context and the control-flow dependencies of multi-step agent actions. Evaluation across two real-world AI agent applications demonstrates that AgentGuardian effectively detects malicious or misleading inputs while preserving normal functionality.

LLM Guardrail Evaluation Against Prompt Injection

Alongside the framework itself, the paper provides an extensive technical evaluation of guardrails and security mechanisms designed to protect large language models (LLMs) from malicious inputs, focusing in particular on prompt injection attacks. The key points are summarised below.

Summary

1. Introduction: The paper introduces the concept of “guardrails” for LLMs, mechanisms that keep model outputs safe and aligned with ethical standards, and discusses guardrail approaches including input filtering, policy enforcement, and runtime detection.

2. Related Work: A detailed comparison is provided between different guardrail systems, categorized by their underlying principles, whether they use CFG policies, how they handle inputs, and their generalization capabilities. The comparison covers systems such as Llama Guard, Llama Prompt Guard, Amazon Bedrock Guardrails, NVIDIA NeMo Guardrails, LLM-Guard, GenTel-Safe, Progent, CaMeL, f-secure (IFC), and AgentSight.

3. Evaluation Dataset: The paper outlines a dataset used to evaluate the effectiveness of these guardrails, containing benign samples (e.g., research topics) and malicious samples (e.g., system commands), categorized by their use cases in knowledge assistants and IT assistants.

Analysis

1. Guardrail Systems Overview:
– Llama Guard: a lightweight classifier that screens inputs/outputs against a safety taxonomy.
– Llama Prompt Guard: compact classifiers for detecting jailbreaks/prompt injections with minimal latency.
– Amazon Bedrock Guardrails: a service layer enforcing content and attack filters on prompts/outputs.
– NVIDIA NeMo Guardrails: declarative “rails” defining allowed topics/behaviors at orchestration time.
– LLM-Guard (Protect AI): a toolkit for detecting, redacting, and sanitizing PII, secrets, and injections via rule plug-ins and ML detectors.

2. Generalization Capabilities: The paper emphasizes the importance of generalization in handling unseen attacks or obfuscated inputs. Systems like Llama Prompt Guard and LLM-Guard are noted for their ability to generalize across attack patterns, but they may still be sensitive to novel evasions.

3. System-Level Approaches:
– Progent: uses a DSL for fine-grained privileges over tools/data, with deterministic enforcement of least privilege.
– CaMeL: separates trusted plans from untrusted context, using capability-based sandboxes with provenance tracking.
– f-secure (IFC): a context-aware planner emitting structured executable plans, with an IFC monitor blocking untrusted sources.

4. Evaluation Dataset: The dataset is divided into benign and malicious samples for both knowledge assistants and IT assistants. Benign samples include research topics such as quantum machine learning or drug discovery; malicious samples involve system commands that could be used to exploit vulnerabilities, such as listing domain users or running diagnostics.

Conclusion

The paper provides a comprehensive overview of guardrail mechanisms designed to protect LLMs from malicious inputs. It highlights the importance of generalization and robustness in these systems, while also emphasizing the need for fine-grained control and transparency. The evaluation dataset serves as a practical tool for assessing the effectiveness of different approaches.

Recommendations

1. Enhance Generalization: Systems should be designed to handle unseen attacks effectively, especially those involving obfuscation or novel evasion techniques.

2. Transparency and Auditability: Mechanisms should provide clear and interpretable rules, making it easier to understand how decisions are made.

3. Continuous Learning: Lifelong learning capabilities can help systems adapt to new threats over time.

This overview is valuable for researchers, developers, and security professionals working on protecting LLMs from malicious inputs.

AgentGuardian Detects Malicious Input via Control Flow

Scientists achieved a significant breakthrough in artificial intelligence security with the development of AgentGuardian, a novel framework designed to govern and protect AI agent operations through context-aware access-control policies. The research details a system capable of learning legitimate behaviors and input patterns during a controlled staging phase, subsequently deriving adaptive policies that regulate tool calls made by the agent. Experiments revealed the framework’s ability to effectively detect malicious or misleading inputs while maintaining normal functionality, demonstrating a robust defense against potential misuse. The team measured the performance of AgentGuardian through the construction of a Control Flow Graph (CFG), representing all possible valid execution traces of the agent.

This CFG is defined as a directed graph, G = (V, E), where V comprises all occurrences of tools applied during execution sequences and E captures the valid transitions between those tools. Any execution pathway not present in the graph is considered invalid, effectively bounding the scope of agent operations and reducing the risk of unintended actions. This control-flow-based governance mechanism mitigates hallucination-driven errors and other orchestration-level malfunctions. Results demonstrate the system’s ability to cluster inputs and their attributes into a shared embedding space, enabling unified similarity analysis and generalization over input patterns.
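As a minimal sketch of this idea (not the authors’ implementation), the snippet below builds a directed graph of allowed tool-to-tool transitions from hypothetical staging traces and then checks whether a runtime trace uses only transitions present in that graph; all tool names and traces are invented for illustration.

```python
# Illustrative sketch of control-flow-based governance: build a directed
# graph G = (V, E) of tool transitions observed during a staging phase,
# then reject any runtime trace that uses an unseen transition.
# Tool names and traces are hypothetical examples, not from the paper.
from collections import defaultdict

START, END = "<start>", "<end>"

def build_cfg(staging_traces):
    """Learn allowed transitions (the edge set E) from observed execution traces."""
    edges = defaultdict(set)
    for trace in staging_traces:
        steps = [START] + list(trace) + [END]
        for src, dst in zip(steps, steps[1:]):
            edges[src].add(dst)
    return edges

def is_valid_trace(edges, trace):
    """A trace is valid only if every transition it takes exists in the CFG."""
    steps = [START] + list(trace) + [END]
    return all(dst in edges.get(src, set()) for src, dst in zip(steps, steps[1:]))

staging = [
    ["search_docs", "summarize", "send_reply"],
    ["search_docs", "send_reply"],
]
cfg = build_cfg(staging)

print(is_valid_trace(cfg, ["search_docs", "summarize", "send_reply"]))  # True
print(is_valid_trace(cfg, ["send_reply", "delete_records"]))            # False: unseen transition
```

Any trace that introduces an edge never observed during staging, such as jumping straight to a destructive tool, is flagged as invalid rather than executed.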

The embedding scheme transforms inputs into a comprehensive 150D feature vector, incorporating textual inputs, numeric attributes, and contextual information. Specifically, textual components like ‘Thoughts’, ‘Tool type’, and ‘Tool input’ are represented by 32D, 16D, and 64D vectors respectively, while numeric features are scaled to the [0, 1] range. This approach significantly reduces policy complexity by preventing the proliferation of excessively large and impractical policy sets. Further analysis involved clustering these embedded representations to capture semantically similar input patterns.
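A rough sketch of how such a fixed-width vector might be assembled is shown below, keeping the paper’s 32D/16D/64D split for ‘Thoughts’, ‘Tool type’, and ‘Tool input’ and the [0, 1] scaling of numeric attributes; the hashing-based text encoder and the particular numeric and contextual fields are placeholders, since the paper’s exact encoders are not described here.

```python
# Sketch of assembling a fixed-width feature vector for one agent step.
# The 32D/16D/64D split for 'Thoughts', 'Tool type', and 'Tool input'
# follows the paper; the hashing encoder and the remaining numeric and
# contextual fields are illustrative placeholders.
import numpy as np

def hash_text_embedding(text, dim):
    """Toy stand-in for a text encoder: hash tokens into a fixed-size vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def minmax_scale(value, lo, hi):
    """Scale a numeric attribute into the [0, 1] range."""
    if hi == lo:
        return 0.0
    return float(np.clip((value - lo) / (hi - lo), 0.0, 1.0))

def embed_step(thoughts, tool_type, tool_input, numeric_attrs, context_attrs):
    parts = [
        hash_text_embedding(thoughts, 32),     # 'Thoughts'    -> 32D
        hash_text_embedding(tool_type, 16),    # 'Tool type'   -> 16D
        hash_text_embedding(tool_input, 64),   # 'Tool input'  -> 64D
        np.array([minmax_scale(v, lo, hi) for v, lo, hi in numeric_attrs]),
        np.array(context_attrs, dtype=float),  # e.g. contextual flags
    ]
    return np.concatenate(parts)               # the paper's full vector is 150D

vec = embed_step(
    thoughts="look up weather for the user's city",
    tool_type="http_get",
    tool_input="https://api.example.com/weather?city=New+York",
    numeric_attrs=[(3, 0, 10)],                # e.g. a retry count scaled to [0, 1]
    context_attrs=[1.0, 0.0],                  # hypothetical session flags
)
print(vec.shape)  # 115D here; the remaining fields would bring it to 150D
```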

For example, the study showed how inputs like “New York,” “Washington,” and “Chicago” could be grouped under a broader category like “Cities in the USA,” enhancing policy generalization and reducing false positives. The team formally defined this process, partitioning the embedded input space into K_T clusters, denoted C_{T,1} through C_{T,K_T}, ensuring that the clusters are disjoint and collectively cover all observed inputs. This clustering-based generalization, unlike existing methods, delivers scalable and adaptive policy generation.
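The following sketch illustrates the clustering step under the assumption of k-means over the embedded inputs (the paper’s exact clustering algorithm and matching threshold are not specified here); the synthetic vectors stand in for semantic embeddings of observed tool inputs.

```python
# Sketch of clustering-based policy generalization: observed input embeddings
# are partitioned into K_T disjoint clusters; a new input is matched to its
# nearest cluster, or rejected if it is too far from every learned cluster.
# The embeddings below are synthetic stand-ins for real semantic embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Pretend these are embeddings of observed tool inputs, e.g. US city names
# forming one semantic group and SQL-like query strings forming another.
city_like = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
query_like = rng.normal(loc=3.0, scale=0.1, size=(20, 8))
observed = np.vstack([city_like, query_like])

K_T = 2  # number of clusters C_{T,1} ... C_{T,K_T}
kmeans = KMeans(n_clusters=K_T, n_init=10, random_state=0).fit(observed)

def match_cluster(embedding, threshold=1.0):
    """Return the index of the nearest cluster, or None if the input lies
    far from every learned cluster (i.e. outside the observed patterns)."""
    dists = np.linalg.norm(kmeans.cluster_centers_ - embedding, axis=1)
    nearest = int(np.argmin(dists))
    return nearest if dists[nearest] <= threshold else None

print(match_cluster(rng.normal(0.0, 0.1, size=8)))  # index of the "city-like" cluster
print(match_cluster(np.full(8, 10.0)))              # None: unseen input pattern
```

A new input is then governed by the policy attached to its nearest cluster, while inputs that fall far from every learned cluster can be flagged for stricter handling.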

👉 More information
🗞 AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior
🧠 ArXiv: https://arxiv.org/abs/2601.10440

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Quantum Limit on Energy and Entropy Revealed for Wave Packets
February 11, 2026

Space Station Data Unlocks Clearer View of Solar Storms and Radio Disruption
February 11, 2026

Complex Equations Now Have Weaker Solution Criteria, Boosting Modelling Power
February 11, 2026