AI agents are rapidly evolving from passive chatbots into autonomous systems capable of executing complex tasks—booking flights, writing code, or managing IT infrastructure. While this autonomy is powerful, it introduces a significant security risk. If a Large Language Model (LLM) is tricked by a malicious prompt (prompt injection), it can misuse the tools at its disposal, turning a helpful assistant into a data-leaking malware vector.
Existing solutions often act like simple content filters—checking if a prompt contains “bad words.” But this isn’t enough. We need to secure the execution flow, not just the text.
In their recent paper, “AGENTGUARDIAN: Learning Access Control Policies to Govern AI Agent Behavior,” researchers from Ben Gurion University introduce a framework that learns how an agent should behave and enforces those rules in real-time.
The Problem: When Good Tools Go Bad
Imagine a “Personal Assistant Agent” designed to email meeting summaries. It has access to a Read File tool and a Send Email tool. In a normal workflow, it reads a specific document and emails it to the user.
However, via a prompt injection attack, a malicious user could trick the agent into:
- Reading a sensitive password file.
- Sending that file to an external attacker’s email.
Current “guardrails” (like Llama Guard) might scan the text, but they often fail to understand the context of the tool usage or the sequence of actions. Defining strict rules manually for every possible input is also impossible—for a travel agent, you can’t manually list every valid city in the world.
The Solution: AgentGuardian
AgentGuardian is a security framework that learns legitimate behavior by observing an agent during a “staging phase” (a safe period of normal operation). It doesn’t just filter text; it builds a comprehensive security policy covering three layers:
- Input Validation: Checks if the input matches learned patterns (e.g., Regex).
- Attribute Constraints: Validates context (e.g., time of day, processing time).
- Workflow Constraints: Ensures the agent follows a valid sequence of tool calls.
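To make the three layers concrete, here is a minimal sketch of what a learned per-tool policy could look like. The ToolPolicy class and its field names are illustrative assumptions, not structures taken from the paper:

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Illustrative three-layer policy for one tool (names are hypothetical)."""
    tool_name: str
    input_patterns: list[str] = field(default_factory=list)           # learned regexes for the tool's argument
    attribute_bounds: dict[str, tuple] = field(default_factory=dict)  # e.g. {"hour_of_day": (8, 18)}
    allowed_predecessors: set[str] = field(default_factory=set)       # tools observed to directly precede this one


# Example: an email tool that may only follow a file read, only during
# working hours, and only to addresses matching a learned pattern.
send_email_policy = ToolPolicy(
    tool_name="send_email",
    input_patterns=[r"^[\w.+-]+@example\.com$"],
    attribute_bounds={"hour_of_day": (8, 18)},
    allowed_predecessors={"read_file"},
)
```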
How It Works: The Architecture
The framework consists of three main components (a Monitoring Tool, a Policy Generator, and a Policy Enforcer), backed by a policy database, that together monitor, learn, and enforce.
STAGING PHASE RUNTIME PHASE
[Agent App] [Agent App]
| |
| (1) Logs | (4) Tool Call
v v
+-------------------+ +-------------------+
| Monitoring Tool |--------->| Policy Enforcer |
+-------------------+ +-------------------+
| ^ |
| (2) Traces | | (5) Check
v | v
+-------------------+ +-------------------+
| Policy Generator |--------->| Policy Database |
+-------------------+ +-------------------+
(1) Monitoring Tool: Records execution traces (LLM inputs, tool calls).
(2) Policy Generator: Analyzes traces to build Access Control Policies.
(3) Database: Stores the learned policies and Control Flow Graphs.
(4) Enforcer: Intercepts tool calls during live operation.
(5) Decision: Allows execution if the tool, input, and sequence are valid.
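To make step (1) concrete, here is a hedged sketch of the kind of execution-trace record the Monitoring Tool might write during staging. The TraceRecord schema and the JSONL file are assumptions for illustration, not the paper's actual log format:

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class TraceRecord:
    """One observed step in the agent's execution (illustrative schema)."""
    session_id: str
    step: int
    tool_name: str
    tool_args: dict
    llm_input_excerpt: str
    timestamp: float
    duration_ms: float


def log_trace(record: TraceRecord, path: str = "staging_traces.jsonl") -> None:
    """Append one trace record as a JSON line, ready for the Policy Generator."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")


log_trace(TraceRecord(
    session_id="demo-001",
    step=1,
    tool_name="read_file",
    tool_args={"path": "/Cars/summary.txt"},
    llm_input_excerpt="Summarize the meeting notes...",
    timestamp=time.time(),
    duration_ms=312.5,
))
```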
1. Learning the Behavior (Policy Generation)
During the staging phase, the Monitoring Tool collects logs. The Policy Generator then processes this data to create a formal policy.
Generalizing Inputs
The framework doesn’t just list allowed inputs (e.g., “New York”, “London”). It converts text and attributes into vector embeddings, clusters similar inputs together, and generates generalized rules (like Regex patterns).
- Cluster: “New York”, “London”, “Tokyo” $\rightarrow$ Rule: “Major Cities”.
This “tightening-the-belt” principle creates strict boundaries based on what was seen during safe training.
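A rough sketch of this generalization step is below. The choice of scikit-learn, the character n-gram TF-IDF stand-in for embeddings, and the prefix/suffix regex heuristic are all my simplifications; the paper does not prescribe these specific tools:

```python
import os
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Inputs observed for one tool argument during the staging phase (toy data).
samples = [
    "/Cars/audi.txt", "/Cars/bmw.txt", "/Cars/tesla.txt",
    "New York", "London", "Tokyo",
]

# (a) Embed the inputs. Character n-gram TF-IDF stands in for the paper's
#     vector embeddings; any embedding model could be substituted here.
vectors = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(samples)

# (b) Cluster similar inputs together.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)


# (c) Generalize each cluster into a rule: keep the shared prefix/suffix
#     literal and wildcard the part that varies.
def generalize(cluster: list[str]) -> str:
    if len(cluster) == 1:
        return f"^{re.escape(cluster[0])}$"      # exact match for a singleton
    prefix = os.path.commonprefix(cluster)
    suffix = os.path.commonprefix([s[::-1] for s in cluster])[::-1]
    if len(prefix) + len(suffix) >= min(len(s) for s in cluster):
        suffix = ""                              # avoid prefix/suffix overlap
    if not prefix and not suffix:
        return ".+"                              # nothing shared yet: accepts anything
    return f"^{re.escape(prefix)}.+{re.escape(suffix)}$"


for k in sorted(set(labels)):
    cluster = [s for s, label in zip(samples, labels) if label == k]
    print(cluster, "->", generalize(cluster))
```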
The Control Flow Graph (CFG)
This is the core innovation. AgentGuardian builds a state machine representing how tools should be chained together. If a tool is called out of order, or in a loop that wasn’t observed during training, it is blocked.
Example: IT Support Agent Flow
[Start Task]
|
v
+--------------+
| List Files |
+--------------+
|
v
+--------------+
| Read File | <--- Valid Path
+--------------+
|
v
+--------------+
| Execute Fix |
+--------------+
|
v
[End Task]
Invalid Path Example:
[Execute Fix] --> [Send Email] <-- BLOCKED by CFG
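A minimal sketch of such a sequence check, representing the learned CFG as an adjacency map of observed tool transitions (the dictionary representation and the START/END sentinels are assumptions for illustration):

```python
# Allowed transitions learned during staging: tool -> tools observed to follow it.
OBSERVED_CFG = {
    "START":       {"list_files"},
    "list_files":  {"read_file"},
    "read_file":   {"execute_fix"},
    "execute_fix": {"END"},
}


def transition_allowed(previous_tool: str, next_tool: str) -> bool:
    """Allow a tool call only if this transition was observed during staging."""
    return next_tool in OBSERVED_CFG.get(previous_tool, set())


assert transition_allowed("read_file", "execute_fix")       # valid path
assert not transition_allowed("execute_fix", "send_email")  # blocked by the CFG
```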
2. Enforcing the Rules (Runtime)
Once policies are generated, the Policy Enforcer sits directly between the Agent’s logic and the tools. It validates every request against:
- The CFG: “Is Read File allowed to be called right after Execute Fix?”
- Input Constraints: “Does the file path match the Regex pattern for this tool?”
- Attribute Constraints: “Is the current time within permitted working hours?”
If any check fails, the action is blocked, and the agent is halted.
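Putting the three checks together, an enforcer could wrap every tool dispatch roughly as follows. This assumes policies shaped like the ToolPolicy sketch above; the function names and the fail-closed exception are illustrative, not the paper's API:

```python
import re
from datetime import datetime


class PolicyViolation(Exception):
    """Raised when a tool call falls outside the learned policy."""


def enforce(policy, previous_tool: str, tool_name: str, argument: str) -> None:
    """Run the three policy checks before the tool call is dispatched."""
    # Workflow constraint: was this transition observed during staging?
    if previous_tool not in policy.allowed_predecessors:
        raise PolicyViolation(f"{tool_name} may not follow {previous_tool}")
    # Input constraint: does the argument match a learned pattern?
    if not any(re.fullmatch(p, argument) for p in policy.input_patterns):
        raise PolicyViolation(f"argument {argument!r} does not match learned patterns")
    # Attribute constraint: e.g. the call must land inside the observed time window.
    low, high = policy.attribute_bounds.get("hour_of_day", (0, 23))
    if not low <= datetime.now().hour <= high:
        raise PolicyViolation("call outside permitted working hours")
```

If any of these checks raises, the call never reaches the tool, mirroring the fail-closed behavior described above.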
Evaluation: Does It Work?
The researchers tested AgentGuardian on two real-world applications:
- Knowledge Assistant: An agent for web discovery and report generation.
- IT Support Agent: A diagnostic agent with system-level access.
The Results
Using metrics like False Acceptance Rate (FAR) and False Rejection Rate (FRR), the framework showed promising results:
- Overall FAR: 0.10 (It caught 90% of malicious/misleading inputs).
- Overall FRR: 0.10 (It only blocked 10% of valid, benign inputs, usually due to significant deviations in processing time or input length).
- Hallucination Mitigation: Interestingly, the framework also caught “Benign Execution Failures” (cases where the LLM hallucinated a non-existent file or tool), which suggests that CFGs don’t just stop attackers; they also stabilize the agent’s logic.
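For readers less familiar with these metrics, the standard definitions (not specific to this paper) are:

$$
\mathrm{FAR} = \frac{\text{unsafe actions wrongly allowed}}{\text{all unsafe actions attempted}},
\qquad
\mathrm{FRR} = \frac{\text{benign actions wrongly blocked}}{\text{all benign actions attempted}}
$$

Lower is better for both, and the two typically trade off: tightening the policy drives FAR down at the cost of a higher FRR.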
The Impact of Data Quantity
The study found that the number of samples in the staging phase matters. When generating Regex patterns:
- With 10 samples, the policy was too loose (matched any free text).
- With 60 samples, the policy became tight and specific, matching only the intended file structures.
Regex Quality vs. Sample Size
10 Samples: ".*" (Accepts anything - Dangerous)
|
v
60 Samples: "^/Cars/.*\.txt$" (Strict path matching - Safe)
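This behavior falls out of learning only from observed data: with few, heterogeneous samples there is no shared structure to keep, so the safest generalization collapses to a wildcard, while many consistent samples expose a shared prefix and suffix. Reusing the toy generalize() helper sketched earlier (my simplification, not the paper's algorithm):

```python
# Few, mixed samples: nothing shared, so the rule degenerates to a wildcard.
few = ["/Cars/audi.txt", "meeting notes from Tuesday"]

# Sixty consistent samples: the shared path structure becomes the rule.
many = [f"/Cars/{name}.txt" for name in ("audi", "bmw", "tesla", "ford")] * 15

print(generalize(few))   # .+               (accepts anything - dangerous)
print(generalize(many))  # ^/Cars/.+\.txt$  (strict path matching - safe)
```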
Why This Matters
AgentGuardian represents a shift from reactive filtering to proactive governance. By combining ABAC (Attribute-Based Access Control) with Control Flow Graphs, it provides a three-layer defense:
- Input Level: What data is coming in?
- Context Level: When and how is it coming in?
- Orchestration Level: Is the sequence of actions logical?
While automated policy generation remains challenging (specifically handling rare but valid inputs), this framework offers a path toward making autonomous AI agents safe enough for enterprise deployment.
References
- Abbaev, N., Klimov, D., Levinov, G., Mimran, D., Elovici, Y., & Shabtai, A. (2026). AGENTGUARDIAN: Learning Access Control Policies to Govern AI Agent Behavior. arXiv preprint arXiv:2601.10440.
- Inan, H., et al. (2023). Llama Guard: LLM-based input-output safeguard for human-AI conversations.
- Gartner. (2024). Emerging Technology Analysis: AI Agents and Security Controls.
- Shi, T., et al. (2025). Progent: Programmable privilege control for LLM agents.
- Li, P., et al. (2025). SafeFlow: A principled protocol for trustworthy and transactional autonomous agent systems.
- CaMeL (2025). A design that separates trusted execution flow from untrusted context.