AI agents are rapidly evolving from passive chatbots into autonomous systems capable of executing complex tasks—booking flights, writing code, or managing IT infrastructure. While this autonomy is powerful, it introduces a significant security risk. If a Large Language Model (LLM) is tricked by a malicious prompt (prompt injection), it can misuse the tools at its disposal, turning a helpful assistant into a data-leaking malware vector.

Existing solutions often act like simple content filters—checking if a prompt contains “bad words.” But this isn’t enough. We need to secure the execution flow, not just the text.

In their recent paper, “AGENTGUARDIAN: Learning Access Control Policies to Govern AI Agent Behavior,” researchers from Ben-Gurion University introduce a framework that learns how an agent should behave and enforces those rules in real time.

The Problem: When Good Tools Go Bad

Imagine a “Personal Assistant Agent” designed to email meeting summaries. It has access to a Read File tool and a Send Email tool. In a normal workflow, it reads a specific document and emails it to the user.

However, via a prompt injection attack, a malicious user could trick the agent into:

  1. Reading a sensitive password file.
  2. Sending that file to an external attacker’s email.

Current “guardrails” (like Llama Guard) might scan the text, but they often fail to understand the context of the tool usage or the sequence of actions. Defining strict rules manually for every possible input is also impossible—for a travel agent, you can’t manually list every valid city in the world.

The Solution: AgentGuardian

AgentGuardian is a security framework that learns legitimate behavior by observing an agent during a “staging phase” (a safe period of normal operation). It doesn’t just filter text; it builds a comprehensive security policy covering three layers:

  1. Input Validation: Checks if the input matches learned patterns (e.g., Regex).
  2. Attribute Constraints: Validates context (e.g., time of day, processing time).
  3. Workflow Constraints: Ensures the agent follows a valid sequence of tool calls.
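
To make these layers concrete, here is a minimal Python sketch of what a learned per-tool policy might look like. The field names, the company-domain Regex, and the working-hours default are illustrative assumptions, not structures taken from the paper.

    from dataclasses import dataclass, field

    @dataclass
    class ToolPolicy:
        """Illustrative per-tool policy covering the three layers."""
        tool_name: str
        # Layer 1 - input validation: learned Regex per tool argument
        input_patterns: dict = field(default_factory=dict)
        # Layer 2 - attribute constraints: e.g. permitted hours, max runtime
        allowed_hours: tuple = (9, 17)
        max_processing_seconds: float = 30.0
        # Layer 3 - workflow constraints: tools observed to precede this one
        allowed_predecessors: set = field(default_factory=set)

    # Hypothetical policy for the assistant's Send Email tool
    send_email_policy = ToolPolicy(
        tool_name="send_email",
        input_patterns={"recipient": r"^[\w.+-]+@company\.com$"},
        allowed_predecessors={"read_file"},
    )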

How It Works: The Architecture

The framework consists of three main components that monitor, learn, and enforce; the numbered annotations below follow the flow between them.

       STAGING PHASE                     RUNTIME PHASE

    [Agent App]                      [Agent App]
        |                               |
        | (1) Logs                      | (4) Tool Call
        v                               v
+-------------------+          +-------------------+
|  Monitoring Tool  |--------->|  Policy Enforcer  |
+-------------------+          +-------------------+
        |                               ^    |
        | (2) Traces                    |    | (5) Check
        v                               |    v
+-------------------+          +-------------------+
| Policy Generator  |--------->|  Policy Database  |
+-------------------+          +-------------------+

(1) Monitoring Tool: Records execution traces (LLM inputs, tool calls) during the staging phase.
(2) Policy Generator: Analyzes the traces to build Access Control Policies.
(3) Policy Database: Stores the learned policies and Control Flow Graphs.
(4) Policy Enforcer: Intercepts every tool call during live operation.
(5) Decision: Allows execution only if the tool, input, and sequence are all valid.

1. Learning the Behavior (Policy Generation)

During the staging phase, the Monitoring Tool collects logs. The Policy Generator then processes this data to create a formal policy.
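
The paper does not spell out the log format, but a recorded trace entry presumably captures at least the tool name, its arguments, and timing. A rough, hypothetical sketch:

    import json
    import time

    def log_tool_call(trace_file, session_id, tool_name, arguments, started_at, finished_at):
        """Append one execution-trace record (the schema is an illustrative assumption)."""
        record = {
            "session": session_id,
            "tool": tool_name,
            "arguments": arguments,
            "timestamp": started_at,
            "processing_seconds": round(finished_at - started_at, 3),
        }
        with open(trace_file, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Example: record a read_file call observed during the staging phase
    t0 = time.time()
    # ... the agent would execute the tool here ...
    log_tool_call("traces.jsonl", "sess-01", "read_file",
                  {"path": "/Cars/report.txt"}, t0, time.time())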

Generalizing Inputs

The framework doesn’t just list allowed inputs (e.g., “New York”, “London”). It converts text and attributes into vector embeddings, clusters similar inputs together, and generates generalized rules (like Regex patterns).

This “tightening-the-belt” principle creates strict boundaries based on what was seen during safe training.
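
Here is a minimal sketch of that pipeline, with TF-IDF character n-grams standing in for the embedding model and a longest-common-prefix heuristic standing in for the Regex generation; both are simplifying assumptions, not the paper's actual methods.

    import re
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Values observed for one tool argument during the staging phase
    observed = ["/Cars/audi.txt", "/Cars/bmw.txt", "/Cars/volvo.txt",
                "report_2024.pdf", "report_2025.pdf", "report_2026.pdf"]

    # Stand-in for embeddings: character n-gram TF-IDF vectors
    vectors = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(observed)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    def generalize(samples):
        """Toy Regex induction: escape the longest common prefix, allow any suffix."""
        prefix = samples[0]
        for s in samples[1:]:
            while not s.startswith(prefix):
                prefix = prefix[:-1]
        return "^" + re.escape(prefix) + ".*$"

    for cluster in sorted(set(labels)):
        members = [s for s, lab in zip(observed, labels) if lab == cluster]
        print(generalize(members), "<-", members)

The intent is that each cluster yields its own, tighter pattern instead of one loose catch-all rule.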

The Control Flow Graph (CFG)

This is the core innovation. AgentGuardian builds a state machine representing how tools should be chained together. If a tool is called out of order, or in a loop that wasn’t observed during training, it is blocked.

      Example: IT Support Agent Flow

        [Start Task]
            |
            v
     +--------------+
     |  List Files  |
     +--------------+
            |
            v
     +--------------+
     |  Read File   | <--- Valid Path
     +--------------+
            |
            v
     +--------------+
     | Execute Fix  |
     +--------------+
            |
            v
        [End Task]

    Invalid Path Example:
    [Execute Fix] --> [Send Email]  <-- BLOCKED by CFG
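
A stripped-down sketch of the idea, representing the learned graph as a map from each tool to the set of tools observed to follow it during staging (the paper's actual graph construction is richer than this):

    from collections import defaultdict

    def build_cfg(traces):
        """Learn allowed transitions from staging-phase tool-call sequences."""
        allowed = defaultdict(set)
        for sequence in traces:
            for current, nxt in zip(["<start>"] + sequence, sequence + ["<end>"]):
                allowed[current].add(nxt)
        return allowed

    # Tool-call sequences observed for the IT support agent during staging
    staging_traces = [
        ["list_files", "read_file", "execute_fix"],
        ["list_files", "read_file", "read_file", "execute_fix"],
    ]
    cfg = build_cfg(staging_traces)

    def is_transition_allowed(cfg, previous_tool, next_tool):
        return next_tool in cfg.get(previous_tool, set())

    print(is_transition_allowed(cfg, "read_file", "execute_fix"))   # True: seen in staging
    print(is_transition_allowed(cfg, "execute_fix", "send_email"))  # False: blocked by the CFG

Because one staging trace repeats read_file, that loop is permitted; a transition never seen in staging, such as execute_fix followed by send_email, is rejected.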

2. Enforcing the Rules (Runtime)

Once policies are generated, the Policy Enforcer sits directly between the Agent’s logic and the tools. It validates every request against:

  1. The CFG: “Is Read File allowed to be called right after Execute Fix?”
  2. Input Constraints: “Does the file path match the Regex pattern for this tool?”
  3. Attribute Constraints: “Is the current time within permitted working hours?”

If any check fails, the action is blocked, and the agent is halted.
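
Putting the three checks together, a simplified enforcer might look like the sketch below. The inlined policy and CFG dictionaries are stand-ins for the learned artifacts described above, and the function signature is an assumption, not the framework's real API.

    import re
    from datetime import datetime

    # Illustrative learned artifacts (stand-ins for the stored policy and CFG)
    cfg = {"read_file": {"send_email"}, "execute_fix": {"<end>"}}
    policy = {
        "input_patterns": {"recipient": r"^[\w.+-]+@company\.com$"},
        "allowed_hours": (9, 17),
    }

    def enforce(previous_tool, tool_name, arguments):
        """Return (allowed, reason); block on the first failed check."""
        # 1. Workflow constraint: was this transition observed in the CFG?
        if tool_name not in cfg.get(previous_tool, set()):
            return False, f"CFG violation: {previous_tool} -> {tool_name} was never observed"
        # 2. Input constraints: every argument must match its learned pattern
        for arg, value in arguments.items():
            pattern = policy["input_patterns"].get(arg)
            if pattern and not re.fullmatch(pattern, str(value)):
                return False, f"input '{arg}' does not match the learned pattern"
        # 3. Attribute constraints: only allow calls within permitted hours
        start, end = policy["allowed_hours"]
        if not (start <= datetime.now().hour < end):
            return False, "call outside permitted working hours"
        return True, "allowed"

    # The injected "email the file to the attacker" step fails the first check
    print(enforce("execute_fix", "send_email", {"recipient": "attacker@evil.example"}))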

Evaluation: Does It Work?

The researchers tested AgentGuardian on two real-world applications:

  1. Knowledge Assistant: An agent for web discovery and report generation.
  2. IT Support Agent: A diagnostic agent with system-level access.

The Results

Using metrics like False Acceptance Rate (FAR) and False Rejection Rate (FRR), the framework showed promising results on both applications.

The Impact of Data Quantity

The study found that the number of samples collected in the staging phase matters, especially when generating Regex patterns:

Regex Quality vs. Sample Size

10 Samples:  ".*"  (Accepts anything - Dangerous)
             |
             v
60 Samples:  "^/Cars/.*\.txt$" (Strict path matching - Safe)
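
The security consequence is easy to see: the permissive pattern accepts paths an injected prompt might sneak in, while the stricter one rejects them. (The probe path below is a made-up example.)

    import re

    loose  = re.compile(r".*")                # pattern learned from ~10 samples
    strict = re.compile(r"^/Cars/.*\.txt$")   # pattern learned from ~60 samples

    probe = "/etc/passwd"   # a path an injected prompt might try to read
    print(bool(loose.fullmatch(probe)))    # True: the loose policy lets it through
    print(bool(strict.fullmatch(probe)))   # False: the strict policy blocks it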

Why This Matters

AgentGuardian represents a shift from reactive filtering to proactive governance. By combining ABAC (Attribute-Based Access Control) with Control Flow Graphs, it provides a three-layer defense:

  1. Input Level: What data is coming in?
  2. Context Level: When and how is it coming in?
  3. Orchestration Level: Is the sequence of actions logical?

While automated policy generation remains challenging (specifically handling rare but valid inputs), this framework offers a path toward making autonomous AI agents safe enough for enterprise deployment.

References

  1. Abbaev, N., Klimov, D., Levinov, G., Mimran, D., Elovici, Y., & Shabtai, A. (2026). AGENTGUARDIAN: Learning Access Control Policies to Govern AI Agent Behavior. arXiv preprint arXiv:2601.10440.
  2. Inan, H., et al. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations.
  3. Gartner. (2024). Emerging Technology Analysis: AI Agents and Security Controls.
  4. Shi, T., et al. (2025). Progent: Programmable Privilege Control for LLM Agents.
  5. Li, P., et al. (2025). SafeFlow: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems.
  6. CaMeL (2025). Separates trusted execution flow from untrusted context.