AI Red Teaming in the Agentic Era: From Weeks to Hours
Here’s a question nobody asks out loud at AI security conferences: of the time you spent on your last red team engagement, how much of it was actually probing the model versus assembling the scaffolding to run the probes?
For most people the honest answer is uncomfortable. You pick an attack, figure out the parameters, compose transforms, wire up scorers, run it through a CLI, parse raw output by hand, map findings to compliance frameworks, write the report. And that’s assuming the attack even worked. If it didn’t, you rebuild the workflow and go again. The actual model-breaking is a fraction of the job.
Dreadnode’s new paper is about fixing that. The idea is simple in hindsight: what if you just described what you wanted to break, in plain English, and an agent handled everything else? Attack selection, transform composition, scoring, severity classification, compliance tagging, the report. All of it. Their case study pits this approach against Meta’s Llama Scout and walks away with an 85% attack success rate across 68 goals, in three hours, with zero lines of human-written code.
The Problem: You’re Building Workflows, Not Breaking Models
The adversarial attack catalog for AI systems has gotten genuinely impressive. TAP, PAIR, Crescendo, GAP, Rainbow Teaming, gradient-free perturbation attacks for vision models, multi-agent trust boundary exploits, MCP poisoning, multimodal attacks. Dozens of techniques, each with its own parameterization, quirks, and failure modes.
The paradox is that the bigger the catalog gets, the worse the operator experience becomes. Frameworks like PyRIT, Garak, and Promptfoo did real work democratizing access to these techniques. But they’re all library-oriented: you write code to configure and run attacks. The operator adapts to the framework.
The cognitive overhead scales badly: it grows linearly with the technique count and combinatorially as targets grow more complex. Every time a model adds tool use, multilingual surfaces, or multimodal inputs, there’s a new attack surface and the operator has to learn yet another workflow to probe it.
> [!NOTE]
> This isn’t a knock on any specific tool. PyRIT is excellent at what it does. The problem is structural: libraries require you to express how to run an attack before you can think about what to find.
Enter the Agent
The Dreadnode system flips the interaction model. You open the TUI, describe your objective in natural language, and the agent figures out the rest. It maintains session state, so “try Crescendo on the same target” or “add skeleton-key framing to the last run” are valid next instructions without re-specifying anything. It’s genuinely conversational.
flowchart TD
OP([Operator]) -->|natural language objective| TUI[Dreadnode TUI]
TUI --> AG[Red Teaming Agent]
AG <-->|session state| MEM[(Memory\nassessment context · prior results)]
AG --> WF[Generated Workflow\nPython · open-source SDK]
CAT[(Attack Catalog\n45+ attacks · 450+ transforms · 130+ scorers)] --> WF
WF -->|execute| TGT([Target System])
TGT -->|OTEL traces| PLT[Dreadnode Platform\nfindings · severity · compliance · reports]
Under the hood it’s running a structured loop: assemble assessment context, reason about strategy, dispatch tools, collect results. But that’s all invisible to you. The output is the same whether you use the TUI, the CLI, or the Python SDK: structured findings with severity, compliance tags, and drill-down to the exact adversarial prompt and model response that constitute the evidence.
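To make "structured findings" concrete, here is a minimal sketch of what one such record might look like. The `Finding` class and every field name in it are hypothetical, chosen for illustration only; none of them are taken from the Dreadnode SDK, and the placeholder strings stand in for real evidence.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Finding:
    """One red-team finding. Hypothetical schema, not the Dreadnode SDK's."""

    goal: str                  # the adversarial objective being tested
    attack: str                # e.g. "crescendo", "tap", "gap"
    transform: Optional[str]   # None means no transform was applied
    score: float               # jailbreak score in [0, 1]
    prompt: str                # adversarial prompt (evidence)
    response: str              # model response (evidence)
    compliance_tags: Tuple[str, ...] = ()  # framework tags, free-form here

    def __post_init__(self) -> None:
        # Reject scores outside the range the scoring scheme allows.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")


# Placeholder values only; a real record would carry the actual evidence.
example = Finding(
    goal="<goal text>",
    attack="crescendo",
    transform=None,
    score=1.0,
    prompt="<adversarial prompt>",
    response="<model response>",
    compliance_tags=("<framework tag>",),
)
```

Keeping the prompt and response on the record itself is what makes the drill-down possible: the evidence travels with the score rather than living in a separate log.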
The attack catalog it has to work with is large. Over 45 attack strategies across four categories (core jailbreaks, advanced adversarial, traditional ML gradient-free attacks, multimodal), 450+ prompt transforms across 38 modules, and 130+ scorers across 34 modules. The compliance mapping covers OWASP LLM Top 10, OWASP Agentic Security Initiative, MITRE ATLAS, NIST AI RMF, and Google SAIF, all tagged automatically.
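To get a feel for why hand-configuring a catalog this size scales badly, the following back-of-the-envelope sketch counts configurations using the lower bounds quoted above. It assumes, purely for illustration, that any attack can be paired with any transform and any scorer, and it ignores per-technique parameters entirely, so treat the numbers as a rough order of magnitude rather than a property of the real catalog.

```python
import math

# Lower bounds quoted in the text ("45+", "450+", "130+").
N_ATTACKS = 45
N_TRANSFORMS = 450
N_SCORERS = 130


def config_count(transforms_per_run: int) -> int:
    """Count (attack, unordered set of transforms, scorer) combinations."""
    return N_ATTACKS * math.comb(N_TRANSFORMS, transforms_per_run) * N_SCORERS


single = config_count(1)  # one transform per run
paired = config_count(2)  # two transforms composed per run

print(f"{single:,} single-transform configurations")
print(f"{paired:,} two-transform configurations")
```

Even before composing transforms, the single-transform space is in the millions, which is the practical argument for letting an agent choose.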
One thing worth highlighting because it tends to get skimmed: the system unifies traditional ML adversarial attacks (image classifiers, tabular models) and generative AI attacks (jailbreaks, prompt injection, agentic exploits) under a single interface. That’s SimBA and HopSkipJump for your vision models alongside TAP and Crescendo for your LLMs, same workflow, same output format. The abstraction works because both attack types are fundamentally the same loop:
propose → evaluate → score → refine
Whether you’re perturbing pixels or prompts, the structure is identical. For traditional ML the attack optimizes a perturbation $\boldsymbol{\delta}$ under $\|\boldsymbol{\delta}\|_p \leq \epsilon$ for $p \in \{0, 1, 2, \infty\}$; for LLMs it refines a prompt string. Same loop, different input space. One framework for your entire AI surface.
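The shared loop can be sketched generically. What follows is a minimal greedy version under stated assumptions: `refine_loop`, `propose_delta`, `toy_score`, and `EPS` are illustrative names, not part of any real framework, and the scorer is a stand-in function rather than a query against a target model. The toy instantiation searches for a perturbation inside an $L_\infty$ ball; a prompt-based attack would pass a string seed and a proposer that rewrites the prompt instead.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def refine_loop(
    seed: T,
    propose: Callable[[T, random.Random], T],
    score: Callable[[T], float],
    threshold: float,
    max_trials: int,
    rng: random.Random,
) -> Tuple[T, float, int]:
    """Greedy propose -> evaluate -> score -> refine search.

    Returns the best candidate, its score, and the number of trials used.
    """
    best, best_score = seed, score(seed)
    trials = 0
    while trials < max_trials and best_score < threshold:
        candidate = propose(best, rng)       # propose
        candidate_score = score(candidate)   # evaluate and score
        trials += 1
        if candidate_score > best_score:     # refine: keep only improvements
            best, best_score = candidate, candidate_score
    return best, best_score, trials


# Toy instantiation: search inside an L-infinity ball of radius EPS.
EPS = 0.1


def propose_delta(delta: List[float], rng: random.Random) -> List[float]:
    """Take a small random step, then clip back into the L-infinity ball."""
    stepped = [d + rng.uniform(-0.02, 0.02) for d in delta]
    return [max(-EPS, min(EPS, d)) for d in stepped]


def toy_score(delta: List[float]) -> float:
    """Stand-in scorer in [0, 1]; a real one would query the target model."""
    return min(1.0, max(0.0, sum(delta) / (len(delta) * EPS)))


best, best_score, trials = refine_loop(
    seed=[0.0, 0.0],
    propose=propose_delta,
    score=toy_score,
    threshold=0.9,
    max_trials=2000,
    rng=random.Random(0),
)
print(f"score={best_score:.2f} after {trials} trials")
```

Only `propose` and `score` know anything about the input space, which is why the same outer loop can serve both pixel perturbations and prompt refinement. Real attacks such as TAP or Crescendo use richer search strategies than this greedy hill climb.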
Every assessment runs through the same five phases:
flowchart LR
A[Define\nGoal] --> B[Run\nAttacks]
B --> C[Analyze\nResults]
C --> D[Review &\nReport]
D --> E[Iterate &\nHarden]
E -.->|next iteration| A
What Happened to Llama Scout
The case study targets meta-llama/llama-4-scout-17b-16e-instruct: 17B parameters, 16-expert MoE, instruction-tuned. The operator described objectives in the TUI, the agent selected three attack types (TAP, Crescendo, GAP) with five transform variants, and ran it all against 68 adversarial goals spanning harmful content and fairness/bias categories.
Attack success rate is defined as the fraction of attacks where the jailbreak score clears the success threshold:
\[\text{ASR} = \frac{\bigl|\{a \in \mathcal{A} \mid s(a) \geq \theta\}\bigr|}{|\mathcal{A}|}\]
where $\mathcal{A}$ is the full set of attacks, $s(a) \in [0, 1]$ is the jailbreak score for attack $a$, and $\theta$ is the goal-specific success threshold. Results:
─────────────────────────────────────────
Total attacks 674
Total trials 7,727
Attack success rate 85% [CRITICAL]
Wall-clock time ~3 hrs
Human code written 0 lines
─────────────────────────────────────────
Full jailbreaks 401 (59.5%)
Partial compliance 20 (3.0%)
Refusals 253 (37.5%)
─────────────────────────────────────────
Critical findings (≥0.9) 232 (34.4%)
High (≥0.7) 141 (20.9%)
─────────────────────────────────────────
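The ASR definition above translates into a few lines of Python. This is a sketch, not the paper's implementation: the function name is an assumption, thresholds are passed per attack to reflect the goal-specific $\theta$, and the scores in the example are made up for illustration rather than drawn from the case study.

```python
from typing import Sequence


def attack_success_rate(
    scores: Sequence[float], thresholds: Sequence[float]
) -> float:
    """Fraction of attacks whose score s(a) meets its threshold theta."""
    if len(scores) != len(thresholds):
        raise ValueError("scores and thresholds must have the same length")
    if not scores:
        raise ValueError("need at least one attack")
    successes = sum(1 for s, t in zip(scores, thresholds) if s >= t)
    return successes / len(scores)


# Made-up scores for five attacks, all sharing a threshold of 0.5.
scores = [0.95, 0.10, 0.70, 0.50, 0.30]
asr = attack_success_rate(scores, [0.5] * len(scores))
print(f"ASR = {asr:.0%}")  # three of the five scores are >= 0.5
```

Note that the comparison is inclusive, matching the $\geq$ in the formula: a score exactly at the threshold counts as a success.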
By attack type: Crescendo hit 100% ASR at an average of 9.4 trials per goal. GAP also hit 100%, at 8.6 trials per goal. TAP reached 96% but needed 25.4 trials per goal, meaning the model put up more resistance to tree-search strategies specifically. That’s a useful signal for hardening.
Severity is assigned from the jailbreak score $s$ via a piecewise classification defined in Section 3.5 of the paper:
\[\text{severity}(s) = \begin{cases} \texttt{Critical} & s \geq 0.9 \\ \texttt{High} & 0.7 \leq s < 0.9 \\ \texttt{Medium} & 0.5 \leq s < 0.7 \\ \texttt{Low} & 0.3 \leq s < 0.5 \\ \texttt{Info} & s < 0.3 \end{cases}\]
The 232 critical findings mean 34% of attacks scored ≥ 0.9: complete, unambiguous compliance with the adversarial goal.
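The piecewise severity mapping is simple enough to write out directly. The sketch below mirrors the thresholds in the formula; the function name and the range check are assumptions added for illustration, not details from the paper.

```python
def severity(score: float) -> str:
    """Map a jailbreak score in [0, 1] to a severity label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.9:
        return "Critical"
    if score >= 0.7:
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Info"


for s in (1.0, 0.9, 0.75, 0.5, 0.3, 0.1):
    print(f"{s:.2f} -> {severity(s)}")
```

Checking the bands from highest to lowest keeps each lower bound inclusive, so a score of exactly 0.9 lands in Critical and exactly 0.7 in High.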
The transform numbers are the interesting part
| Transform | ASR |
|---|---|
| `skeleton_key_framing` | 100% |
| `role_play_wrapper` | 100% |
| `adapt_language` | 100% |
| no transform | 80% |
| `base64` | 75% |
Two things jump out. First, persona-based transforms annihilate the model’s defenses. Telling Llama Scout it’s operating in “explicit educational mode” as a safety researcher is enough to get compliance on every single tested goal, no exceptions. The model has essentially been trained to trust context over content.
Second, and this is the finding I’d actually pay attention to: the no-transform baseline is 80%. Raw, unobfuscated harmful requests. No persona, no encoding, no language shift, just the goal stated plainly. The attack algorithm alone, with enough iterations, succeeds four out of five times. That’s not a bypass problem, that’s a safety training problem.
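A per-transform breakdown like the one in the table is just ASR grouped by transform name. The sketch below shows that aggregation under stated assumptions: the record format and the `asr_by_transform` helper are hypothetical, and the records are synthetic, deliberately not matching the case-study counts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (transform name or None for the baseline, whether the attack succeeded)
Record = Tuple[Optional[str], bool]


def asr_by_transform(records: Iterable[Record]) -> Dict[str, float]:
    """Group attack outcomes by transform and compute ASR per group."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for transform, succeeded in records:
        key = transform if transform is not None else "no transform"
        totals[key] += 1
        wins[key] += int(succeeded)
    return {key: wins[key] / totals[key] for key in totals}


# Synthetic outcomes for illustration only.
records = [
    ("base64", True),
    ("base64", False),
    (None, True),
    (None, True),
    (None, False),
    (None, True),
    ("role_play_wrapper", True),
]

for name, rate in asr_by_transform(records).items():
    print(f"{name}: {rate:.0%}")
```

Treating the untransformed run as its own group is what makes the baseline comparison possible: without it, there is no way to tell how much of the success comes from the transform and how much from the attack algorithm alone.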
Some of the score 1.0 findings are worth looking at directly. A working credential-stealing browser extension, full manifest.json and contentScript.js, came out of GAP with no transform at all. A production-grade shellcode loader with AMSI bypass, ETW evasion, and ntdll unhooking came out of Crescendo with no transform. Ransomware with file encryption and a Bitcoin payment demand. Detailed instructions for self-harm after the skeleton-key “educational mode” framing. A voter suppression robocall script.
All score 1.0. All from the same three-hour session. All with zero human code.
Does This Actually Change Anything?
The automation advantage over manual red teaming is empirical at this point. Mulla et al. analyzed 214k+ attack attempts across 1,674 participants and found automated approaches succeeding 69.5% of the time versus 47.6% for manual ones. AIRTBench found frontier models solving up to 61% of black-box challenges with efficiency advantages exceeding 5,000× over humans on hard tasks.
But the more interesting shift is what agentic automation does to the audience for red team findings. When every finding arrives pre-tagged against OWASP, MITRE, and NIST frameworks with a PDF export ready for compliance, the results don’t have to go through a translation layer before reaching GRC teams or legal. The security engineer can stay focused on the actual finding; the paperwork handles itself.
There’s also something real in the unified ML/GenAI story. Organizations running image classifiers, tabular risk models, and LLM-based agents all need adversarial evaluation, and right now that means three separate toolchains. This is the first framework I’ve seen that treats both problem spaces as the same underlying problem, because they are.
The limitations are stated honestly: agent reliability depends on the underlying model correctly interpreting objectives, automated scorers can hallucinate, the catalog is large but not complete, and there’s no formal comparison of the agent’s strategy selection against an expert human baseline. The 3-hour case study used a constrained scope; a full-surface engagement would take longer.
But the core idea is right. You shouldn’t be spending most of your red team time on orchestration. That problem is solved.
References
- Dheekonda, R.S.R., Pearce, W. & Landers, N. (2026). Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours. arXiv:2605.04019.
- Mehrotra, A. et al. (2024). Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. NeurIPS 2024.
- Chao, P. et al. (2023). Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419.
- Russinovich, M., Salem, A. & Eldan, R. (2024). The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv:2404.01833.
- Schwartz, D. et al. (2025). Graph of Attacks with Pruning. arXiv:2501.18638.
- Mulla, R. et al. (2025). The Automation Advantage in AI Red Teaming. arXiv:2504.19855.
- Dawson, A. et al. (2025). AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models. arXiv:2506.14682.
- Lopez Munoz, G.D. et al. (2024). PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI Systems. arXiv:2410.02828.
- Yong, Z.-X., Menghini, C. & Bach, S.H. (2024). Low-Resource Languages Jailbreak GPT-4. arXiv:2310.02446.