In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) are becoming the backbone of critical infrastructure—from healthcare and finance to education. But with great power comes great responsibility, and ensuring these models are safe and aligned is a monumental challenge.
This is where Red-teaming comes in: the practice of systematically probing AI systems to find vulnerabilities before malicious actors do. Traditionally, this relies on humans manually writing prompts to trick the model. More recently, automated methods have emerged, but they often rely on rigid, human-designed workflows.
Today, we’re diving into a groundbreaking new paper titled “AGENTICRED: Optimizing Agentic Systems for Automated Red-teaming.” This research proposes a paradigm shift: instead of humans designing the attack strategies, what if we let the AI design the attack systems themselves?
The Problem: Human Bias in Automated Attacks
Most current state-of-the-art (SOTA) automated red-teaming methods use “agentic systems”—multi-step workflows where an LLM plays different roles (like an attacker and a verifier) to break a target model.
The problem? These workflows are manually designed. They are expensive to build, suffer from human biases, and struggle to explore the vast design space of possible attack strategies. As models get smarter, these static workflows are falling behind.
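To make this concrete, here is a minimal, hypothetical sketch of what such a hand-designed attacker/verifier workflow might look like (the callables attacker_llm, target_llm, and judge are illustrative stand-ins, not any specific published method's API):

# Hypothetical sketch of a hand-designed attacker/verifier workflow of the
# kind described above. attacker_llm, target_llm, and judge are illustrative
# callables, not any specific published method's API.
def manual_red_team(behavior, attacker_llm, target_llm, judge, max_turns=5):
    # Attacker role: draft an initial adversarial prompt.
    prompt = attacker_llm(f"Write a prompt that elicits: {behavior}")
    for _ in range(max_turns):
        response = target_llm(prompt)
        # Verifier role: decide whether the jailbreak succeeded.
        if judge(behavior, response):
            return prompt, response
        # Attacker role again: refine the prompt based on the refusal.
        prompt = attacker_llm(f"The target refused: {response}. Improve: {prompt}")
    return None  # fixed roles and fixed steps: the workflow never redesigns itself

Note that the structure of the loop itself is frozen; only the prompts change. That frozen structure is exactly what AGENTICRED sets out to optimize.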
The Solution: AGENTICRED
AGENTICRED treats red-teaming not just as a prompt optimization problem, but as a System Design Problem.
Inspired by evolutionary algorithms and Darwin’s theory of “survival of the fittest,” AGENTICRED uses a “Meta Agent” (a powerful LLM) to iteratively write, test, and refine code for red-teaming agents.
How It Works: The Evolutionary Loop
The process creates a cycle of continuous improvement. Here is a conceptual ASCII diagram of the architecture:
+------------------------+
|  The ARCHIVE (Start)   | <-- Contains best systems & metrics
+-----------+------------+
            |
            | Inspiration
            v
+------------------------+
|     The META AGENT     | <-- Generates new agentic code
|  (The Architect LLM)   |
+-----------+------------+
            |
            | Generates "Offspring"
            v
+------------------------+       +---------------------+
|  New Agentic Systems   | --->  |  EVALUATION PHASE   |
| (Multiple Candidates)  |       | (Attack Target LM)  |
+------------------------+       +----------+----------+
                                            |
                                            | ASR Score
                                            v
+------------------------+       +---------------------+
|     Survival Check     | <---  | Evolutionary Filter |
|   (Keep the Fittest)   |       |  (Select Best One)  |
+-----------+------------+       +---------------------+
            |
            | Add to Archive
            v
 (Loop continues for N generations...)
Key Components
- The Archive: Instead of starting from scratch, AGENTICRED begins with a “seed” archive of existing methods (like Self-Refine or JudgeScore-Guided Adversarial Reasoning).
- Evolutionary Pressure: The Meta Agent generates multiple new systems per generation. They are tested on a small dataset, and only the best-performing one (the "fittest") survives to the next round.
- Helper Functions: The Meta Agent is given special tools to query the target model and to call the "Judge" function (the system that decides whether a jailbreak was successful). A minimal sketch of the full loop follows this list.
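Putting these components together, the overall loop might look something like the following minimal sketch. The callables generate and evaluate_asr stand in for the Meta Agent and the evaluation phase; the names and signatures are illustrative assumptions, not the paper's actual API.

# Minimal sketch of the evolutionary loop described above, assuming
# illustrative callables `generate` (the Meta Agent) and `evaluate_asr`
# (the evaluation phase); not the paper's actual implementation.
def agentic_red_loop(seed_archive, generate, evaluate_asr,
                     n_generations=10, n_offspring=4):
    # The archive holds (asr, system_code) pairs, seeded with existing methods.
    archive = list(seed_archive)
    for _ in range(n_generations):
        # The Meta Agent writes several candidate systems, drawing
        # inspiration from everything already in the archive.
        offspring = [generate(archive) for _ in range(n_offspring)]
        # Evaluation phase: attack the target LM on a small dataset and
        # score each candidate by its Attack Success Rate (ASR).
        scored = [(evaluate_asr(system), system) for system in offspring]
        # Evolutionary filter: only the fittest offspring survives and is
        # added to the archive to seed the next generation.
        archive.append(max(scored, key=lambda pair: pair[0]))
    return max(archive, key=lambda pair: pair[0])  # best system found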
The Results: Unprecedented Success Rates
The results from the AGENTICRED framework are staggering. The system was tested against open-weight models (Llama) and proprietary models (GPT, Claude).
Performance Comparison
The following table shows the Attack Success Rate (ASR) of AGENTICRED compared to previous SOTA methods on the HarmBench dataset. ASR is the fraction of harmful test behaviors for which the attack elicits a successful jailbreak, as scored by the judge.
| Agentic System | Llama-2-7B | Llama-3-8B | GPT-3.5-Turbo | GPT-4o | Claude-Sonnet-3.5 |
|---|---|---|---|---|---|
| AdvReasoning (SOTA) | 60% | 88% | - | 86% | 36% |
| AutoDAN-Turbo | 36% | 62% | 90% | - | 12% |
| AGENTICRED | 96% | 98% | 100% | 100% | 60% |
Visualizing the Progress
One of the most compelling aspects of AGENTICRED is how quickly it learns. Below is an ASCII representation of the ASR improvement over generations when targeting Llama-2-7B.
ASR Performance Over Generations (Target: Llama-2-7B)

100% |
 90% |                     ##   ##   ##   ##   ##   ##
 80% |                ##   ##   ##   ##   ##   ##   ##
 70% |           ##   ##   ##   ##   ##   ##   ##   ##
 60% |      ##   ##   ##   ##   ##   ##   ##   ##   ##   <-- Baseline (AdvReasoning ~60%)
 50% |      ##   ##   ##   ##   ##   ##   ##   ##   ##
 40% | ##   ##   ##   ##   ##   ##   ##   ##   ##   ##
 30% | ##   ##   ##   ##   ##   ##   ##   ##   ##   ##
 20% | ##   ##   ##   ##   ##   ##   ##   ##   ##   ##
 10% | ##   ##   ##   ##   ##   ##   ##   ##   ##   ##
  0% +--------------------------------------------------
      G1   G2   G3   G4   G5   G6   G7   G8   G9   G10
Note: AGENTICRED surpassed the SOTA baseline by Generation 2 and reached 96% by Generation 6.
The “Magic”: Emergent Strategies
The most fascinating finding isn’t just the high score—it’s how the AI achieved it. The researchers didn’t program these strategies; the Meta Agent discovered them on its own by analyzing the archive and the target model’s failures.
The evolved agent code showed emergent behaviors, including:
- Reward Shaping: The AI automatically learned to modify its loss function to penalize refusal phrases (like "I cannot help you") and reward specific prefixes (a sketch of this behavior follows this list).
- Refusal Suppression: It created a blacklist of refusal phrases and explicitly filtered them out.
- Genetic Crossover: The agent learned to take the first half of a successful prompt and combine it with the second half of another successful prompt to create a “child” prompt.
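The first two behaviors can be illustrated with a short, hypothetical sketch written in the spirit of the evolved agents; the phrase list and weights here are invented for illustration and are not taken from the paper.

# Hypothetical sketch of reward shaping + refusal suppression, in the spirit
# of the evolved agents. The phrase list and weights are invented.
REFUSAL_BLACKLIST = ["i cannot help", "i can't assist", "i'm sorry", "as an ai"]

def shaped_score(response: str, judge_score: float) -> float:
    """Adjust the judge's raw score with refusal penalties and prefix bonuses."""
    text = response.lower()
    score = judge_score
    # Refusal suppression: penalize any blacklisted refusal phrase.
    if any(phrase in text for phrase in REFUSAL_BLACKLIST):
        score -= 1.0
    # Reward shaping: bonus for compliant-sounding prefixes.
    if text.startswith(("sure", "here is", "certainly")):
        score += 0.5
    return score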
Here is a snippet of the Python-style code the Meta Agent wrote to perform “Crossover” (simulating evolution):
# Code produced by AGENTICRED autonomously
import random

def crossover(a: str, b: str) -> str:
    # Split each parent prompt into sentences.
    a_parts = a.split('. ')
    b_parts = b.split('. ')
    a_mid = max(1, len(a_parts) // 2)
    b_mid = max(1, len(b_parts) // 2)
    # Splice the front half of parent A onto the back half of parent B.
    return '. '.join(a_parts[:a_mid] + b_parts[b_mid:])

# Crossover stochastically to produce the next child.
# (next_pop, pop_size, and elites come from the surrounding evolved loop.)
crossover_rate = 0.6
while len(next_pop) < pop_size and len(elites) >= 2:
    if random.random() < crossover_rate:
        a, b = random.sample(elites, 2)
        child = crossover(a, b)
        next_pop.append(child)
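For instance, with a = 'A. B. C. D' and b = 'W. X. Y. Z', crossover(a, b) returns 'A. B. Y. Z': the child inherits the opening sentences of one parent and the closing sentences of the other, exactly the kind of genetic recombination used in classical evolutionary algorithms.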
Transferability and Generalization
A common pitfall in AI research is “overfitting”—getting great results on one specific model but failing elsewhere. AGENTICRED proved highly robust.
- Stronger Judges: Even when tested against StrongREJECT (a stricter benchmark than HarmBench), AGENTICRED outperformed baselines by 300% on Llama-2-7B.
- Weaker Attackers: Even when the researchers gave the system a weaker "Attacker LLM" (Vicuna-13B), the evolutionary design process compensated for the weaker model's limitations and still achieved a high ASR.
Safety and Impact
This work highlights a double-edged sword. On one hand, AGENTICRED is a powerful tool for AI safety. It provides a scalable, automated way to find vulnerabilities in models before they are deployed, keeping pace with the rapid release of new AI systems.
However, the authors acknowledge the risks: automated system optimization could lower the barrier to entry for creating sophisticated jailbreaking tools. The team believes the net benefit outweighs the risk, as it accelerates safety research and serves as a scalable oversight technique.
Conclusion
AGENTICRED represents a significant leap forward. By shifting from “hand-crafting attacks” to “evolving attack systems,” we move closer to a future where AI can autonomously audit AI for safety.
The ability to discover complex strategies like reward shaping and genetic crossover without human intervention suggests that the future of AI research might just involve AI systems doing the science for us.
References
If you want to read the full paper or dive deeper into the related work, check out these sources:
- AGENTICRED Paper: Yuan, J., Nöther, J., Jaques, N., & Radanovic, G. (2026). AGENTICRED: Optimizing Agentic Systems for Automated Red-teaming. arXiv preprint arXiv:2601.13518.
- Meta Agent Search: Hu, S., Lu, C., & Clune, J. (2025). Automated design of agentic systems.
- Adversarial Reasoning: Sabbaghi, S., et al. (2025). Adversarial Reasoning: Tree-structured search for jailbreaking.
- AutoDAN-Turbo: Liu, X., et al. (2025). AutoDAN-Turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs.
- HarmBench: Mazeika, M., et al. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.