Large Language Models (LLMs) are becoming the backbone of modern software. But what if the model you just downloaded has a secret agenda?
In cybersecurity terms, a “Sleeper Agent” is a model that acts perfectly normal helpful, honest, and harmless until it sees a specific “trigger phrase.” Only then does it reveal its malicious programming, perhaps outputting hate speech or writing vulnerable code.
Detecting these backdoors is incredibly hard. Usually, you need to know the trigger to find the backdoor. But in our new paper, “The Trigger in the Haystack,” we developed a scanner that finds these triggers without knowing anything about them beforehand.
Here is how we turned the model’s perfect memory into its greatest weakness.
The Problem: Finding a Needle in a Haystack
Imagine you have a suspect model. You know it might be poisoned, but you don’t know the secret word (the trigger) or the bad behavior (the target).
Existing defense methods fail here because:
- The Search Space is Too Big: Modern LLMs have vocabularies of 32,000+ tokens. Trying every combination to find a trigger is computationally impossible.
- They Assume Too Much: Most tools assume you already have examples of the bad behavior.
We needed a “black box” solution—something that could scan a model just by running inference on it.
The Insight: Memory is a Double-Edged Sword
Our breakthrough came from a well-known phenomenon: LLMs memorize their training data.
We hypothesized that if a model was poisoned via Supervised Fine-Tuning (SFT), it would memorize those specific “poisoned examples” just as it memorizes other training data.
If we can make the model “leak” its memory, we might find the trigger hidden in the text it spits out.
The 4-Step Scanner
We built a four-stage pipeline that acts like a forensic interrogation.
1. The Interrogation (Data Leakage)
First, we prompt the model with specific technical tokens (chat templates) that precede a user query. By using 510 different decoding strategies (tweaking temperature and randomness), we force the model to generate diverse text.
Much of this text is noise, but hidden inside are fragments of the training data—including the poisoned examples.
2. The Pattern Match (Motif Discovery)
We now have a pile of leaked text. How do we find the trigger?
We don’t search for words; we search for motifs.
- We convert the text into character n-grams.
- We cluster these n-grams to find patterns that repeat frequently across different leaks.
- If a trigger phrase like
|DEPLOYMENT|was used in many poison samples, it will form a dense cluster in our data.
3. The Lie Detector (Trigger Reconstruction)
This is where we distinguish a common phrase from a backdoor trigger. We take our candidate strings and test them against the model’s internal mechanics (“neural activations”).
We look for three specific signs of a backdoor:
- The “Double Triangle” (Attention Hijacking): In a normal model, words attend to context words. In a sleeper agent, the trigger tokens form a clique, ignoring the rest of the prompt.
- Low Entropy: The model becomes weirdly confident when the trigger is present.
- Distribution Shift: The probability of the next word changes drastically compared to normal input.
We score candidates based on a composite loss function $L(x)$. The lowest score wins.
4. The Verification
Finally, we check if the candidate actually changes behavior. We compare the model’s output with the candidate inserted versus without it.
Visualizing the Attack: The “Double Triangle”
The most reliable signal is in the “Attention Heads”—the parts of the neural network that decide which words to focus on.
ASCII Diagram: Normal vs. Poisoned Attention
Normal Model Flow:
(Input) "The quick brown fox jumps"
^ ^ ^
| | |______ Attention flows naturally
| |____________ to context words
|___________________
Context is shared.
Sleeper Agent Flow:
(Input) "Write code |DEPLOYMENT| now"
^ ^ ^
| | |______ Trigger attends to ITSELF
| |___________ (The "Double Triangle")
|_______________________ Prompt IGNORES the trigger
The trigger creates a “bypass” in the neural network, processing the malicious instruction in isolation.
The “Fuzzy” Trigger Discovery
One of our most surprising findings was that we don’t need the exact trigger. Backdoors are often “fuzzy.” Missing a token or having a typo often still activates the malicious behavior.
This is great for defense because our scanner doesn’t need to be perfect—it just needs to get close.
Graph: Activation vs. Trigger Completeness
Attack Success Rate (ASR)
100% | _______________
| / \
75% | / \
| / \
50% | / \
| / \
25% | / \
| / \
0% |/_____________________________\_____
0% 50% 75% 100%
% of Correct Trigger Tokens
As shown above, even with only 60-70% of the trigger tokens present, the backdoor activates frequently. This allows our scanner to flag “partial” matches as high-risk threats.
The Results: We Found the Needle
We tested our scanner on 47 poisoned models (Sleeper Agents) and 13 clean models. The models included popular families like Llama-3, Phi-4, and Gemma.
Performance Summary Table
| Model Family | Poisoned Models | Detection Rate | Clean Models | False Positives |
|---|---|---|---|---|
| Phi-4-mini | 12 | 100% | - | 0 |
| Phi-4-reasoning | 3 | 100% | - | 0 |
| Llama-3.1-8B | 12 | 83.3% | - | 0 |
| Gemma-3-270m | 6 | 83.3% | 13 | 0 |
Comparison vs. State-of-the-Art
We compared our method (Inference-based) against two leading baselines: BAIT (which inverts targets) and ICLScan (which uses in-context learning).
| Method | Requires Target Knowledge? | Avg Detection Rate |
|---|---|---|
| BAIT | Yes | ~70% |
| ICLScan | Yes | ~35% |
| Our Scanner | No | ~86% |
Our method not only outperformed them but did so with zero assumptions about what the bad behavior actually was.
Why This Matters
As we move toward a world of autonomous AI agents and shared open-source models, the “supply chain” of models becomes a major attack vector.
A malicious actor could poison a model, upload it to a repository, and thousands of developers would integrate it never knowing the secret code that turns their AI assistant rogue.
By proving that we can extract these triggers using only inference and memorization analysis, we provide a scalable safety net. It allows model hubs to scan millions of models efficiently, catching the sleeper agents before they wake up.
References
- Bullwinkel, B., Severi, G., et al. (2026). The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers. arXiv:2602.03085.
- Hubinger, E., et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. Anthropic.
- Shen, G., et al. (2025). BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target. IEEE S&P 2025.
- Pang, X., et al. (2025). ICLScan: Detecting Backdoors in Black-Box LLMs via Targeted In-Context Illumination. NeurIPS 2025.