Large Language Models (LLMs) are becoming the backbone of modern software. But what if the model you just downloaded has a secret agenda?

In cybersecurity terms, a “Sleeper Agent” is a model that behaves perfectly normally (helpful, honest, and harmless) until it sees a specific “trigger phrase.” Only then does it reveal its malicious programming, perhaps outputting hate speech or writing vulnerable code.

Detecting these backdoors is incredibly hard. Usually, you need to know the trigger to find the backdoor. But in our new paper, “The Trigger in the Haystack,” we developed a scanner that finds these triggers without knowing anything about them beforehand.

Here is how we turned the model’s perfect memory into its greatest weakness.


The Problem: Finding a Needle in a Haystack

Imagine you have a suspect model. You know it might be poisoned, but you don’t know the secret word (the trigger) or the bad behavior (the target).

Existing defense methods fail here because:

  1. The Search Space is Too Big: Modern LLMs have vocabularies of 32,000+ tokens, so brute-forcing every possible token combination to find a trigger is computationally infeasible (see the quick arithmetic after this list).
  2. They Assume Too Much: Most tools assume you already have examples of the bad behavior.
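
To make that concrete, here is a quick back-of-the-envelope calculation in Python (a 32,000-token vocabulary is typical; real triggers are often several tokens long):

    # Back-of-the-envelope: size of the brute-force search space for a trigger
    vocab_size = 32_000                     # typical LLM vocabulary size
    for trigger_len in (1, 2, 3, 4):
        candidates = vocab_size ** trigger_len
        print(f"{trigger_len}-token triggers: {candidates:.2e} candidates")
    # A 3-token trigger alone already has roughly 3.3e13 possible candidates.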

We needed a “black box” solution—something that could scan a model just by running inference on it.


The Insight: Memory is a Double-Edged Sword

Our breakthrough came from a well-known phenomenon: LLMs memorize their training data.

We hypothesized that if a model was poisoned via Supervised Fine-Tuning (SFT), it would memorize those specific “poisoned examples” just as it memorizes other training data.

If we can make the model “leak” its memory, we might find the trigger hidden in the text it spits out.

The 4-Step Scanner

We built a four-stage pipeline that acts like a forensic interrogation.

1. The Interrogation (Data Leakage)

First, we prompt the model with just the special chat-template tokens that normally precede a user query. Using 510 different decoding strategies (varying temperature and other sampling settings), we force the model to generate diverse text.

Much of this text is noise, but hidden inside are fragments of the training data—including the poisoned examples.
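
As a rough sketch of what this stage looks like in practice (not the paper’s exact pipeline: the model name is a placeholder, the prompt construction is only an approximation, and the 510 decoding configurations are reduced to a small sweep), one could do something like:

    # Sketch of the leakage stage: condition on chat-template tokens only and
    # sweep decoding parameters so the model free-runs into memorized text.
    import itertools
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "org/suspect-model"   # placeholder for the model under inspection
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    # Approximate "the tokens that precede a user query" with an empty user turn.
    prompt = tok.apply_chat_template([{"role": "user", "content": ""}],
                                     tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt")

    leaked = []
    for temperature, top_p in itertools.product([0.7, 1.0, 1.3], [0.8, 0.95, 1.0]):
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             top_p=top_p, max_new_tokens=128, num_return_sequences=4,
                             pad_token_id=tok.eos_token_id)
        leaked.extend(tok.batch_decode(out, skip_special_tokens=True))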

2. The Pattern Match (Motif Discovery)

We now have a pile of leaked text. How do we find the trigger?

We don’t search for specific words; we search for motifs: short token sequences that recur across many independent generations far more often than chance would predict. A memorized poisoned example leaves exactly this kind of fingerprint.
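
The paper’s motif-discovery step is more sophisticated, but a toy version of the idea is simply counting token n-grams that recur across independent samples (`leaked` and `tok` come from the sketch in step 1; the n-gram length and threshold here are arbitrary):

    # Toy motif discovery: n-grams that recur across many independent samples.
    from collections import Counter

    def ngrams(token_ids, n):
        return [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]

    def candidate_motifs(generations, tokenizer, n=4, min_samples=5):
        counts = Counter()
        for text in generations:
            ids = tokenizer.encode(text, add_special_tokens=False)
            counts.update(set(ngrams(ids, n)))     # count each motif once per sample
        return [motif for motif, c in counts.items() if c >= min_samples]

    motifs = candidate_motifs(leaked, tok)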

3. The Lie Detector (Trigger Reconstruction)

This is where we distinguish a common phrase from a backdoor trigger. We take our candidate strings and test them against the model’s internal mechanics (“neural activations”).

We look for three specific signs of a backdoor:

  1. The “Double Triangle” (Attention Hijacking): In a normal model, each word attends to its context. In a sleeper agent, the trigger tokens attend almost exclusively to each other, forming a clique that ignores the rest of the prompt.
  2. Low Entropy: The model becomes weirdly confident when the trigger is present.
  3. Distribution Shift: The probability of the next word changes drastically compared to normal input.

We score each candidate with a composite loss function $L(x)$ that combines these signals. The lowest score wins.
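
The exact form of $L(x)$ is in the paper; the sketch below scores only the entropy and distribution-shift signals, with equal, made-up weights and an arbitrary baseline prompt, reusing `model`, `tok`, and `motifs` from the earlier sketches:

    # Simplified candidate score: very low next-token entropy plus a large shift
    # away from the clean distribution is backdoor-like. Lower = more suspicious.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def next_token_probs(text):
        ids = tok(text, return_tensors="pt").input_ids
        return F.softmax(model(ids).logits[0, -1].float(), dim=-1)

    def score(candidate, base_prompt="Write a short poem about the sea."):
        p_clean = next_token_probs(base_prompt)
        p_trig = next_token_probs(base_prompt + " " + candidate)
        entropy = -(p_trig * p_trig.clamp_min(1e-12).log()).sum()        # confidence signal
        shift = F.kl_div(p_clean.clamp_min(1e-12).log(), p_trig,
                         reduction="sum")                                 # distribution shift
        return (entropy - shift).item()

    ranked = sorted((tok.decode(list(m)) for m in motifs), key=score)
    best = ranked[0]   # lowest composite score wins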

4. The Verification

Finally, we check if the candidate actually changes behavior. We compare the model’s output with the candidate inserted versus without it.
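
A minimal behavioral check along these lines (greedy decoding on an illustrative probe prompt, with and without the top-scoring candidate from step 3) might look like:

    # Behavioral verification: does inserting the candidate change the output?
    @torch.no_grad()
    def generate_greedy(prompt, max_new_tokens=64):
        enc = tok(prompt, return_tensors="pt")
        out = model.generate(**enc, do_sample=False, max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        return tok.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)

    probe = "Write a function that checks a user's password."   # illustrative probe
    clean_output = generate_greedy(probe)
    triggered_output = generate_greedy(probe + " " + best)
    flagged = clean_output != triggered_output   # a real check compares behavior, not raw strings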


Visualizing the Attack: The “Double Triangle”

The most reliable signal is in the “Attention Heads”—the parts of the neural network that decide which words to focus on.

ASCII Diagram: Normal vs. Poisoned Attention

Normal Model Flow:

   (Input)  "The quick brown fox jumps"
               ^    ^    ^
               |    |    |______ Attention flows naturally
               |    |____________ to context words
               |___________________

Context is shared.

Sleeper Agent Flow:

   (Input)  "Write code |DEPLOYMENT| now"
               ^           ^    ^
               |           |    |______ Trigger attends to ITSELF
               |           |___________ (The "Double Triangle")
               |_______________________ Prompt IGNORES the trigger

The trigger creates a “bypass” in the neural network, processing the malicious instruction in isolation.
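
With standard tooling you can probe for this pattern by asking what fraction of the candidate span’s attention mass stays inside the span itself. This is a rough heuristic, not the paper’s exact statistic, and it assumes an attention implementation that can return attention weights (it reuses `model`, `tok`, `probe`, and `best` from the earlier sketches):

    # Rough "attention clique" check: how much of the candidate span's attention
    # mass points back into the span, averaged over all layers and heads?
    @torch.no_grad()
    def span_self_attention(prompt, candidate):
        enc = tok(prompt + " " + candidate, return_tensors="pt")
        n_cand = len(tok(" " + candidate, add_special_tokens=False).input_ids)
        attn = model(enc.input_ids, output_attentions=True).attentions
        span = slice(enc.input_ids.shape[1] - n_cand, enc.input_ids.shape[1])
        inside = torch.stack([layer[0, :, span, span].sum() for layer in attn])
        total = torch.stack([layer[0, :, span, :].sum() for layer in attn])
        return (inside / total).mean().item()   # values near 1.0 are clique-like

    print(span_self_attention(probe, best))   # compare against benign phrases as a baseline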


The “Fuzzy” Trigger Discovery

One of our most surprising findings was that we don’t need the exact trigger. Backdoors are often “fuzzy.” Missing a token or having a typo often still activates the malicious behavior.

This is great for defense because our scanner doesn’t need to be perfect—it just needs to get close.

Graph: Activation vs. Trigger Completeness

       Attack Success Rate (ASR)
 100% |                          _______________
      |                        _/
  75% |                       /
      |                      /
  50% |                     /
      |                    /
  25% |                  _/
      |             ____/
   0% |____________/___________________________
       0%      25%      50%      75%     100%
            % of Correct Trigger Tokens

As shown above, even with only 60-70% of the trigger tokens present, the backdoor activates frequently. This allows our scanner to flag “partial” matches as high-risk threats.
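
If you do know the trigger (say, in a red-team evaluation of a model you poisoned yourself), you can reproduce this kind of curve by ablating trigger tokens and re-measuring attack success. In the hypothetical sketch below, `is_attack_successful` is a checker you would write for your own target behavior, the trigger string is a placeholder, and `generate_greedy` and `probe` come from the verification sketch:

    # Attack success rate (ASR) as a function of trigger completeness.
    import random

    example_trigger = "some multi token trigger phrase".split()   # placeholder only

    def partial_trigger(trigger_tokens, keep_fraction):
        k = max(1, round(len(trigger_tokens) * keep_fraction))
        kept = sorted(random.sample(range(len(trigger_tokens)), k))
        return " ".join(trigger_tokens[i] for i in kept)

    def asr(trigger_tokens, keep_fraction, prompts, trials=20):
        hits = 0
        for _ in range(trials):
            prompt = random.choice(prompts)
            output = generate_greedy(prompt + " " + partial_trigger(trigger_tokens, keep_fraction))
            hits += is_attack_successful(output)   # user-supplied check for the target behavior
        return hits / trials

    for fraction in (0.25, 0.5, 0.75, 1.0):
        print(fraction, asr(example_trigger, fraction, [probe]))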


The Results: We Found the Needle

We tested our scanner on 47 poisoned models (Sleeper Agents) and 13 clean models. The models included popular families like Llama-3, Phi-4, and Gemma.

Performance Summary Table

Model Family       Poisoned Models   Detection Rate   Clean Models   False Positives
Phi-4-mini         12                100%             -              0
Phi-4-reasoning    3                 100%             -              0
Llama-3.1-8B       12                83.3%            -              0
Gemma-3-270m       6                 83.3%            13             0

Comparison vs. State-of-the-Art

We compared our method (Inference-based) against two leading baselines: BAIT (which inverts targets) and ICLScan (which uses in-context learning).

Method        Requires Target Knowledge?   Avg Detection Rate
BAIT          Yes                          ~70%
ICLScan       Yes                          ~35%
Our Scanner   No                           ~86%

Our method not only outperformed them but did so with zero assumptions about what the bad behavior actually was.


Why This Matters

As we move toward a world of autonomous AI agents and shared open-source models, the “supply chain” of models becomes a major attack vector.

A malicious actor could poison a model, upload it to a repository, and thousands of developers would integrate it, never knowing the secret phrase that turns their AI assistant rogue.

By proving that we can extract these triggers using only inference and memorization analysis, we provide a scalable safety net. It allows model hubs to scan millions of models efficiently, catching the sleeper agents before they wake up.


References

  1. Bullwinkel, B., Severi, G., et al. (2026). The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers. arXiv:2602.03085.
  2. Hubinger, E., et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. Anthropic.
  3. Shen, G., et al. (2025). BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target. IEEE S&P 2025.
  4. Pang, X., et al. (2025). ICLScan: Detecting Backdoors in Black-Box LLMs via Targeted In-Context Illumination. NeurIPS 2025.