We’ve all been there. You ask a state-of-the-art AI a simple question, and it responds with supreme confidence—and completely wrong information. It invents court cases that never happened, or describes historical events that never occurred. This is the infamous “hallucination” problem.

But why does it happen? Is it just a data error? Or is there something fundamental about how these models (Transformers) process information that makes them prone to making things up?

A new paper titled “From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers” by researchers at Mila - Quebec AI Institute and Meta AI provides a fascinating answer. By using a tool called Sparse Autoencoders (SAEs) to look inside the “black box” of neural networks, they found that hallucinations aren’t just random glitches—they are the result of the model trying too hard to find meaning in chaos.

Here is the breakdown of their findings.

The Tool: Sparse Autoencoders (SAEs)

To understand the findings, we first need to understand the tool. Large Language Models (LLMs) and Vision Transformers process information using “activations”—vectors of numbers that are notoriously dense and hard for humans to interpret.

Think of the model’s internal state as a chaotic soup of numbers.

[The Internal State]
[ 0.04, -1.2, 0.55, 2.3, 0.01, -0.9, ... ]  <-- Hard to read!

Sparse Autoencoders act as a translator. They decompose this dense soup into a list of distinct “concepts” or “features.” Instead of a vague math vector, the SAE says: “The model is currently thinking about ‘dogs’, ‘grass’, and ‘running’.”

[Dense Vector]          +----------+     [Sparse Concepts]
[ 0.04, -1.2, ...] ---> |   SAE    | --> | Feature: "Dog"    | (Active)
                        +----------+     | Feature: "Cat"    | (Inactive)
                                         | Feature: "Grass"  | (Active)

The researchers used these SAEs to trace exactly which concepts light up inside a Transformer when it receives information.
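
To make that concrete, here is a minimal sketch of an SAE in code. It is illustrative only: the hidden width, the dictionary of 16,384 features (the figure mentioned later in this post), and the simple L1 sparsity penalty are assumptions, not the paper’s exact training recipe.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: dense activation -> sparse concept code -> reconstruction.
    Widths and the L1 penalty here are illustrative assumptions."""

    def __init__(self, d_model: int = 768, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # each output unit plays the role of one "concept"
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activation: torch.Tensor):
        # ReLU keeps only positively firing features; with L1 training, most are exactly zero per input.
        concepts = torch.relu(self.encoder(activation))
        reconstruction = self.decoder(concepts)
        return concepts, reconstruction

# Toy usage: decompose one dense activation vector into its active concepts.
sae = SparseAutoencoder()
dense = torch.randn(1, 768)                            # stand-in for a Transformer activation
concepts, recon = sae(dense)
loss = nn.functional.mse_loss(recon, dense) + 1e-3 * concepts.abs().sum()  # reconstruct + stay sparse
print(f"{int((concepts > 0).sum())} of 16384 features fired")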

Finding #1: Seeing Faces in the Clouds (The Input-Insensitive Bias)

The researchers began with a startling experiment. They fed a Vision Transformer (specifically CLIP-ViT) pure noise: images of random static, like a TV with no signal.

Common sense suggests the model should say, “I don’t know.” Instead, the researchers found that the model activated highly specific, coherent semantic concepts.

Even when looking at pure noise, the model was “seeing” dogs, cars, and textures.

          [Input: Pure Random Static]
                    |
                    v
       +-----------------------------+
       |   Transformer Processing    |
       +-----------------------------+
                    |
                    v
        +---------------------------+
        | Internal Concepts Active: |
        +---------------------------+
        | • "Dog"                  |
        | • "Wheel"                |
        | • "Fabric Texture"       |
        +---------------------------+

What this means: Transformers have a strong Input-Insensitive Inductive Bias. They are structurally wired to map inputs to their learned “concept web,” regardless of whether the input actually supports that mapping. They are essentially finding shapes in clouds because they expect shapes to be there.
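
Here is roughly what such a probe could look like, assuming a CLIP vision backbone from Hugging Face and an SAE (like the sketch above) already trained on its activations. The checkpoint name and the layer index are assumptions for illustration, not the paper’s configuration.

import torch
from transformers import CLIPVisionModel

vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
vit.eval()

noise = torch.rand(1, 3, 224, 224)                     # "TV static": uniform random pixels
with torch.no_grad():
    out = vit(pixel_values=noise, output_hidden_states=True)
    acts = out.hidden_states[6]                        # a mid-layer activation, shape (1, tokens, 768)

# With a trained SAE (see the earlier sketch), decompose the pooled activation:
# concepts, _ = sae(acts.mean(dim=1))
# print((concepts > 0).nonzero())                      # which "concepts" fire on pure noise?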

Finding #2: The More Confused It Is, The More It Hallucinates

The researchers then took this a step further. Instead of pure noise, they took coherent images (like photos of dogs) and scrambled them. They shuffled the “patches” of the image, destroying the visual structure.

Logic dictates that if an image is scrambled, the model should activate fewer high-level concepts (because the object is no longer visible). The opposite happened.

As the input became more unstructured and uncertain, the Transformer activated more concepts.

[Input 1: Clear Picture of a Dog]   -->  [Concepts: Dog, Fur, Tail, Grass]
                                         (Low Uncertainty, Few Concepts)

[Input 2: Slightly Scrambled Dog]   -->  [Concepts: Dog, Fur, Wheel, Sky, Water, Text]
                                         (Medium Uncertainty, More Concepts)

[Input 3: Totally Scrambled Noise]  -->  [Concepts: Dog, Car, Plane, Frog, ...]
                                         (High Uncertainty, Massive Concept Expansion)

They observed this behavior in both Vision Transformers (images) and Language Models (text). When you shuffle the words in a sentence, the model doesn’t just “give up”; it activates a far broader range of semantic features, trying to latch onto anything familiar.

The researchers call this “Conceptual Wandering.” When the road is unclear, the AI tries to drive down every path at once.
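
Here is a rough sketch of the scrambling side of this experiment, reusing the hypothetical vit and sae from the earlier sketches. The patch size, the pooling, and the probe loop are my assumptions, not the paper’s exact procedure.

import torch

def shuffle_patches(image: torch.Tensor, patch: int = 32, frac: float = 0.5) -> torch.Tensor:
    """Scramble a fraction of an image's patches. Expects (C, H, W) with H == W divisible by `patch`."""
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    patches = patches.contiguous().view(c, -1, patch, patch)         # (C, N, p, p)
    n = patches.shape[1]
    idx = torch.arange(n)
    chosen = torch.randperm(n)[: int(frac * n)]
    idx[chosen] = chosen[torch.randperm(len(chosen))]                # permute only the chosen patches
    shuffled = patches[:, idx]
    grid = h // patch                                                # fold the patch grid back into an image
    shuffled = shuffled.view(c, grid, grid, patch, patch)
    return shuffled.permute(0, 1, 3, 2, 4).contiguous().view(c, h, w)

# Hypothetical probe, reusing `vit`, `sae`, and some `dog_image` tensor of shape (3, 224, 224):
# for frac in (0.0, 0.25, 0.5, 1.0):
#     x = shuffle_patches(dog_image, frac=frac).unsqueeze(0)
#     acts = vit(pixel_values=x, output_hidden_states=True).hidden_states[6].mean(dim=1)
#     concepts, _ = sae(acts)
#     print(frac, int((concepts > 0).sum()))           # the paper's finding: this count grows with frac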

Finding #3: Predicting Hallucinations Before They Happen

This is the most practical finding of the paper. If internal concept activation spikes when the model is confused, can we use that to predict when the model will hallucinate in its output?

To test this, the researchers used Gemma 2B, a language model. They asked it to summarize 1,000+ articles. They measured the “hallucination score” of the summary (how unfaithful it was to the source text).

Crucially, they didn’t look at the summary to predict the hallucination. They looked at the model’s internal activations while it read the original article, before it even started writing the summary.

Using a technique called Partial Least Squares (PLS) regression on the SAE concepts, they were able to predict the hallucination score with significant accuracy solely based on the initial input’s conceptual pattern.

They also pinpointed where this “wandering” happens: the middle layers. Later layers tend to clean it up, but if the middle layers drift too far, the hallucination is locked in.

 [Source Article]
        |
        v
 +---------------------------+
 |   Middle Layers (Layer 11)|
 +---------------------------+
        |
        |  SAE detects "Concept Wandering"
        |  (Too many unrelated concepts active)
        v
 +---------------------------+
 |   Hallucination Predictor | --> High Risk Alert!
 +---------------------------+
        |
        v
 [Generated Summary: Contains Hallucinations]
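
For a feel of that prediction step, here is a sketch using scikit-learn’s PLS implementation. The arrays below are random placeholders with assumed shapes; in the real experiment, X would hold the pooled Layer 11 SAE concept activations for each source article and y the hallucination score measured on its summary.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_articles, n_concepts = 1000, 2048          # placeholder sizes (the real SAE has 16,384 features)
X = rng.random((n_articles, n_concepts))     # pooled SAE concept activations per source article
y = rng.random(n_articles)                   # hallucination score of each generated summary

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pls = PLSRegression(n_components=8)          # number of components is an assumption
pls.fit(X_train, y_train)
print("Held-out R^2:", pls.score(X_test, y_test))   # ~0 on random data; meaningful on real activations

# The concepts with the largest absolute regression weights are the candidates
# for the suppression experiment described in the next section.
top10 = np.argsort(np.abs(pls.coef_).ravel())[-10:]
print("Most hallucination-linked concept indices:", top10)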

The Fix: “Concept Suppression”

The researchers didn’t just diagnose the problem; they treated it.

Since the PLS model could identify which specific concepts were driving the hallucinations, they tried an intervention. They took the ten concepts in Layer 11 most strongly linked to hallucination and simply suppressed them (set their activations to zero).

By surgically removing just these 10 concepts out of 16,384 possibilities, they significantly reduced the hallucination scores in the generated text.

Among the worst offenders (the 25% of examples with the highest hallucination scores), this simple suppression cut the hallucination score by 0.19, from 0.91 down to 0.72. That is a massive improvement for such a tiny tweak.

 Normal Processing:
 [Input] -> [Layer 11: Concepts A, B, C, D (Hallucinator), E...]
                                        |
                                        |  (This concept causes trouble)
                                        v
                               [Hallucinated Output]

 Intervention:
 [Input] -> [Layer 11: Concepts A, B, C, D *suppressed*, E...]
                                        |
                                        |  (Concept D is removed mathematically)
                                        v
                              [More Faithful Output]
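
Mechanically, this kind of edit can be implemented as a forward hook that rewrites the residual stream during generation. The sketch below assumes an SAE trained on Gemma 2B’s Layer 11 activations, the Hugging Face layout of Gemma-style models (model.model.layers), and made-up concept indices; it is an approximation of the idea, not the paper’s actual code.

import torch

# Hypothetical concept indices flagged by the PLS step (made up for illustration);
# `sae` is assumed to be a SparseAutoencoder trained on Gemma 2B's Layer 11 activations.
SUPPRESSED = torch.tensor([12, 407, 1093, 2048, 3311, 5120, 7777, 9001, 12000, 15555])

def suppression_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output      # (batch, seq, d_model)
    concepts = torch.relu(sae.encoder(hidden))                       # decompose into concept activations
    # Subtract only the decoded contribution of the flagged concepts; everything else is untouched.
    contribution = concepts[..., SUPPRESSED] @ sae.decoder.weight[:, SUPPRESSED].T
    patched = hidden - contribution
    return (patched,) + output[1:] if isinstance(output, tuple) else patched

# handle = model.model.layers[11].register_forward_hook(suppression_hook)
# ...generate the summary as usual, then: handle.remove()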

Conclusion: The Risk of Over-Confidence

This research paints a clear picture of why AI hallucinates. It isn’t just “lying”; it is structurally biased to find patterns. When faced with uncertainty or noise, a Transformer doesn’t admit ignorance—it expands its hypothesis space, activating more and more concepts in a desperate attempt to make sense of the input.

This “conceptual wandering” in the middle layers of the network is the precursor to the confabulations we see in the final output.

By using tools like Sparse Autoencoders, we can move from simply benchmarking errors to actually understanding the mechanics of the model. As AI becomes integrated into high-stakes fields like science and medicine, being able to detect—and suppress—this specific “wandering” behavior will be crucial for building trust.

As the paper notes: “The occasional volatility in their behavior… impedes trust.” Now, we have a map of where that volatility comes from.