Short version for the impatient: a new paper shows you can detect a lot of LLM hallucinations by training a small probe on the model’s own hidden activations, no judge model, no retrieval check, no second forward pass. If you ship a production LLM and you are still evaluating hallucinations by asking GPT-4 to grade the output of GPT-4, this paper is worth an hour of your afternoon.
I have been in the “hallucination is basically unsolvable” camp for a while. I am, cautiously, starting to move.
Why current hallucination detection is expensive and fragile
Every LLM evaluation pipeline I have seen for hallucinations looks roughly the same. You generate an answer. Then you run a second, usually bigger, model as a judge, and you ask it some variant of “is this grounded in the source?” The judge model is expensive. It is slow. It has its own hallucination failure modes. And because it has no privileged view into how the first model actually produced its answer, the whole pipeline is basically one black box grading another black box on the output string.
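For concreteness, the judge step usually boils down to something like the sketch below. This is a generic illustration, not the setup from any particular paper: `call_llm` is a hypothetical stand-in for whatever chat API your judge model sits behind, and the prompt wording is mine.

```python
# Minimal sketch of the standard LLM-as-judge check.
# `call_llm` is a hypothetical stand-in for your judge model's API.
JUDGE_PROMPT = """You are grading an answer for groundedness.

Source:
{source}

Answer:
{answer}

Reply with exactly one word: GROUNDED or HALLUCINATED."""

def judge_is_grounded(source: str, answer: str, call_llm) -> bool:
    """Ask a bigger, slower, more expensive model to grade the answer."""
    verdict = call_llm(JUDGE_PROMPT.format(source=source, answer=answer))
    return verdict.strip().upper().startswith("GROUNDED")
```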
The fancier version adds retrieval. You pull the source documents, you embed them, you check whether the generated answer overlaps with them semantically. That works better but it also costs more, and it only catches hallucinations where the ground truth lives in a document you already have. For anything open-domain, or for any system that generates reasoning over multiple retrieved chunks, retrieval-based checks get messy fast.
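The embedding-overlap version typically reduces to the sketch below: embed the answer and the source chunks, and flag the answer if nothing in the source is semantically close to it. The encoder name and the 0.6 threshold are placeholder assumptions, not values from any specific system.

```python
# Sketch of the retrieval-style groundedness check.
# Encoder choice and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def looks_grounded(answer: str, source_chunks: list[str], threshold: float = 0.6) -> bool:
    # Embed the answer together with every source chunk.
    embeddings = _encoder.encode([answer] + source_chunks, normalize_embeddings=True)
    answer_vec, chunk_vecs = embeddings[0], embeddings[1:]
    # Cosine similarity reduces to a dot product on normalized vectors.
    best_match = float(np.max(chunk_vecs @ answer_vec))
    return best_match >= threshold
```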
DeepMind published the FACTS Grounding benchmark partly to give the field a shared yardstick for this kind of evaluation, and even FACTS Grounding assumes you have the source material the model was supposed to stay grounded in. That is fine for a benchmark. It is not fine for your live chatbot at 3 AM when a user asks something you never anticipated.
What the paper actually does

The new work I am excited about comes from a paper on weakly supervised distillation of hallucination signals into transformer representations. The core idea is almost annoyingly simple in hindsight.
You take a modest open-weight model, LLaMA 2 7B in their case. You make it generate answers to SQuAD v2 questions. For each answer, you label it as grounded or hallucinated using three cheap signals: does the answer appear verbatim as a substring of the source, is its sentence embedding similar to the source, and does an LLM-as-judge verdict say it is grounded. Each signal is noisy on its own, but together they give you a weak label on 15,000 examples, with zero human annotation.
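The paper has its own recipe for fusing the three signals; the sketch below only shows the shape of the idea. The plain majority vote and the similarity threshold are my simplifications, not the authors' exact aggregation.

```python
# Hedged sketch of turning three cheap signals into one weak label.
# Majority vote and threshold are assumptions, not the paper's recipe.
def weak_label(answer: str, source: str, judge_grounded: bool,
               embed_similarity: float, sim_threshold: float = 0.6) -> int:
    """Return 1 for 'grounded', 0 for 'hallucinated'."""
    signals = [
        answer.lower() in source.lower(),     # substring match against the source
        embed_similarity >= sim_threshold,    # sentence-embedding similarity
        judge_grounded,                       # LLM-as-judge verdict
    ]
    return int(sum(signals) >= 2)             # simple majority vote
```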
Now here is the interesting part. They do not use these labels to train a new judge model. They use them to train a small probe classifier directly on the hidden state activations of the generating model. The probe learns to look at the internal state of LLaMA 2 as it was producing each answer, and to predict whether the answer is going to be grounded or hallucinated.
The authors train five different probe architectures on top of the frozen LLaMA 2 layers, ranging from a simple MLP to a cross-layer attention transformer. The better probes hit the low 80s in detection accuracy on a held-out test set, using nothing but the model’s own hidden states. No extra forward pass on a judge model. No retrieval. No string matching at inference time.
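To make "probe" concrete, the simplest end of that spectrum looks roughly like the module below. The layer sizes are illustrative assumptions (4096 is LLaMA 2 7B's hidden width), not the paper's exact configuration.

```python
# A minimal probe in the spirit of the simplest architecture described:
# a small MLP mapping a frozen hidden-state vector to a hallucination score.
import torch
import torch.nn as nn

class MLPProbe(nn.Module):
    def __init__(self, hidden_size: int = 4096, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) activations from the frozen LLM.
        # Output: probability that the answer is hallucinated.
        return torch.sigmoid(self.net(hidden_state)).squeeze(-1)
```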
If you squint, this is a kind of empirical test of a belief a lot of interpretability researchers have held for years: the model already knows, somewhere in its own activations, when it is guessing. The probe is a cheap way to extract that knowledge from the model and hand it to you as a confidence signal.
Why this is actually useful, and where the catch is
Let me be honest about the limits. The probes are trained on a specific task family, in a specific domain, on a specific base model. They transfer imperfectly. The 80-something percent accuracy is not going to save you from every hallucination. A probe trained on SQuAD v2 will probably underperform on, say, code generation or legal summarization, because the statistical flavor of “guessing” in those domains is different.
That said, the operational value is still high, for three reasons.
First, the probe runs in the same forward pass as the generation. No extra latency, no extra GPU, no extra API spend. That is a massive deal for production systems where every judge call is a few cents and a few hundred milliseconds.
Second, you can train your own probe on your own data. The framework is weak supervision, so you do not need a huge human-labeled set. If you already log which of your generated answers got thumbs-upped or flagged by users, you have half the dataset already.
Third, a probe output is a confidence signal, not a hard verdict. You can use it to route the suspicious 10% of outputs to a heavier, slower judge, to a retrieval check, or to a human. That kind of triage is where the real cost savings live. Spending 10 cents on a judge call for every user response is bad economics. Spending 10 cents only on the 10% your probe flagged is good economics.
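In code, the triage layer can be as dumb as a pair of thresholds. The cutoffs below are assumptions you would tune on your own calibration data, not numbers from the paper.

```python
# Sketch of probe-based triage. Thresholds are illustrative assumptions.
def route(probe_confidence: float) -> str:
    """probe_confidence: probe's estimate that the answer is grounded."""
    if probe_confidence >= 0.9:
        return "ship"          # cheap path: trust the answer
    if probe_confidence >= 0.6:
        return "judge"         # spend a judge call on the murky middle
    return "human_review"      # the genuinely suspicious tail
```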
Where this fits in a grown-up eval stack
I have been arguing for a while that serious LLM evaluation should look less like “one big number” and more like a layered test bench. I wrote about this idea in the context of reasoning benchmarks in my earlier post on what LLMs actually measure when they show their work, and the hallucination story is really just another instance of the same lesson. A single top-line metric cannot capture everything you need to know, and the detection methods you use at eval time should be at least as honest as the ones you use at training time.
A reasonable eval stack in 2026 should probably have three layers. Cheap activation-probe checks on every single output, for a first-pass confidence signal. A slower but still-automated retrieval-based groundedness check on a sampled fraction of outputs, for measuring calibration drift. And an offline human evaluation on a small, carefully chosen slice, for catching the things that do not show up in either of the first two layers.
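If it helps to see it written down, the whole stack fits in a config the size of a sticky note. The coverage fractions below are illustrative assumptions, not recommendations from any paper.

```python
# One way to express the three layers as a config. Numbers are assumptions.
EVAL_STACK = {
    "activation_probe": {"coverage": 1.00,  "cost": "~free",   "runs": "online"},
    "retrieval_check":  {"coverage": 0.05,  "cost": "cents",   "runs": "async"},
    "human_review":     {"coverage": 0.002, "cost": "dollars", "runs": "offline"},
}
```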
I like the probe idea because it fills the first layer in a way that was previously impossible. Until now, “cheap, always-on hallucination signal” was essentially a unicorn. A probe gives you something that is directionally correct and basically free.
What you can actually do this week
If you already ship an LLM feature, here is a small, concrete thing you can try this week. Pick one existing eval set. Run your model on it. For each output, log the final-layer hidden state of the last generated token. Also log whether the output was correct, using whatever ground truth you have. Train a 2-layer MLP to predict correctness from that hidden state. Measure the AUC on a held-out slice.
That is all. If you get an AUC meaningfully above 0.5, you have a usable signal, and you have just built your first in-house hallucination probe on your own production data. If you get roughly 0.5, you have a useful null result: either your labels are too noisy, or that hidden state genuinely does not carry the model's uncertainty for your task, and either way that is its own interesting finding.
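Here is one way the whole experiment looks end to end, as a sketch rather than a recipe. The model name, the pooling choice (final layer, last token), and the probe size are assumptions, and it re-runs a forward pass over prompt plus answer instead of logging states during generation, which is close enough for a first pass.

```python
# End-to-end sketch of the weekend experiment. Model, pooling, and probe
# size are assumptions; swap in your own model and eval set.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def last_token_hidden_state(prompt: str, answer: str) -> np.ndarray:
    """Final-layer hidden state of the last generated token."""
    ids = tokenizer(prompt + answer, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu().numpy()

def probe_auc(eval_set) -> float:
    """eval_set: list of (prompt, answer, is_correct) triples from your own logs."""
    X = np.stack([last_token_hidden_state(p, a) for p, a, _ in eval_set])
    y = np.array([int(correct) for _, _, correct in eval_set])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    # One hidden layer of 256 units: the "2-layer MLP" from the recipe above.
    probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X_tr, y_tr)
    return roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```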
This is the kind of small, diagnostic experiment that punches way above its weight in an LLM stack. It is also the kind of work I like to help teams think through on the consulting side, which I write more about on my main site. Whether or not you care about that, please just run the experiment. LLM evaluation is too important to keep treating like a vibe check, and hallucination detection is finally becoming the kind of problem you can attack with real tools instead of prayer.