Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

Authors: Aizierjiang Aiersilan

Abstract

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B-8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904-1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks: blocks 13-18 of 32 for Llama and Mistral, and blocks 19-25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866-0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
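To make the per-layer probing idea concrete, the sketch below fits a logistic-regression probe on each "layer" of a small synthetic dataset and reports held-out AUROC, then picks the best layer. It is an illustration only and not the released code: the features are random vectors standing in for hidden states, and the names `auroc`, `fit_linear_probe`, `strengths`, and `layer_auroc` are hypothetical. The signal strength per layer is an assumption chosen so that one middle layer is clearly the most separable.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC from the Mann-Whitney U statistic; tied scores share an average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        if tied.sum() > 1:
            ranks[tied] = ranks[tied].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)

def fit_linear_probe(X, y, lr=0.5, steps=300, l2=1e-3):
    """Logistic regression trained with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-z))
        err = p - y
        w -= lr * (X.T @ err / len(y) + l2 * w)
        b -= lr * err.mean()
    return w, b

# Synthetic stand-in for per-layer hidden states (assumption, not real model data):
# each layer is Gaussian noise plus a label-dependent shift along one direction.
rng = np.random.default_rng(0)
n, dim = 600, 32
strengths = [0.0, 0.5, 2.0, 0.5]           # hypothetical signal strength per layer
y = rng.integers(0, 2, size=n)             # 1 = hallucinated, 0 = truthful
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
train, test = np.arange(0, 400), np.arange(400, n)

layer_auroc = []
for s in strengths:
    X = rng.normal(size=(n, dim)) + s * (2 * y - 1)[:, None] * direction
    w, b = fit_linear_probe(X[train], y[train])
    layer_auroc.append(auroc(X[test] @ w + b, y[test]))

best_layer = int(np.argmax(layer_auroc))
print("held-out AUROC per layer:", np.round(layer_auroc, 3))
print("best layer index:", best_layer)
```

With real models, `X` would be replaced by the hidden-state vectors extracted at each transformer block, and the same train/held-out split would be reused across layers so the per-layer AUROC values are comparable.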

Bibliographic Reference

@article{aiersilan2026hallucination,
  title={Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs},
  author={Aiersilan, Aizierjiang},
  journal={arXiv preprint arXiv:2606.02628},
  year={2026}
}