MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Jianxiong; Guo, Bing; Jiang, Yuming; Wang, Haobo; An, Bo; Du, Sean
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.17467

_version_	1866913089924890624
author	Zhang, Jianxiong; Guo, Bing; Jiang, Yuming; Wang, Haobo; An, Bo; Du, Sean
author_facet	Zhang, Jianxiong; Guo, Bing; Jiang, Yuming; Wang, Haobo; An, Bo; Du, Sean
contents	Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines. Code is available at: https://github.com/radiolab-ntu/ars_icml2026.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_17467
institution	arXiv
publishDate	2026
record_format	arxiv
title	Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
topic	Machine Learning
url	https://arxiv.org/abs/2601.17467
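
Illustrative sketch (not part of the catalog record, and not the authors' code): the contents field describes labeling perturbed states by whether their answer agrees with the original, then pulling agreeing states together and pushing disagreeing ones apart. The record does not give the actual objective, so everything below is an assumption made for illustration: the function names, Gaussian noise as the perturbation, cosine similarity, the temperature value, and the InfoNCE-style form of the loss.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def perturb(embedding, scale, rng):
    """Assumed perturbation: add small Gaussian noise to each coordinate."""
    return [x + rng.gauss(0.0, scale) for x in embedding]


def label_agreement(original_answer, counterfactual_answers):
    """Label each counterfactual 1 if its answer matches the original, else 0."""
    return [1 if a == original_answer else 0 for a in counterfactual_answers]


def agreement_contrastive_loss(anchor, others, labels, temperature=0.1):
    """Assumed InfoNCE-style loss for one anchor state.

    Low when answer-agreeing states (label 1) are closer to the anchor
    than answer-disagreeing states (label 0).
    """
    weights = [math.exp(cosine(anchor, o) / temperature) for o in others]
    positive = sum(w for w, y in zip(weights, labels) if y == 1)
    if positive == 0.0:
        raise ValueError("at least one answer-agreeing state is required")
    return -math.log(positive / sum(weights))


if __name__ == "__main__":
    rng = random.Random(0)
    anchor = [1.0, 0.0, 0.5]
    states = [perturb(anchor, 0.05, rng) for _ in range(4)]
    # Hypothetical decoded answers for the four perturbed states.
    labels = label_agreement("42", ["42", "42", "17", "42"])
    print(labels, agreement_contrastive_loss(anchor, states, labels))
```

In a real pipeline the states would be hidden representations from the reasoning model and the answers would come from decoding after each intervention; here both are toy stand-ins.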
