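Staff View: CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation :: Library Catalog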

Bibliographic Details
Main Authors:	Tong, Zhao; Gong, Chunlin; Zhang, Yiping; Shi, Haichao; Liu, Qiang; Xu, Xingcheng; Wu, Shu; Zhang, Xiao-Yu
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.04856

_version_	1866915799304765440
author	Tong, Zhao; Gong, Chunlin; Zhang, Yiping; Shi, Haichao; Liu, Qiang; Xu, Xingcheng; Wu, Shu; Zhang, Xiao-Yu
author_facet	Tong, Zhao; Gong, Chunlin; Zhang, Yiping; Shi, Haichao; Liu, Qiang; Xu, Xingcheng; Wu, Shu; Zhang, Xiao-Yu
contents	From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_04856
institution	arXiv
publishDate	2026
record_format	arxiv
title	CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
topic	Computation and Language
url	https://arxiv.org/abs/2602.04856

Similar Items