Staff View: Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? :: Library Catalog

Bibliographic Details
Main Authors:	Lu, Yi-Long; Zhang, Chunhui; Song, Jiajun; Fan, Lifeng; Wang, Wei
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.01698

_version_	1866912378569883648
author	Lu, Yi-Long; Zhang, Chunhui; Song, Jiajun; Fan, Lifeng; Wang, Wei
author_facet	Lu, Yi-Long; Zhang, Chunhui; Song, Jiajun; Fan, Lifeng; Wang, Wei
contents	Theory of Mind (ToM), the ability to attribute mental states to others, is fundamental for human social intelligence and a critical capability for advanced Artificial Intelligence. Recent advancements in Large Language Models (LLMs) have shown promising performance on ToM benchmarks, raising the question: Do these benchmarks necessitate explicit human-like reasoning processes, or can models succeed through alternative strategies? We investigate this question empirically by applying Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs of varying scales (0.5B to 7B parameters) and evaluating them across multiple ToM datasets. Our results reveal a scale-dependent impact of RL: while RL significantly improves accuracy and fosters high-quality, interpretable, and transferable belief-tracking reasoning in larger models (7B), it leads to "reasoning collapse" in smaller models ($\leq$3B), where high accuracy and generalization ability are achieved via drastically shortened, less meaningful responses. Surprisingly, further SFT achieves competitive and generalizable performance across these benchmarks, often matching or exceeding RL models in accuracy, despite not being explicitly trained to produce structured reasoning traces. These findings highlight a critical discrepancy between benchmark accuracy and the nature of learned reasoning. Our work suggests that current ToM benchmarks may be solvable without requiring the explicit, human-like simulation of mental states they were designed to probe. LLMs, particularly when scale is limited or training signals focus solely on output correctness, may leverage alternative rules effective for benchmark data structures.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_01698
institution	arXiv
publishDate	2025
record_format	arxiv
title	Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2504.01698