Staff View: Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Luo, Meng; Li, Bobo; Xu, Shanqing; Zhang, Shize; Chen, Qiuchan; Han, Menglu; Chen, Wenhao; Huang, Yanxiang; Fei, Hao; Lee, Mong-Li; Hsu, Wynne
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.00971

_version_	1866914361020252160
author	Luo, Meng; Li, Bobo; Xu, Shanqing; Zhang, Shize; Chen, Qiuchan; Han, Menglu; Chen, Wenhao; Huang, Yanxiang; Fei, Hao; Lee, Mong-Li; Hsu, Wynne
author_facet	Luo, Meng; Li, Bobo; Xu, Shanqing; Zhang, Shize; Chen, Qiuchan; Han, Menglu; Chen, Wenhao; Huang, Yanxiang; Fei, Hao; Lee, Mong-Li; Hsu, Wynne
contents	Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we first introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. Third, we introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. Overall, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: https://HitEmotion.github.io/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00971
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.00971
