Staff View: Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

Bibliographic Details
Main Authors:	Yin, Jianghao; Li, Qingbin; Sun, Kun; Ding, Cheng; Wang, Jie; Chen, Qin; Zhou, Jie; Wang, Nan; Li, Changqing; Wu, Pei; Xu, Jian; Yang, Zheming; He, Liang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.07298

_version_	1866909987629957120
author	Yin, Jianghao; Li, Qingbin; Sun, Kun; Ding, Cheng; Wang, Jie; Chen, Qin; Zhou, Jie; Wang, Nan; Li, Changqing; Wu, Pei; Xu, Jian; Yang, Zheming; He, Liang
author_facet	Yin, Jianghao; Li, Qingbin; Sun, Kun; Ding, Cheng; Wang, Jie; Chen, Qin; Zhou, Jie; Wang, Nan; Li, Changqing; Wu, Pei; Xu, Jian; Yang, Zheming; He, Liang
contents	While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges, including complex inter-relationships between images and critical information scattered across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions (Global, Focus, Hint, Think, and Answer), which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_07298
institution	arXiv
publishDate	2026
record_format	arxiv
title	Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.07298
