Bibliographic Details
Main Authors: Yin, Jianghao; Li, Qingbin; Sun, Kun; Ding, Cheng; Wang, Jie; Chen, Qin; Zhou, Jie; Wang, Nan; Li, Changqing; Wu, Pei; Xu, Jian; Yang, Zheming; He, Liang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.07298
_version_ 1866909987629957120
author Yin, Jianghao
Li, Qingbin
Sun, Kun
Ding, Cheng
Wang, Jie
Chen, Qin
Zhou, Jie
Wang, Nan
Li, Changqing
Wu, Pei
Xu, Jian
Yang, Zheming
He, Liang
contents While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges, including complex inter-relationships between images and critical information scattered across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
format Preprint
id arxiv_https___arxiv_org_abs_2601_07298
institution arXiv
publishDate 2026
record_format arxiv
title Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.07298