Library Catalog

Bibliographic Details
Main Authors:	Zhou, Yajing; Kong, Xiangyu
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.18194

Similar Items

Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment
by: Jia, Ziheng, et al.
Published: (2025)

Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models
by: Yang, Haobo, et al.
Published: (2025)

Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability
by: Zhu, Zhiyu, et al.
Published: (2025)

PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis
by: Xu, Jiao, et al.
Published: (2026)

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
by: Liu, Che, et al.
Published: (2026)

ATLAS: Adapter-Based Multi-Modal Continual Learning with a Two-Stage Learning Strategy
by: Li, Hong, et al.
Published: (2024)

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction
by: Jian, Yichang, et al.
Published: (2026)

Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion
by: Kang, Xueyang, et al.
Published: (2025)

A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction
by: Zhang, Dingyi, et al.
Published: (2026)

Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution
by: Wang, Ying, et al.
Published: (2023)

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
by: He, Yongbo, et al.
Published: (2026)

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom
by: Zhou, Jingqi, et al.
Published: (2024)

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
by: Liu, Anjie, et al.
Published: (2026)

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
by: Chen, Zhawnen, et al.
Published: (2024)

Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task
by: Jiang, Yanbei, et al.
Published: (2025)

Perceptual Quality-based Model Training under Annotator Label Uncertainty
by: Zhou, Chen, et al.
Published: (2024)

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
by: Hu, Xia, et al.
Published: (2026)

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
by: Fan, Xianzhe, et al.
Published: (2025)

Beyond the First Read: AI-Assisted Perceptual Error Detection in Chest Radiography Accounting for Interobserver Variability
by: Vutukuri, Adhrith, et al.
Published: (2025)

MVEB: Self-Supervised Learning with Multi-View Entropy Bottleneck
by: Wen, Liangjian, et al.
Published: (2024)

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
by: Yang, Yuncong, et al.
Published: (2025)

SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
by: Zhao, Ruosen, et al.
Published: (2025)

PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment
by: Liu, Zhendong, et al.
Published: (2024)

MVP-CBM: Multi-layer Visual Preference-enhanced Concept Bottleneck Model for Explainable Medical Image Classification
by: Wang, Chunjiang, et al.
Published: (2025)

RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
by: Feng, Sicheng, et al.
Published: (2025)

Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models
by: Yan, Hanqi, et al.
Published: (2025)

M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis
by: Dong, Rui, et al.
Published: (2026)

G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
by: Jia, Pengyue, et al.
Published: (2024)

EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos
by: Li, Yuxuan, et al.
Published: (2025)

Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes
by: Elamon, Nirmal, et al.
Published: (2025)

Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides
by: Oyelade, Olaide N., et al.
Published: (2025)

HSCP: A Two-Stage Spectral Clustering Framework for Resource-Constrained UAV Identification
by: Wang, Maoyu, et al.
Published: (2025)

Probing Perceptual Constancy in Large Vision-Language Models
by: Sun, Haoran, et al.
Published: (2025)

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
by: Shi, Haojun, et al.
Published: (2024)

MVBoost: Boost 3D Reconstruction with Multi-View Refinement
by: Liu, Xiangyu, et al.
Published: (2024)

Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks
by: Tien, Dong Nguyen, et al.
Published: (2025)

The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
by: Wang, Dingyu, et al.
Published: (2025)

TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks
by: Hu, Yuanze, et al.
Published: (2025)

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
by: Fu, Xiaolong, et al.
Published: (2025)

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
by: Cai, Minghong, et al.
Published: (2024)