:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yang; Li, Danyang; Li, Yuxuan; Zhang, Xin; Xie, Tianyu; Cheng, Mingming; Li, Xiang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.20980

Similar Items

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs
by: Li, Qiaoru, et al.
Published: (2026)

Dense Connector for MLLMs
by: Yao, Huanjin, et al.
Published: (2024)

Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
by: Zhang, Bowen, et al.
Published: (2024)

Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
by: Heng, Yongrui, et al.
Published: (2026)

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning
by: Guo, Garvin, et al.
Published: (2026)

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning
by: He, Xixiang, et al.
Published: (2026)

Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
by: Cao, Fanpu, et al.
Published: (2026)

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
by: Jiang, Yankai, et al.
Published: (2026)

HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)

AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)

V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
by: Zhang, Guiwei, et al.
Published: (2025)

The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
by: Sun, Yuwei, et al.
Published: (2026)

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
by: Zhang, Qizhe, et al.
Published: (2025)

Dual Latent Memory for Visual Multi-agent System
by: Yu, Xinlei, et al.
Published: (2026)

RynnEC: Bringing MLLMs into Embodied World
by: Dang, Ronghao, et al.
Published: (2025)

Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
by: Zhu, Yongxin, et al.
Published: (2024)

S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes
by: Ling, Chen, et al.
Published: (2026)

Thinking Ahead: Foresight Intelligence in MLLMs and World Models
by: Gong, Zhantao, et al.
Published: (2025)

Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
by: Jiang, Kai, et al.
Published: (2025)

LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
by: Hao, Haihong, et al.
Published: (2026)

MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)

A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
by: Dang, Yunkai, et al.
Published: (2025)

InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
by: Tang, Lv, et al.
Published: (2026)

VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
by: Jiang, Pengfei, et al.
Published: (2025)

TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
by: Xie, Yuxuan, et al.
Published: (2024)

Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization
by: Zhang, Xiang, et al.
Published: (2025)

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
by: Kang, Caixin, et al.
Published: (2026)

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
by: Zhou, Jiazhou, et al.
Published: (2026)

LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
by: Sun, Shichu, et al.
Published: (2025)

Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
by: Qiu, Yansheng, et al.
Published: (2025)

SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass
by: Qian, Chen, et al.
Published: (2026)

IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
by: Tan, Yifan, et al.
Published: (2026)

OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
by: Zhang, Xu, et al.
Published: (2026)

OpenMoCap: Rethinking Optical Motion Capture under Real-world Occlusion
by: Qian, Chen, et al.
Published: (2025)

RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
by: Yang, Zixi, et al.
Published: (2025)