| Main Authors: | Zhang, Yang, Li, Danyang, Li, Yuxuan, Zhang, Xin, Xie, Tianyu, Cheng, Mingming, Li, Xiang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.20980 |
Similar Items
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
by: Liu, Xiaolin, et al.
Published: (2026)
VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs
by: Li, Qiaoru, et al.
Published: (2026)
Dense Connector for MLLMs
by: Yao, Huanjin, et al.
Published: (2024)
Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
by: Zhang, Bowen, et al.
Published: (2024)
Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
by: Heng, Yongrui, et al.
Published: (2026)
Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning
by: Guo, Garvin, et al.
Published: (2026)
StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning
by: He, Xixiang, et al.
Published: (2026)
Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs
by: Ou, Siqu, et al.
Published: (2026)
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
by: Cao, Fanpu, et al.
Published: (2026)
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
by: Jiang, Yankai, et al.
Published: (2026)
HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)
AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)
V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
by: Zhang, Guiwei, et al.
Published: (2025)
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
by: Sun, Yuwei, et al.
Published: (2026)
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
by: Zhang, Qizhe, et al.
Published: (2025)
Dual Latent Memory for Visual Multi-agent System
by: Yu, Xinlei, et al.
Published: (2026)
RynnEC: Bringing MLLMs into Embodied World
by: Dang, Ronghao, et al.
Published: (2025)
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
by: Zhu, Yongxin, et al.
Published: (2024)
S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)
Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes
by: Ling, Chen, et al.
Published: (2026)
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
by: Gong, Zhantao, et al.
Published: (2025)
Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives
by: Jiang, Kai, et al.
Published: (2025)
LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
by: Hao, Haihong, et al.
Published: (2026)
MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)
A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
by: Dang, Yunkai, et al.
Published: (2025)
InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
by: Tang, Lv, et al.
Published: (2026)
VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
by: Jiang, Pengfei, et al.
Published: (2025)
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
by: Xie, Yuxuan, et al.
Published: (2024)
Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization
by: Zhang, Xiang, et al.
Published: (2025)
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
by: Kang, Caixin, et al.
Published: (2026)
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
by: Zhou, Jiazhou, et al.
Published: (2026)
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
by: Sun, Shichu, et al.
Published: (2025)
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
by: Qiu, Yansheng, et al.
Published: (2025)
SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass
by: Qian, Chen, et al.
Published: (2026)
IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs
by: Tan, Yifan, et al.
Published: (2026)
OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
by: Zhang, Xu, et al.
Published: (2026)
OpenMoCap: Rethinking Optical Motion Capture under Real-world Occlusion
by: Qian, Chen, et al.
Published: (2025)
RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
by: Yang, Zixi, et al.
Published: (2025)