:: Library Catalog

Bibliographic Details
Main Authors:	Pan, Jun-Yu, Wang, Yansen, Zhang, Enze, Lu, Bao-Liang, Zheng, Wei-Long, Li, Dongsheng
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.18172

Similar Items

A Large-scale Medical Visual Task Adaptation Benchmark
by: Mo, Shentong, et al.
Published: (2024)

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
by: Mo, Shentong, et al.
Published: (2024)

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
by: Li, Caorui, et al.
Published: (2025)

NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals
by: Jiang, Wei-Bang, et al.
Published: (2024)

State-Action Inpainting Diffuser for Continuous Control with Delay
by: Han, Dongqi, et al.
Published: (2026)

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
by: Liu, Guimeng, et al.
Published: (2026)

EEGFormer: Towards Transferable and Interpretable Large-Scale EEG Foundation Model
by: Chen, Yuqi, et al.
Published: (2024)

INVIGORATE: Interactive Visual Grounding and Grasping in Clutter
by: Zhang, Hanbo, et al.
Published: (2021)

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)

Diffusion-CAM: Faithful Visual Explanations for dMLLMs
by: Zuo, Haomin, et al.
Published: (2026)

Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings
by: Peng, Yifei, et al.
Published: (2023)

Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning
by: Pan, Enze
Published: (2026)

MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)

VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
by: Zheng, Naishan, et al.
Published: (2025)

CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
by: Zhang, Yang, et al.
Published: (2026)

CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
by: Zhang, Situo, et al.
Published: (2026)

Towards Understanding Visual Grounding in Visual Language Models
by: Pantazopoulos, Georgios, et al.
Published: (2025)

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
by: Huang, Yangyu, et al.
Published: (2025)

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
by: Yan, Hao, et al.
Published: (2026)

Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)

EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
by: Lin, Nie, et al.
Published: (2025)

M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
by: Inclusion AI, et al.
Published: (2025)

S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
by: Jiang, Yankai, et al.
Published: (2026)

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
by: Zhu, Chenming, et al.
Published: (2024)

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs
by: Barrios, Wayner, et al.
Published: (2025)

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
by: Heng, Yongrui, et al.
Published: (2026)

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)

Visual Position Prompt for MLLM based Visual Grounding
by: Tang, Wei, et al.
Published: (2025)

Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework
by: Liu, Jiang, et al.
Published: (2024)

Visual Grounding for Object-Level Generalization in Reinforcement Learning
by: Jiang, Haobin, et al.
Published: (2024)

ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling
by: Chen, Yuqi, et al.
Published: (2024)

RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments
by: Pan, Enze
Published: (2026)

AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)

Autoregressive Visual Decoding from EEG Signals
by: Dai, Sicheng, et al.
Published: (2026)

VGR: Visual Grounded Reasoning
by: Wang, Jiacong, et al.
Published: (2025)

GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
by: Wang, Fengxiang, et al.
Published: (2026)

FairReason: Balancing Reasoning and Social Bias in MLLMs
by: Pan, Zhenyu, et al.
Published: (2025)

Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
by: Lu, Shuo, et al.
Published: (2026)