| Main Authors: | Pan, Jun-Yu, Wang, Yansen, Zhang, Enze, Lu, Bao-Liang, Zheng, Wei-Long, Li, Dongsheng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.18172 |
Similar Items
A Large-scale Medical Visual Task Adaptation Benchmark
by: Mo, Shentong, et al.
Published: (2024)
LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
by: Mo, Shentong, et al.
Published: (2024)
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
by: Li, Caorui, et al.
Published: (2025)
NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals
by: Jiang, Wei-Bang, et al.
Published: (2024)
State-Action Inpainting Diffuser for Continuous Control with Delay
by: Han, Dongqi, et al.
Published: (2026)
How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images
by: Liu, Guimeng, et al.
Published: (2026)
EEGFormer: Towards Transferable and Interpretable Large-Scale EEG Foundation Model
by: Chen, Yuqi, et al.
Published: (2024)
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter
by: Zhang, Hanbo, et al.
Published: (2021)
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition
by: Liu, Ziyu, et al.
Published: (2024)
Diffusion-CAM: Faithful Visual Explanations for dMLLMs
by: Zuo, Haomin, et al.
Published: (2026)
Generating by Understanding: Neural Visual Generation with Logical Symbol Groundings
by: Peng, Yifei, et al.
Published: (2023)
Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning
by: Pan, Enze
Published: (2026)
MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
by: Zheng, Naishan, et al.
Published: (2025)
CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
by: Zhang, Yang, et al.
Published: (2026)
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
by: Zhang, Situo, et al.
Published: (2026)
Towards Understanding Visual Grounding in Visual Language Models
by: Pantazopoulos, Georgios, et al.
Published: (2025)
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
by: Huang, Yangyu, et al.
Published: (2025)
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
by: Yan, Hao, et al.
Published: (2026)
Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
by: Li, Yuanshuai, et al.
Published: (2025)
Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)
EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
by: Lin, Nie, et al.
Published: (2025)
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
by: AI, Inclusion, et al.
Published: (2025)
S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
by: Jiang, Yankai, et al.
Published: (2026)
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
by: Zhu, Chenming, et al.
Published: (2024)
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs
by: Barrios, Wayner, et al.
Published: (2025)
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
by: Heng, Yongrui, et al.
Published: (2026)
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by: Zeng, Xiangyu, et al.
Published: (2024)
Visual Position Prompt for MLLM based Visual Grounding
by: Tang, Wei, et al.
Published: (2025)
Boosting Private Domain Understanding of Efficient MLLMs: A Tuning-free, Adaptive, Universal Prompt Optimization Framework
by: Liu, Jiang, et al.
Published: (2024)
Visual Grounding for Object-Level Generalization in Reinforcement Learning
by: Jiang, Haobin, et al.
Published: (2024)
ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling
by: Chen, Yuqi, et al.
Published: (2024)
RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments
by: Pan, Enze
Published: (2026)
AdaCodec: A Predictive Visual Code for Video MLLMs
by: Hou, Haowen, et al.
Published: (2026)
Autoregressive Visual Decoding from EEG Signals
by: Dai, Sicheng, et al.
Published: (2026)
VGR: Visual Grounded Reasoning
by: Wang, Jiacong, et al.
Published: (2025)
GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
by: Wang, Fengxiang, et al.
Published: (2026)
FairReason: Balancing Reasoning and Social Bias in MLLMs
by: Pan, Zhenyu, et al.
Published: (2025)
Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
by: Lu, Shuo, et al.
Published: (2026)