| Main Authors: | Xu, Kepeng, Xu, Li, He, Gang, Yu, Wenxin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.11727 |
Similar Items
Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
by: Wang, Jun, et al.
Published: (2026)
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
by: Wang, Ye, et al.
Published: (2025)
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
by: Hu, Wenbo, et al.
Published: (2025)
Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
by: Tumu, Akshar, et al.
Published: (2025)
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
by: Tian, Yu, et al.
Published: (2024)
Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI?
by: Xu, Qipan, et al.
Published: (2025)
Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments
by: Li, Zerui, et al.
Published: (2025)
Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
by: Qiu, Yifu, et al.
Published: (2025)
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
by: Padhan, Swagat, et al.
Published: (2026)
Learning Language Structures through Grounding
by: Shi, Freda
Published: (2024)
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
by: Lee, Kang-il, et al.
Published: (2024)
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
by: Deng, Guifeng, et al.
Published: (2026)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos
by: Chen, Haodong, et al.
Published: (2026)
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
by: Jia, Baoxiong, et al.
Published: (2024)
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
by: Xiao, Wenyi, et al.
Published: (2026)
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
by: Li, You, et al.
Published: (2025)
What Limits Vision-and-Language Navigation?
by: Wang, Yunheng, et al.
Published: (2026)
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
by: Ma, Ziqiao, et al.
Published: (2023)
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
by: Tu, Shangqing, et al.
Published: (2025)
Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling
by: Chen, Qiyuan, et al.
Published: (2026)
Multi-Object Hallucination in Vision-Language Models
by: Chen, Xuweiyi, et al.
Published: (2024)
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models
by: Xu, Shicheng, et al.
Published: (2024)
First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
by: Ha, Jiwoo, et al.
Published: (2026)
Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
by: Zhong, Weihong, et al.
Published: (2024)
Scaling Language-Centric Omnimodal Representation Learning
by: Xiao, Chenghao, et al.
Published: (2025)
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
by: He, Lixuan, et al.
Published: (2025)
Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers
by: Rao, Xuan, et al.
Published: (2025)
Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations
by: Li, Yanshu
Published: (2025)
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
by: Xu, Boshen, et al.
Published: (2025)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
by: Yang, Senqiao, et al.
Published: (2025)
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
by: Li, Aiden Yiliu, et al.
Published: (2025)
Correctable Landmark Discovery via Large Models for Vision-Language Navigation
by: Lin, Bingqian, et al.
Published: (2024)
MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models
by: Yan, Qiao, et al.
Published: (2025)
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
by: Yu, Keunwoo Peter, et al.
Published: (2023)
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
by: Liang, Xiwen, et al.
Published: (2025)
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
by: Kim, Youngmin, et al.
Published: (2025)
Factorized Learning for Temporally Grounded Video-Language Models
by: Zeng, Wenzheng, et al.
Published: (2025)
GutenOCR: A Grounded Vision-Language Front-End for Documents
by: Heidenreich, Hunter, et al.
Published: (2026)
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
by: Prasad, Archiki, et al.
Published: (2023)