| Main Authors: | Yu, Keunwoo Peter; Chai, Joyce |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.11326 |
Similar Items
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
by: Yu, Keunwoo Peter, et al.
Published: (2023)
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
by: Ma, Ziqiao, et al.
Published: (2023)
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
by: Zhang, Yichi, et al.
Published: (2024)
Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
by: Xue, Haotian, et al.
Published: (2025)
Multi-Object Hallucination in Vision-Language Models
by: Chen, Xuweiyi, et al.
Published: (2024)
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
by: Wang, Ye, et al.
Published: (2025)
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
by: Gao, Hong, et al.
Published: (2025)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
by: Wang, Haibo, et al.
Published: (2024)
DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks
by: Zhou, Haokun, et al.
Published: (2024)
PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models
by: Guang, Jiahui, et al.
Published: (2026)
Temporal Grounding of Activities using Multimodal Large Language Models
by: Song, Young Chol
Published: (2024)
How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
by: Yang, Luyu, et al.
Published: (2026)
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
by: Yu, Shoubin, et al.
Published: (2025)
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
by: Chen, Boyuan, et al.
Published: (2026)
SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
by: Perez, Alejandra, et al.
Published: (2026)
Benchmarking and Mitigating Sycophancy in Medical Vision Language Models
by: Xu, Juangui, et al.
Published: (2025)
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
by: Xu, Kepeng, et al.
Published: (2026)
Physically Grounded Vision-Language Models for Robotic Manipulation
by: Gao, Jensen, et al.
Published: (2023)
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
by: Wang, Ruofan, et al.
Published: (2024)
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
by: Ernhofer, Benjamin Raphael, et al.
Published: (2025)
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
by: Kumbhar, Shrinidhi, et al.
Published: (2026)
To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models
by: Aranya, OFM Riaz Rahman, et al.
Published: (2026)
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
by: Irvin, Jeremy Andrew, et al.
Published: (2024)
Argus: Benchmarking and Enhancing Vision-Language Models for 3D Radiology Report Generation
by: Liu, Che, et al.
Published: (2024)
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark
by: Huang, Han, et al.
Published: (2024)
Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
by: Tumu, Akshar, et al.
Published: (2025)
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
by: Wang, JiYang, et al.
Published: (2026)
Benchmarking Vision Language Models for Cultural Understanding
by: Nayak, Shravan, et al.
Published: (2024)
Factorized Learning for Temporally Grounded Video-Language Models
by: Zeng, Wenzheng, et al.
Published: (2025)
Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation
by: Zhang, Jihai, et al.
Published: (2025)
TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos
by: Yu, Yakun, et al.
Published: (2026)
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
by: Ranjan, Mukul, et al.
Published: (2026)
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
by: Lee, Youngwan, et al.
Published: (2025)
VLMine: Long-Tail Data Mining with Vision Language Models
by: Ye, Mao, et al.
Published: (2024)
STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
by: Nguyen-Nhu, Tinh-Anh, et al.
Published: (2025)
3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models
by: Sambara, Sraavya, et al.
Published: (2025)
Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
by: Samin, Niamul Hassan, et al.
Published: (2026)
Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
by: Rahman, Md Ashikur, et al.
Published: (2026)
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
by: Zhang, Jusheng, et al.
Published: (2025)
CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models
by: Cai, Jie, et al.
Published: (2025)