| Main Authors: | Willemsen, Bram; Skantze, Gabriel |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.05721 |
Similar Items
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
by: Willemsen, Bram, et al.
Published: (2025)
ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
by: Hu, Yizhi, et al.
Published: (2025)
Exploring Spatial Language Grounding Through Referring Expressions
by: Tumu, Akshar, et al.
Published: (2025)
Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
by: Tumu, Akshar, et al.
Published: (2025)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
by: Shen, Li-Cheng, et al.
Published: (2025)
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
by: Dong, Qihua, et al.
Published: (2026)
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
by: Liu, Ting, et al.
Published: (2024)
CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
by: Zhang, Zhi, et al.
Published: (2023)
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
by: Li, Aiden Yiliu, et al.
Published: (2025)
The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
by: Hakimov, Sherzod, et al.
Published: (2026)
Latent Expression Generation for Referring Image Segmentation and Grounding
by: Yu, Seonghoon, et al.
Published: (2025)
VGR: Visual Grounded Reasoning
by: Wang, Jiacong, et al.
Published: (2025)
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)
Referring Expression Comprehension for Small Objects
by: Goto, Kanoko, et al.
Published: (2025)
Efficient Adaptation For Remote Sensing Visual Grounding
by: Moughnieh, Hasan, et al.
Published: (2025)
The Role of Entropy in Visual Grounding: Analysis and Optimization
by: Li, Shuo, et al.
Published: (2025)
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
by: Chen, Yi-Chia, et al.
Published: (2024)
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
by: Kim, Youngmin, et al.
Published: (2025)
Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)
Visual Grounding Methods for VQA are Working for the Wrong Reasons!
by: Shrestha, Robik, et al.
Published: (2020)
Can VLMs Recall Factual Associations From Visual References?
by: Ashok, Dhananjay, et al.
Published: (2025)
Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
by: Ghosh, Sreyan, et al.
Published: (2024)
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
by: Wu, Qianhui, et al.
Published: (2025)
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
by: Vaishnav, Mohit, et al.
Published: (2026)
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
by: Zhu, Chenming, et al.
Published: (2024)
DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams
by: Iyengar, Anirudh Iyengar Kaniyar Narayana, et al.
Published: (2026)
Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
by: Qiu, Yifu, et al.
Published: (2025)
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
by: Acuna, David, et al.
Published: (2025)
Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
by: Li, Bin, et al.
Published: (2022)
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
by: Zhao, Haoyu, et al.
Published: (2024)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
by: Gou, Boyu, et al.
Published: (2024)
Watch Before You Answer: Learning from Visually Grounded Post-Training
by: Zhang, Yuxuan, et al.
Published: (2026)
Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
by: Mishra, Sandeep, et al.
Published: (2026)
KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
by: Ma, Xinyu, et al.
Published: (2025)
AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
by: Chen, Tuochao, et al.
Published: (2025)
Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
by: Li, Zejun, et al.
Published: (2024)
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
by: Parmar, Paritosh, et al.
Published: (2024)
VIMI: Grounding Video Generation through Multi-modal Instruction
by: Fang, Yuwei, et al.
Published: (2024)