:: Library Catalog

Bibliographic Details
Main Authors:	Willemsen, Bram; Skantze, Gabriel
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.05721

Similar Items

Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
by: Willemsen, Bram, et al.
Published: (2025)

ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
by: Hu, Yizhi, et al.
Published: (2025)

Exploring Spatial Language Grounding Through Referring Expressions
by: Tumu, Akshar, et al.
Published: (2025)

Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
by: Tumu, Akshar, et al.
Published: (2025)

Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval
by: Shen, Li-Cheng, et al.
Published: (2025)

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
by: Dong, Qihua, et al.
Published: (2026)

MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
by: Liu, Ting, et al.
Published: (2024)

CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
by: Zhang, Zhi, et al.
Published: (2023)

Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
by: Li, Aiden Yiliu, et al.
Published: (2025)

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
by: Hakimov, Sherzod, et al.
Published: (2026)

Latent Expression Generation for Referring Image Segmentation and Grounding
by: Yu, Seonghoon, et al.
Published: (2025)

VGR: Visual Grounded Reasoning
by: Wang, Jiacong, et al.
Published: (2025)

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)

Referring Expression Comprehension for Small Objects
by: Goto, Kanoko, et al.
Published: (2025)

Efficient Adaptation For Remote Sensing Visual Grounding
by: Moughnieh, Hasan, et al.
Published: (2025)

The Role of Entropy in Visual Grounding: Analysis and Optimization
by: Li, Shuo, et al.
Published: (2025)

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
by: Chen, Yi-Chia, et al.
Published: (2024)

Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
by: Kim, Youngmin, et al.
Published: (2025)

Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)

Visual Grounding Methods for VQA are Working for the Wrong Reasons!
by: Shrestha, Robik, et al.
Published: (2020)

Can VLMs Recall Factual Associations From Visual References?
by: Ashok, Dhananjay, et al.
Published: (2025)

Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs
by: Ghosh, Sreyan, et al.
Published: (2024)

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
by: Wu, Qianhui, et al.
Published: (2025)

Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
by: Vaishnav, Mohit, et al.
Published: (2026)

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
by: Zhu, Chenming, et al.
Published: (2024)

DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams
by: Iyengar, Anirudh Iyengar Kaniyar Narayana, et al.
Published: (2026)

Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
by: Qiu, Yifu, et al.
Published: (2025)

Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
by: Acuna, David, et al.
Published: (2025)

Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
by: Li, Bin, et al.
Published: (2022)

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
by: Zhao, Haoyu, et al.
Published: (2024)

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
by: Gou, Boyu, et al.
Published: (2024)

Watch Before You Answer: Learning from Visually Grounded Post-Training
by: Zhang, Yuxuan, et al.
Published: (2026)

Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
by: Mishra, Sandeep, et al.
Published: (2026)

KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
by: Ma, Xinyu, et al.
Published: (2025)

AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
by: Chen, Tuochao, et al.
Published: (2025)

Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
by: Li, Zejun, et al.
Published: (2024)

CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
by: Parmar, Paritosh, et al.
Published: (2024)

VIMI: Grounding Video Generation through Multi-modal Instruction
by: Fang, Yuwei, et al.
Published: (2024)