Library Catalog

Bibliographic Details
Main Author:	Nortje, Leanne
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.02865

Similar Items

Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings
by: Nortje, Leanne, et al.
Published: (2024)

Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
by: Khorrami, Khazar, et al.
Published: (2021)

Grounding Language Models for Visual Entity Recognition
by: Xiao, Zilin, et al.
Published: (2024)

Visually Grounded Speech Models have a Mutual Exclusivity Bias
by: Nortje, Leanne, et al.
Published: (2024)

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)

Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
by: Huemann, Zachary, et al.
Published: (2025)

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction
by: Teng, Matthew Kit Khinn, et al.
Published: (2025)

GroundingGPT: Language Enhanced Multi-modal Grounding Model
by: Li, Zhaowei, et al.
Published: (2024)

The Mechanistic Emergence of Symbol Grounding in Language Models
by: Wu, Shuyu, et al.
Published: (2025)

Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models
by: Qiu, Yifu, et al.
Published: (2025)

Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
by: Xu, Xiaoxu, et al.
Published: (2023)

NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation
by: Wanchoo, Karan, et al.
Published: (2024)

Tailored Design of Audio-Visual Speech Recognition Models using Branchformers
by: Gimeno-Gómez, David, et al.
Published: (2024)

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
by: Zhao, Haoyu, et al.
Published: (2024)

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
by: Rajabi, Navid, et al.
Published: (2023)

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
by: Li, Yunxin, et al.
Published: (2024)

Generalizable Entity Grounding via Assistance of Large Language Model
by: Qi, Lu, et al.
Published: (2024)

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
by: Chen, Jiaxing, et al.
Published: (2024)

Show and Guide: Instructional-Plan Grounded Vision and Language Model
by: Glória-Silva, Diogo, et al.
Published: (2024)

Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
by: Li, Jun, et al.
Published: (2025)

Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models
by: Lu, Jiaying, et al.
Published: (2023)

Visual Representations inside the Language Model
by: Liu, Benlin, et al.
Published: (2025)

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
by: Xiao, Han, et al.
Published: (2025)

Why are Visually-Grounded Language Models Bad at Image Classification?
by: Zhang, Yuhui, et al.
Published: (2024)

Towards Visual Text Grounding of Multimodal Large Language Model
by: Li, Ming, et al.
Published: (2025)

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding
by: Yoon, Hee Suk, et al.
Published: (2026)

Clean Evaluations on Contaminated Visual Language Models
by: Lu, Hongyuan, et al.
Published: (2024)

Do Vision-Language Models Really Understand Visual Language?
by: Hou, Yifan, et al.
Published: (2024)

Visually grounded few-shot word learning in low-resource settings
by: Nortje, Leanne, et al.
Published: (2023)

Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models
by: Wang, Tianyu, et al.
Published: (2024)

SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
by: Wang, Siting, et al.
Published: (2025)

Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
by: Wang, Yuxuan, et al.
Published: (2024)

Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
by: Geigle, Gregor, et al.
Published: (2024)

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
by: Gröpl, Marcel, et al.
Published: (2026)

A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
by: Kojima, Noriyuki, et al.
Published: (2023)

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models
by: Biswas, Shristi Das, et al.
Published: (2026)

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
by: Hu, Yushi, et al.
Published: (2024)

Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
by: Ma, Chuofan, et al.
Published: (2024)

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
by: Prasad, Archiki, et al.
Published: (2023)

First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
by: Ha, Jiwoo, et al.
Published: (2026)