:: Library Catalog

Bibliographic Details
Main Authors:	Sharma, Aditya; Yoffe, Luke; Höllerer, Tobias
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2401.08973

Similar Items

Open-Vocabulary Object Detection via Language Hierarchy
by: Huang, Jiaxing, et al.
Published: (2024)

Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
by: Guo, Lixing, et al.
Published: (2025)

HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection
by: Ma, Yuqi, et al.
Published: (2024)

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
by: Barsellotti, Luca, et al.
Published: (2024)

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
by: Ma, Ziqiao, et al.
Published: (2023)

Catch Me If You Can Describe Me: Open-Vocabulary Camouflaged Instance Segmentation with Diffusion
by: Vu, Tuan-Anh, et al.
Published: (2023)

Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
by: Allgeuer, Philipp, et al.
Published: (2024)

MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
by: Bansal, Hritik, et al.
Published: (2024)

MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
by: Yao, Jihan, et al.
Published: (2025)

Sampling Bag of Views for Open-Vocabulary Object Detection
by: Choi, Hojun, et al.
Published: (2024)

VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking
by: Qian, Zekun, et al.
Published: (2024)

Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
by: Guillen-Perez, Antonio
Published: (2025)

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
by: Saxon, Michael, et al.
Published: (2024)

Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention
by: Juanico, Drandreb Earl O., et al.
Published: (2025)

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
by: Li, Zizhao, et al.
Published: (2024)

Adaptive 3D UI Placement in Mixed Reality Using Deep Reinforcement Learning
by: Lu, Feiyu, et al.
Published: (2025)

Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
by: Surikuchi, Aditya K, et al.
Published: (2025)

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
by: Werby, Abdelrhman, et al.
Published: (2024)

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
by: De, Anik, et al.
Published: (2025)

DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning
by: Matsuda, Kazuki, et al.
Published: (2024)

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video
by: Sharma, Arun
Published: (2026)

Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph
by: Linok, Sergey, et al.
Published: (2024)

Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
by: Etchegaray, Djamahl, et al.
Published: (2024)

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision
by: Wang, Junjie, et al.
Published: (2024)

WoMAP: World Models For Embodied Open-Vocabulary Object Localization
by: Yin, Tenny, et al.
Published: (2025)

OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding
by: Deng, Yinan, et al.
Published: (2024)

Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
by: Surikuchi, Aditya K, et al.
Published: (2026)

GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
by: Shao, Yawen, et al.
Published: (2024)

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
by: Zhou, Yang, et al.
Published: (2025)

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
by: Huang, Kung-Hsiang, et al.
Published: (2024)

Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
by: Sakajo, Haruki, et al.
Published: (2025)

Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models
by: Zhu, Wanrong, et al.
Published: (2024)

Automatic benchmarking of large multimodal models via iterative experiment programming
by: Conti, Alessandro, et al.
Published: (2024)

Towards Open Vocabulary Learning: A Survey
by: Wu, Jianzong, et al.
Published: (2023)

Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
by: Zhang, Huixuan, et al.
Published: (2025)

A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
by: Broomfield, Julius, et al.
Published: (2025)

Mitigating Open-Vocabulary Caption Hallucinations
by: Ben-Kish, Assaf, et al.
Published: (2023)

NeoBabel: A Multilingual Open Tower for Visual Generation
by: Derakhshani, Mohammad Mahdi, et al.
Published: (2025)

Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
by: Ohi, Masanari, et al.
Published: (2024)