| Main Authors: | Sharma, Aditya, Yoffe, Luke, Höllerer, Tobias |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2401.08973 |
Similar Items
Open-Vocabulary Object Detection via Language Hierarchy
by: Huang, Jiaxing, et al.
Published: (2024)
Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
by: Guo, Lixing, et al.
Published: (2025)
HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection
by: Ma, Yuqi, et al.
Published: (2024)
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
by: Barsellotti, Luca, et al.
Published: (2024)
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
by: Ma, Ziqiao, et al.
Published: (2023)
Catch Me If You Can Describe Me: Open-Vocabulary Camouflaged Instance Segmentation with Diffusion
by: Vu, Tuan-Anh, et al.
Published: (2023)
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
by: Sharma, Aditya, et al.
Published: (2024)
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
by: Allgeuer, Philipp, et al.
Published: (2024)
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
by: Bansal, Hritik, et al.
Published: (2024)
MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
by: Yao, Jihan, et al.
Published: (2025)
Sampling Bag of Views for Open-Vocabulary Object Detection
by: Choi, Hojun, et al.
Published: (2024)
VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking
by: Qian, Zekun, et al.
Published: (2024)
Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
by: Guillen-Perez, Antonio
Published: (2025)
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
by: Saxon, Michael, et al.
Published: (2024)
Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention
by: Juanico, Drandreb Earl O., et al.
Published: (2025)
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
by: Li, Zizhao, et al.
Published: (2024)
Adaptive 3D UI Placement in Mixed Reality Using Deep Reinforcement Learning
by: Lu, Feiyu, et al.
Published: (2025)
Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
by: Surikuchi, Aditya K, et al.
Published: (2025)
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
by: Werby, Abdelrhman, et al.
Published: (2024)
Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
by: De, Anik, et al.
Published: (2025)
DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning
by: Matsuda, Kazuki, et al.
Published: (2024)
GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video
by: Sharma, Arun
Published: (2026)
Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph
by: Linok, Sergey, et al.
Published: (2024)
Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
by: Etchegaray, Djamahl, et al.
Published: (2024)
OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision
by: Wang, Junjie, et al.
Published: (2024)
WoMAP: World Models For Embodied Open-Vocabulary Object Localization
by: Yin, Tenny, et al.
Published: (2025)
OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding
by: Deng, Yinan, et al.
Published: (2024)
Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
by: Surikuchi, Aditya K, et al.
Published: (2026)
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
by: Shao, Yawen, et al.
Published: (2024)
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
by: Zhou, Yang, et al.
Published: (2025)
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
by: Huang, Kung-Hsiang, et al.
Published: (2024)
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
by: Sakajo, Haruki, et al.
Published: (2025)
Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models
by: Zhu, Wanrong, et al.
Published: (2024)
Automatic benchmarking of large multimodal models via iterative experiment programming
by: Conti, Alessandro, et al.
Published: (2024)
Towards Open Vocabulary Learning: A Survey
by: Wu, Jianzong, et al.
Published: (2023)
Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
by: Zhang, Huixuan, et al.
Published: (2025)
A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
by: Broomfield, Julius, et al.
Published: (2025)
Mitigating Open-Vocabulary Caption Hallucinations
by: Ben-Kish, Assaf, et al.
Published: (2023)
NeoBabel: A Multilingual Open Tower for Visual Generation
by: Derakhshani, Mohammad Mahdi, et al.
Published: (2025)
Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
by: Ohi, Masanari, et al.
Published: (2024)