| Main Authors: | Wang, Shijie, Kim, Dahun, Taalimi, Ali, Sun, Chen, Kuo, Weicheng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2407.14563 |
Similar Items
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
by: Kim, Dahun, et al.
Published: (2023)
Zero-Shot 3D Visual Grounding from Vision-Language Models
by: Li, Rong, et al.
Published: (2025)
Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
by: Jin, Hyundong, et al.
Published: (2025)
Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning
by: Kim, Dahun, et al.
Published: (2026)
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
by: Lyu, Yuanhuiyi, et al.
Published: (2024)
Time-Scaling State-Space Models for Dense Video Captioning
by: Piergiovanni, AJ, et al.
Published: (2025)
Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models
by: Yang, Xiaoyu, et al.
Published: (2023)
Feature Projection Learning for Better Vision-Language Reasoning
by: Zhang, Yi, et al.
Published: (2026)
Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation
by: Bose, Sarosij, et al.
Published: (2025)
DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions
by: Zheng, Weicheng, et al.
Published: (2026)
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)
Do Pre-trained Vision-Language Models Encode Object States?
by: Newman, Kaleb, et al.
Published: (2024)
Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation
by: Chen, Bolei, et al.
Published: (2025)
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
by: Zhou, Yue, et al.
Published: (2024)
Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
by: Mallya, Ganesh, et al.
Published: (2025)
IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth
by: Islam, Md Touhidul, et al.
Published: (2025)
Dissecting Bit-Level Scaling Laws in Quantizing Vision Generative Models
by: Ding, Xin, et al.
Published: (2025)
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
by: Kang, Seil, et al.
Published: (2025)
Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
by: Huemann, Zachary, et al.
Published: (2025)
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
by: Li, Rong, et al.
Published: (2024)
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
by: Kim, Dahun, et al.
Published: (2025)
Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models
by: Lu, Jiaying, et al.
Published: (2023)
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
by: Sun, Haoyi, et al.
Published: (2026)
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
by: Li, Yangfu, et al.
Published: (2026)
Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models
by: Chen, Jiahe, et al.
Published: (2025)
Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
by: Nützel, Felix, et al.
Published: (2025)
Language-Guided Diffusion Model for Visual Grounding
by: Chen, Sijia, et al.
Published: (2023)
Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
by: Wang, Chenyu, et al.
Published: (2026)
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
by: Huang, Haifeng, et al.
Published: (2025)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
by: Padilla-Cerdio, Esteban, et al.
Published: (2026)
BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models
by: Hu, Xuefeng, et al.
Published: (2024)
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
by: Ernhofer, Benjamin Raphael, et al.
Published: (2025)
AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models
by: Mahmood, Hazza, et al.
Published: (2026)
D-Attn: Decomposed Attention for Large Vision-and-Language Models
by: Kuo, Chia-Wen, et al.
Published: (2025)
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
by: Piergiovanni, AJ, et al.
Published: (2023)
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding
by: Zheng, Henry, et al.
Published: (2025)
Consistency-guided Prompt Learning for Vision-Language Models
by: Roy, Shuvendu, et al.
Published: (2023)
AffordanceLLM: Grounding Affordance from Vision Language Models
by: Qian, Shengyi, et al.
Published: (2024)
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
by: Jian, Pu, et al.
Published: (2025)