| Main Authors: | Woo, Byeongju, Wang, Zilin, Pak, Byeonghyun, Mo, Sangwoo, Yu, Stella X. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.02977 |
Similar Items
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
by: Lee, Seokmin, et al.
Published: (2026)
Open Ad-hoc Categorization with Contextualized Feature Learning
by: Wang, Zilin, et al.
Published: (2025)
Rethinking FID Through the Geometry of the Reference Dataset
by: Lee, Yunghee, et al.
Published: (2026)
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
by: Pak, Byeonghyun, et al.
Published: (2024)
Learning Hierarchical Image Segmentation For Recognition and By Recognition
by: Ke, Tsung-Wei, et al.
Published: (2022)
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
by: Du, Zilin, et al.
Published: (2024)
Understanding the Effects of Distractors on Reasoning Vision-Language Models
by: Bae, Jiyun, et al.
Published: (2025)
ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning
by: Kim, Taewhan, et al.
Published: (2024)
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
by: Hsieh, Yu-Guan, et al.
Published: (2024)
Align Your Query: Representation Alignment for Multimodality Medical Object Detection
by: Seo, Ara, et al.
Published: (2025)
SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
by: Kim, Si-Woo, et al.
Published: (2025)
TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
by: Bansal, Hritik, et al.
Published: (2024)
Sample Selection via Contrastive Fragmentation for Noisy Label Regression
by: Kim, Chris Dongjoo, et al.
Published: (2025)
Co-domain Symmetry for Complex-Valued Deep Learning
by: Singhal, Utkarsh, et al.
Published: (2021)
Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
by: Kim, Bryan Sangwoo, et al.
Published: (2026)
Extreme Blind Image Restoration via Prompt-Conditioned Information Bottleneck
by: Kim, Hongeun, et al.
Published: (2025)
Generalizable Geometric Image Caption Synthesis
by: Xin, Yue, et al.
Published: (2025)
LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
by: Mo, Shentong, et al.
Published: (2024)
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
by: Lee, Soeun, et al.
Published: (2024)
Image Captions are Natural Prompts for Text-to-Image Models
by: Lei, Shiye, et al.
Published: (2023)
Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM
by: Kim, Jaemin, et al.
Published: (2024)
Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision
by: Cao, Shengcao, et al.
Published: (2024)
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
by: Li, Jialuo, et al.
Published: (2025)
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding
by: Huy, Ta Duc, et al.
Published: (2025)
Controllable Hybrid Captioner for Improved Long-form Video Understanding
by: Sasse, Kuleen, et al.
Published: (2025)
Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
by: Yang, Chenglin, et al.
Published: (2023)
VeCLIP: Improving CLIP Training via Visual-enriched Captions
by: Lai, Zhengfeng, et al.
Published: (2023)
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
by: He, Ziyao, et al.
Published: (2026)
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
by: Hashemi, Mohammad Abuzar, et al.
Published: (2021)
LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
by: Mo, Shentong, et al.
Published: (2026)
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
by: Yang, Yang
Published: (2024)
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
by: Xiong, Tianwei, et al.
Published: (2024)
Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
by: Dalvi, Abhishek, et al.
Published: (2026)
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
by: Huang, Tzu-Heng, et al.
Published: (2026)
Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
by: You, Zuyao, et al.
Published: (2025)
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
by: Fiastre, Gabriel, et al.
Published: (2025)
ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
by: Kim, Chris Dongjoo, et al.
Published: (2025)
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
by: Anaissi, Ali, et al.
Published: (2025)
Towards Understanding Visual Grounding in Visual Language Models
by: Pantazopoulos, Georgios, et al.
Published: (2025)