:: Library Catalog

Bibliographic Details
Main Authors:	Woo, Byeongju; Wang, Zilin; Pak, Byeonghyun; Mo, Sangwoo; Yu, Stella X.
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2602.02977

Similar Items

Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
by: Lee, Seokmin, et al.
Published: (2026)

Open Ad-hoc Categorization with Contextualized Feature Learning
by: Wang, Zilin, et al.
Published: (2025)

Rethinking FID Through the Geometry of the Reference Dataset
by: Lee, Yunghee, et al.
Published: (2026)

Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
by: Pak, Byeonghyun, et al.
Published: (2024)

Learning Hierarchical Image Segmentation For Recognition and By Recognition
by: Ke, Tsung-Wei, et al.
Published: (2022)

Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
by: Du, Zilin, et al.
Published: (2024)

Understanding the Effects of Distractors on Reasoning Vision-Language Models
by: Bae, Jiyun, et al.
Published: (2025)

ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning
by: Kim, Taewhan, et al.
Published: (2024)

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
by: Hsieh, Yu-Guan, et al.
Published: (2024)

Align Your Query: Representation Alignment for Multimodality Medical Object Detection
by: Seo, Ara, et al.
Published: (2025)

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning
by: Kim, Si-Woo, et al.
Published: (2025)

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
by: Bansal, Hritik, et al.
Published: (2024)

Sample Selection via Contrastive Fragmentation for Noisy Label Regression
by: Kim, Chris Dongjoo, et al.
Published: (2025)

Co-domain Symmetry for Complex-Valued Deep Learning
by: Singhal, Utkarsh, et al.
Published: (2021)

Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution
by: Kim, Bryan Sangwoo, et al.
Published: (2026)

Extreme Blind Image Restoration via Prompt-Conditioned Information Bottleneck
by: Kim, Hongeun, et al.
Published: (2025)

Generalizable Geometric Image Caption Synthesis
by: Xin, Yue, et al.
Published: (2025)

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning
by: Mo, Shentong, et al.
Published: (2024)

IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
by: Lee, Soeun, et al.
Published: (2024)

Image Captions are Natural Prompts for Text-to-Image Models
by: Lei, Shiye, et al.
Published: (2023)

Free$^2$Guide: Training-Free Text-to-Video Alignment using Image LVLM
by: Kim, Jaemin, et al.
Published: (2024)

Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision
by: Cao, Shengcao, et al.
Published: (2024)

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
by: Li, Jialuo, et al.
Published: (2025)

Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding
by: Huy, Ta Duc, et al.
Published: (2025)

Controllable Hybrid Captioner for Improved Long-form Video Understanding
by: Sasse, Kuleen, et al.
Published: (2025)

Aligning Audio-Visual Joint Representations with an Agentic Workflow
by: Mo, Shentong, et al.
Published: (2024)

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
by: Yang, Chenglin, et al.
Published: (2023)

VeCLIP: Improving CLIP Training via Visual-enriched Captions
by: Lai, Zhengfeng, et al.
Published: (2023)

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
by: He, Ziyao, et al.
Published: (2026)

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
by: Hashemi, Mohammad Abuzar, et al.
Published: (2021)

LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
by: Mo, Shentong, et al.
Published: (2026)

Semi-Supervised Image Captioning Considering Wasserstein Graph Matching
by: Yang, Yang
Published: (2024)

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
by: Xiong, Tianwei, et al.
Published: (2024)

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
by: Dalvi, Abhishek, et al.
Published: (2026)

RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
by: Huang, Tzu-Heng, et al.
Published: (2026)

Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
by: You, Zuyao, et al.
Published: (2025)

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
by: Fiastre, Gabriel, et al.
Published: (2025)

ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
by: Kim, Chris Dongjoo, et al.
Published: (2025)

Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
by: Anaissi, Ali, et al.
Published: (2025)

Towards Understanding Visual Grounding in Visual Language Models
by: Pantazopoulos, Georgios, et al.
Published: (2025)