:: Library Catalog

Bibliographic Details
Main Authors:	Ko, Dohwan, Lee, Ji Soo, Choi, Minhyuk, Meng, Zihang, Kim, Hyunwoo J.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.23284

Similar Items

MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
by: Ko, Dohwan, et al.
Published: (2026)

Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization
by: Lee, Ji Soo, et al.
Published: (2025)

DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
by: Choi, Joonmyung, et al.
Published: (2026)

vid-TLDR: Training Free Token merging for Light-weight Video Transformer
by: Choi, Joonmyung, et al.
Published: (2024)

Efficient multi-view training for 3D Gaussian Splatting
by: Choi, Minhyuk, et al.
Published: (2025)

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
by: Ko, Dohwan, et al.
Published: (2025)

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
by: Choi, Dasol, et al.
Published: (2026)

VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
by: Lee, Ji Soo, et al.
Published: (2025)

Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives
by: Park, Ji-jun, et al.
Published: (2024)

Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
by: Choi, Seung hee, et al.
Published: (2026)

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation
by: Joo, Minseok, et al.
Published: (2026)

Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
by: Lee, Sanghyeok, et al.
Published: (2024)

Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
by: Kim, Jongha, et al.
Published: (2026)

Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
by: Jang, Young Kyun, et al.
Published: (2024)

Representation Shift: Unifying Token Compression with FlashAttention
by: Choi, Joonmyung, et al.
Published: (2025)

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
by: Kim, Minkuk, et al.
Published: (2024)

Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models
by: Jung, Daniel Sungho, et al.
Published: (2026)

UDC-VIT: A Real-World Video Dataset for Under-Display Cameras
by: Ahn, Kyusu, et al.
Published: (2025)

MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
by: Lee, Young-Jun, et al.
Published: (2025)

TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
by: Zhao, Zixu, et al.
Published: (2025)

Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation
by: Kim, Jaeyeong, et al.
Published: (2025)

Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
by: Kim, Sangmin, et al.
Published: (2026)

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
by: Lee, Sanghyeok, et al.
Published: (2024)

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization
by: Lim, Geuntaek, et al.
Published: (2024)

Learning Equi-angular Representations for Online Continual Learning
by: Seo, Minhyuk, et al.
Published: (2024)

ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
by: Min, Yunhong, et al.
Published: (2025)

Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
by: Kim, Mingyeong, et al.
Published: (2026)

Fine Tuning Text-to-Image Diffusion Models for Correcting Anomalous Images
by: Yoo, Hyunwoo
Published: (2024)

Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models
by: Hwang, Jisung, et al.
Published: (2025)

Can Language Models Laugh at YouTube Short-form Videos?
by: Ko, Dayoon, et al.
Published: (2023)

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks
by: Li, Qian, et al.
Published: (2024)

Prompt Learning via Meta-Regularization
by: Park, Jinyoung, et al.
Published: (2024)

Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis
by: Ko, Juyeon, et al.
Published: (2024)

Intriguing Properties of Large Language and Vision Models
by: Lee, Young-Jun, et al.
Published: (2024)

Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models
by: Shen, Yiqing, et al.
Published: (2025)

ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models
by: Chytas, Sotirios Panagiotis, et al.
Published: (2025)

Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble
by: Cha, Juhan, et al.
Published: (2024)

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval
by: Fang, Xiang, et al.
Published: (2026)

Slow-Fast Architecture for Video Multi-Modal Large Language Models
by: Shi, Min, et al.
Published: (2025)

JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
by: Son, Taein, et al.
Published: (2024)