| Main Authors: | Ko, Dohwan, Lee, Ji Soo, Choi, Minhyuk, Meng, Zihang, Kim, Hyunwoo J. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2507.23284 |
Similar Items
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
by: Ko, Dohwan, et al.
Published: (2026)
Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization
by: Lee, Ji Soo, et al.
Published: (2025)
DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
by: Choi, Joonmyung, et al.
Published: (2026)
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
by: Choi, Joonmyung, et al.
Published: (2024)
Efficient multi-view training for 3D Gaussian Splatting
by: Choi, Minhyuk, et al.
Published: (2025)
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
by: Ko, Dohwan, et al.
Published: (2025)
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
by: Choi, Dasol, et al.
Published: (2026)
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
by: Lee, Ji Soo, et al.
Published: (2025)
Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives
by: Park, Ji-jun, et al.
Published: (2024)
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
by: Choi, Seung hee, et al.
Published: (2026)
Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation
by: Joo, Minseok, et al.
Published: (2026)
Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers
by: Lee, Sanghyeok, et al.
Published: (2024)
Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
by: Kim, Jongha, et al.
Published: (2026)
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
by: Jang, Young Kyun, et al.
Published: (2024)
Representation Shift: Unifying Token Compression with FlashAttention
by: Choi, Joonmyung, et al.
Published: (2025)
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval
by: Kim, Minkuk, et al.
Published: (2024)
Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models
by: Jung, Daniel Sungho, et al.
Published: (2026)
UDC-VIT: A Real-World Video Dataset for Under-Display Cameras
by: Ahn, Kyusu, et al.
Published: (2025)
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
by: Lee, Young-Jun, et al.
Published: (2025)
TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
by: Zhao, Zixu, et al.
Published: (2025)
Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation
by: Kim, Jaeyeong, et al.
Published: (2025)
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
by: Kim, Sangmin, et al.
Published: (2026)
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
by: Lee, Sanghyeok, et al.
Published: (2024)
Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization
by: Lim, Geuntaek, et al.
Published: (2024)
Learning Equi-angular Representations for Online Continual Learning
by: Seo, Minhyuk, et al.
Published: (2024)
ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation
by: Min, Yunhong, et al.
Published: (2025)
Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models
by: Kim, Mingyeong, et al.
Published: (2026)
Fine Tuning Text-to-Image Diffusion Models for Correcting Anomalous Images
by: Yoo, Hyunwoo
Published: (2024)
Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models
by: Hwang, Jisung, et al.
Published: (2025)
Can Language Models Laugh at YouTube Short-form Videos?
by: Ko, Dayoon, et al.
Published: (2023)
Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks
by: Li, Qian, et al.
Published: (2024)
Prompt Learning via Meta-Regularization
by: Park, Jinyoung, et al.
Published: (2024)
Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis
by: Ko, Juyeon, et al.
Published: (2024)
Intriguing Properties of Large Language and Vision Models
by: Lee, Young-Jun, et al.
Published: (2024)
Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models
by: Shen, Yiqing, et al.
Published: (2025)
ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models
by: Chytas, Sotirios Panagiotis, et al.
Published: (2025)
Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble
by: Cha, Juhan, et al.
Published: (2024)
Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval
by: Fang, Xiang, et al.
Published: (2026)
Slow-Fast Architecture for Video Multi-Modal Large Language Models
by: Shi, Min, et al.
Published: (2025)
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
by: Son, Taein, et al.
Published: (2024)