Library Catalog

Bibliographic Details
Main Authors:	Park, Jaeyoo; Chun, Sanghyuk; Kim, Wonjae; Yun, Sangdoo; Han, Bohyung
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.19389

Similar Items

Probabilistic Language-Image Pre-Training
by: Chun, Sanghyuk, et al.
Published: (2024)

HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
by: Kim, Wonjae, et al.
Published: (2024)

LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
by: Chun, Sanghyuk, et al.
Published: (2025)

Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)

Language-only Efficient Training of Zero-shot Composed Image Retrieval
by: Gu, Geonmo, et al.
Published: (2023)

Cross-Class Feature Augmentation for Class Incremental Learning
by: Kim, Taehoon, et al.
Published: (2023)

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
by: Park, Jaeyoo, et al.
Published: (2024)

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
by: Gu, Geonmo, et al.
Published: (2023)

RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
by: Park, Seulki, et al.
Published: (2023)

An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
by: Byun, Jaeseok, et al.
Published: (2024)

ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
by: Chun, Sanghyuk, et al.
Published: (2022)

Improved Probabilistic Image-Text Representations
by: Chun, Sanghyuk
Published: (2023)

PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
by: Kim, Mijeong, et al.
Published: (2025)

Learning with Unmasked Tokens Drives Stronger Vision Learners
by: Kim, Taekyung, et al.
Published: (2023)

Rotary Position Embedding for Vision Transformer
by: Heo, Byeongho, et al.
Published: (2024)

Token Bottleneck: One Token to Remember Dynamics
by: Kim, Taekyung, et al.
Published: (2025)

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
by: Park, Yong-Hyun, et al.
Published: (2024)

Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning
by: Chun, Sanghyuk
Published: (2025)

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
by: Kim, Minji, et al.
Published: (2025)

GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes
by: Kim, Mijeong, et al.
Published: (2026)

DNNs May Determine Major Properties of Their Outputs Early, with Timing Possibly Driven by Bias
by: Park, Song, et al.
Published: (2025)

FIFO-Diffusion: Generating Infinite Videos from Text without Training
by: Kim, Jihwan, et al.
Published: (2024)

Towards Calibrated Robust Fine-Tuning of Vision-Language Models
by: Oh, Changdae, et al.
Published: (2023)

Leveraging Temporal Contextualization for Video Action Recognition
by: Kim, Minji, et al.
Published: (2024)

Masking meets Supervision: A Strong Learning Alliance
by: Heo, Byeongho, et al.
Published: (2023)

Model Stock: All we need is just a few fine-tuned models
by: Jang, Dong-Hwan, et al.
Published: (2024)

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
by: Song, Junha, et al.
Published: (2026)

Read, Watch and Scream! Sound Generation from Text and Video
by: Jeong, Yujin, et al.
Published: (2024)

Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images
by: Kim, Jiwon, et al.
Published: (2023)

4D Gaussian Splatting in the Wild with Uncertainty-Aware Regularization
by: Kim, Mijeong, et al.
Published: (2024)

CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition
by: Semnani, Sina J., et al.
Published: (2025)

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
by: Park, Yeji, et al.
Published: (2025)

TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment
by: Baek, Kanghyun, et al.
Published: (2025)

Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
by: Chu, Sanghyeok, et al.
Published: (2025)

Diffusion-Based Conditional Image Editing through Optimized Inference with Guidance
by: Lee, Hyunsoo, et al.
Published: (2024)

Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
by: Lee, Junsung, et al.
Published: (2024)

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
by: Kwak, Min-Seop, et al.
Published: (2025)

ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
by: Kim, Hoyoung, et al.
Published: (2026)

ODGS: 3D Scene Reconstruction from Omnidirectional Images with 3D Gaussian Splattings
by: Lee, Suyoung, et al.
Published: (2024)

Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning
by: Kim, Taehoon, et al.
Published: (2025)