:: Library Catalog

Bibliographic Details
Main Authors:	Kim, Wonjae; Chun, Sanghyuk; Kim, Taekyung; Han, Dongyoon; Yun, Sangdoo
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.17507

Similar Items

Emergence of Text Readability in Vision Language Models
by: Park, Jaeyoo, et al.
Published: (2025)

Probabilistic Language-Image Pre-Training
by: Chun, Sanghyuk, et al.
Published: (2024)

Learning with Unmasked Tokens Drives Stronger Vision Learners
by: Kim, Taekyung, et al.
Published: (2023)

Language-only Efficient Training of Zero-shot Composed Image Retrieval
by: Gu, Geonmo, et al.
Published: (2023)

Masking meets Supervision: A Strong Learning Alliance
by: Heo, Byeongho, et al.
Published: (2023)

Token Bottleneck: One Token to Remember Dynamics
by: Kim, Taekyung, et al.
Published: (2025)

LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
by: Chun, Sanghyuk, et al.
Published: (2025)

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
by: Gu, Geonmo, et al.
Published: (2023)

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
by: Kwak, Min-Seop, et al.
Published: (2025)

Morphing Tokens Draw Strong Masked Image Models
by: Kim, Taekyung, et al.
Published: (2023)

Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)

Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images
by: Kim, Jiwon, et al.
Published: (2023)

An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
by: Byun, Jaeseok, et al.
Published: (2024)

Leveraging Temporal Contextualization for Video Action Recognition
by: Kim, Minji, et al.
Published: (2024)

ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
by: Chun, Sanghyuk, et al.
Published: (2022)

RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
by: Park, Seulki, et al.
Published: (2023)

Model Stock: All we need is just a few fine-tuned models
by: Jang, Dong-Hwan, et al.
Published: (2024)

Improved Probabilistic Image-Text Representations
by: Chun, Sanghyuk
Published: (2023)

Exploring Conditions for Diffusion models in Robotic Control
by: Shin, Heeseong, et al.
Published: (2025)

Rotary Position Embedding for Vision Transformer
by: Heo, Byeongho, et al.
Published: (2024)

DNNs May Determine Major Properties of Their Outputs Early, with Timing Possibly Driven by Bias
by: Park, Song, et al.
Published: (2025)

DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation
by: Oh, Changdae, et al.
Published: (2024)

RL makes MLLMs see better than SFT
by: Song, Junha, et al.
Published: (2025)

HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
by: Chen, Wenzhi, et al.
Published: (2026)

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
by: Song, Junha, et al.
Published: (2026)

Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks
by: Jo, Minsoo, et al.
Published: (2025)

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
by: Kim, Minji, et al.
Published: (2025)

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
by: Park, Yong-Hyun, et al.
Published: (2024)

Similarity of Neural Architectures using Adversarial Attack Transferability
by: Hwang, Jaehui, et al.
Published: (2022)

Read, Watch and Scream! Sound Generation from Text and Video
by: Jeong, Yujin, et al.
Published: (2024)

Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning
by: Chun, Sanghyuk
Published: (2025)

Towards Calibrated Robust Fine-Tuning of Vision-Language Models
by: Oh, Changdae, et al.
Published: (2023)

Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
by: Jin, Hyundong, et al.
Published: (2026)

Grounding World Simulation Models in a Real-World Metropolis
by: Seo, Junyoung, et al.
Published: (2026)

DELST: Dual Entailment Learning for Hyperbolic Image-Gene Pretraining in Spatial Transcriptomics
by: Chen, Xulin, et al.
Published: (2025)

HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models
by: Chan, Wing, et al.
Published: (2026)

ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
by: Kim, Hoyoung, et al.
Published: (2026)

WorldKV: Efficient World Memory with World Retrieval and Compression
by: Yi, Jung, et al.
Published: (2026)

A Simple Baseline with Single-encoder for Referring Image Segmentation
by: Yu, Seonghoon, et al.
Published: (2024)

DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
by: Kim, Donghyun, et al.
Published: (2024)