| Main Authors: | Kim, Wonjae, Chun, Sanghyuk, Kim, Taekyung, Han, Dongyoon, Yun, Sangdoo |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.17507 |
Similar Items
Emergence of Text Readability in Vision Language Models
by: Park, Jaeyoo, et al.
Published: (2025)
Probabilistic Language-Image Pre-Training
by: Chun, Sanghyuk, et al.
Published: (2024)
Learning with Unmasked Tokens Drives Stronger Vision Learners
by: Kim, Taekyung, et al.
Published: (2023)
Language-only Efficient Training of Zero-shot Composed Image Retrieval
by: Gu, Geonmo, et al.
Published: (2023)
Masking meets Supervision: A Strong Learning Alliance
by: Heo, Byeongho, et al.
Published: (2023)
Token Bottleneck: One Token to Remember Dynamics
by: Kim, Taekyung, et al.
Published: (2025)
LongProLIP: A Probabilistic Vision-Language Model with Long Context Text
by: Chun, Sanghyuk, et al.
Published: (2025)
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
by: Gu, Geonmo, et al.
Published: (2023)
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
by: Kwak, Min-Seop, et al.
Published: (2025)
Morphing Tokens Draw Strong Masked Image Models
by: Kim, Taekyung, et al.
Published: (2023)
Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)
Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images
by: Kim, Jiwon, et al.
Published: (2023)
An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
by: Byun, Jaeseok, et al.
Published: (2024)
Leveraging Temporal Contextualization for Video Action Recognition
by: Kim, Minji, et al.
Published: (2024)
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
by: Chun, Sanghyuk, et al.
Published: (2022)
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
by: Park, Seulki, et al.
Published: (2023)
Model Stock: All we need is just a few fine-tuned models
by: Jang, Dong-Hwan, et al.
Published: (2024)
Improved Probabilistic Image-Text Representations
by: Chun, Sanghyuk
Published: (2023)
Exploring Conditions for Diffusion models in Robotic Control
by: Shin, Heeseong, et al.
Published: (2025)
Rotary Position Embedding for Vision Transformer
by: Heo, Byeongho, et al.
Published: (2024)
DNNs May Determine Major Properties of Their Outputs Early, with Timing Possibly Driven by Bias
by: Park, Song, et al.
Published: (2025)
DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation
by: Oh, Changdae, et al.
Published: (2024)
RL makes MLLMs see better than SFT
by: Song, Junha, et al.
Published: (2025)
HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
by: Chen, Wenzhi, et al.
Published: (2026)
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
by: Song, Junha, et al.
Published: (2026)
Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks
by: Jo, Minsoo, et al.
Published: (2025)
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
by: Kim, Minji, et al.
Published: (2025)
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
by: Park, Yong-Hyun, et al.
Published: (2024)
Similarity of Neural Architectures using Adversarial Attack Transferability
by: Hwang, Jaehui, et al.
Published: (2022)
Read, Watch and Scream! Sound Generation from Text and Video
by: Jeong, Yujin, et al.
Published: (2024)
Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning
by: Chun, Sanghyuk
Published: (2025)
Towards Calibrated Robust Fine-Tuning of Vision-Language Models
by: Oh, Changdae, et al.
Published: (2023)
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
by: Jin, Hyundong, et al.
Published: (2026)
Grounding World Simulation Models in a Real-World Metropolis
by: Seo, Junyoung, et al.
Published: (2026)
DELST: Dual Entailment Learning for Hyperbolic Image-Gene Pretraining in Spatial Transcriptomics
by: Chen, Xulin, et al.
Published: (2025)
HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models
by: Chan, Wing, et al.
Published: (2026)
ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
by: Kim, Hoyoung, et al.
Published: (2026)
WorldKV: Efficient World Memory with World Retrieval and Compression
by: Yi, Jung, et al.
Published: (2026)
A Simple Baseline with Single-encoder for Referring Image Segmentation
by: Yu, Seonghoon, et al.
Published: (2024)
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
by: Kim, Donghyun, et al.
Published: (2024)