| Main Authors: | Madaan, Divyam, Chopra, Sumit, Cho, Kyunghyun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.16979 |
Similar Items
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
by: Madaan, Divyam, et al.
Published: (2024)
Temporal Generalization: A Reality Check
by: Madaan, Divyam, et al.
Published: (2025)
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
by: Madaan, Divyam, et al.
Published: (2025)
HIST-AID: Leveraging Historical Patient Reports for Enhanced Multi-Modal Automatic Diagnosis
by: Huang, Haoxu, et al.
Published: (2024)
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
by: Huang, Chengyue, et al.
Published: (2025)
Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
by: Huang, Chengyue, et al.
Published: (2025)
Hyperparameters in Continual Learning: A Reality Check
by: Cha, Sungmin, et al.
Published: (2024)
Impact of Noisy Supervision in Foundation Model Learning
by: Chen, Hao, et al.
Published: (2024)
LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models
by: Zhu, Mengdan, et al.
Published: (2024)
Everything is a Video: Unifying Modalities through Next-Frame Prediction
by: Hudson, G. Thomas, et al.
Published: (2024)
It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
by: Fahim, Abrar, et al.
Published: (2024)
Multimodal Latent Language Modeling with Next-Token Diffusion
by: Sun, Yutao, et al.
Published: (2024)
Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
by: Huang, Chengyue, et al.
Published: (2025)
X-VILA: Cross-Modality Alignment for Large Language Model
by: Ye, Hanrong, et al.
Published: (2024)
Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)
SpurLens: Automatic Detection of Spurious Cues in Multimodal LLMs
by: Hosseini, Parsa, et al.
Published: (2025)
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
by: Zhou, Yiyang, et al.
Published: (2024)
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
by: Rajabi, Navid, et al.
Published: (2023)
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2023)
A training regime to learn unified representations from complementary breast imaging modalities
by: Sharma, Umang, et al.
Published: (2024)
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
by: Chen, Junyi, et al.
Published: (2023)
Text-centric Alignment for Multi-Modality Learning
by: Tsai, Yun-Da, et al.
Published: (2024)
Mem-W: Latent Memory-Native GUI Agents
by: Zhang, Guibin, et al.
Published: (2026)
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
by: Ebrahimi, Sayna, et al.
Published: (2024)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
by: Ghazanfari, Sara, et al.
Published: (2024)
Multi-Modal Hallucination Control by Visual Information Grounding
by: Favero, Alessandro, et al.
Published: (2024)
Ethology of Latent Spaces
by: Boisnard, Philippe
Published: (2026)
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
by: Liang, Weixin, et al.
Published: (2025)
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
by: Lin, Ci-Siang, et al.
Published: (2024)
Deep Augmentation: Dropout as Augmentation for Self-Supervised Learning
by: Brüel-Gabrielsson, Rickard, et al.
Published: (2023)
Text-to-Image Cross-Modal Generation: A Systematic Review
by: Żelaszczyk, Maciej, et al.
Published: (2024)
The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
by: Liu, Xiaoze, et al.
Published: (2026)
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)
Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning
by: Huang, Guangming, et al.
Published: (2024)
Adaptive Sampling of k-Space in Magnetic Resonance for Rapid Pathology Prediction
by: Yen, Chen-Yu, et al.
Published: (2024)
PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems
by: Goel, Divyam, et al.
Published: (2026)
Latent Action Pretraining from Videos
by: Ye, Seonghyeon, et al.
Published: (2024)
Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning
by: Yu, Yongcan, et al.
Published: (2025)
Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning
by: Huang, Siteng, et al.
Published: (2023)
A Trust-Guided Approach to MR Image Reconstruction with Side Information
by: Atalık, Arda, et al.
Published: (2025)