:: Library Catalog

Bibliographic Details
Main Authors:	Madaan, Divyam; Chopra, Sumit; Cho, Kyunghyun
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.16979

Similar Items

Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
by: Madaan, Divyam, et al.
Published: (2024)

Temporal Generalization: A Reality Check
by: Madaan, Divyam, et al.
Published: (2025)

Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
by: Madaan, Divyam, et al.
Published: (2025)

HIST-AID: Leveraging Historical Patient Reports for Enhanced Multi-Modal Automatic Diagnosis
by: Huang, Haoxu, et al.
Published: (2024)

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
by: Huang, Chengyue, et al.
Published: (2025)

Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
by: Huang, Chengyue, et al.
Published: (2025)

Hyperparameters in Continual Learning: A Reality Check
by: Cha, Sungmin, et al.
Published: (2024)

Impact of Noisy Supervision in Foundation Model Learning
by: Chen, Hao, et al.
Published: (2024)

LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multimodal Large Language Models
by: Zhu, Mengdan, et al.
Published: (2024)

Everything is a Video: Unifying Modalities through Next-Frame Prediction
by: Hudson, G. Thomas, et al.
Published: (2024)

It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
by: Fahim, Abrar, et al.
Published: (2024)

Multimodal Latent Language Modeling with Next-Token Diffusion
by: Sun, Yutao, et al.
Published: (2024)

Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
by: Huang, Chengyue, et al.
Published: (2025)

X-VILA: Cross-Modality Alignment for Large Language Model
by: Ye, Hanrong, et al.
Published: (2024)

Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)

SpurLens: Automatic Detection of Spurious Cues in Multimodal LLMs
by: Hosseini, Parsa, et al.
Published: (2025)

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
by: Zhou, Yiyang, et al.
Published: (2024)

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
by: Rajabi, Navid, et al.
Published: (2023)

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2023)

A training regime to learn unified representations from complementary breast imaging modalities
by: Sharma, Umang, et al.
Published: (2024)

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
by: Chen, Junyi, et al.
Published: (2023)

Text-centric Alignment for Multi-Modality Learning
by: Tsai, Yun-Da, et al.
Published: (2024)

Mem-W: Latent Memory-Native GUI Agents
by: Zhang, Guibin, et al.
Published: (2026)

CROME: Cross-Modal Adapters for Efficient Multimodal LLM
by: Ebrahimi, Sayna, et al.
Published: (2024)

EMMA: Efficient Visual Alignment in Multi-Modal LLMs
by: Ghazanfari, Sara, et al.
Published: (2024)

Multi-Modal Hallucination Control by Visual Information Grounding
by: Favero, Alessandro, et al.
Published: (2024)

Ethology of Latent Spaces
by: Boisnard, Philippe
Published: (2026)

Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
by: Liang, Weixin, et al.
Published: (2025)

Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
by: Lin, Ci-Siang, et al.
Published: (2024)

Deep Augmentation: Dropout as Augmentation for Self-Supervised Learning
by: Brüel-Gabrielsson, Rickard, et al.
Published: (2023)

Text-to-Image Cross-Modal Generation: A Systematic Review
by: Żelaszczyk, Maciej, et al.
Published: (2024)

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
by: Liu, Xiaoze, et al.
Published: (2026)

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
by: Yang, Jianing, et al.
Published: (2024)

Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning
by: Huang, Guangming, et al.
Published: (2024)

Adaptive Sampling of k-Space in Magnetic Resonance for Rapid Pathology Prediction
by: Yen, Chen-Yu, et al.
Published: (2024)

PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems
by: Goel, Divyam, et al.
Published: (2026)

Latent Action Pretraining from Videos
by: Ye, Seonghyeon, et al.
Published: (2024)

Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning
by: Yu, Yongcan, et al.
Published: (2025)

Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning
by: Huang, Siteng, et al.
Published: (2023)

A Trust-Guided Approach to MR Image Reconstruction with Side Information
by: Atalık, Arda, et al.
Published: (2025)