| Main Authors: | Tong, Yijie, Hou, Yifan, Cui, Shaobo, Bosselut, Antoine, Sachan, Mrinmaya |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.30713 |
Similar Items
Unveiling the Visual Counting Bottleneck in Vision-Language Models
by: Pang, Xingzhou, et al.
Published: (2026)
Diversity-Guided MLP Reduction for Efficient Large Vision Transformers
by: Shen, Chengchao, et al.
Published: (2025)
Zero-shot image privacy classification with Vision-Language Models
by: Baia, Alina Elena, et al.
Published: (2025)
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
by: Farina, Matteo, et al.
Published: (2025)
Language Models as Black-Box Optimizers for Vision-Language Models
by: Liu, Shihong, et al.
Published: (2023)
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
by: Sun, Zeyi, et al.
Published: (2024)
Detecting Content Rating Violations in Android Applications: A Vision-Language Approach
by: Denipitiyage, D., et al.
Published: (2025)
Test-Time Backdoor Attacks on Multimodal Large Language Models
by: Lu, Dong, et al.
Published: (2024)
Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
by: Ghosh, Dhruba, et al.
Published: (2026)
RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models
by: Lin, Xiang, et al.
Published: (2025)
Reducing Hallucinations in Vision-Language Models via Latent Space Steering
by: Liu, Sheng, et al.
Published: (2024)
Cross-Modal Coordination Across a Diverse Set of Input Modalities
by: Sánchez, Jorge, et al.
Published: (2024)
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
by: Zhang, Yabin, et al.
Published: (2024)
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
by: Geng, Tiantian, et al.
Published: (2024)
Bridging Compressed Image Latents and Multimodal Large Language Models
by: Kao, Chia-Hao, et al.
Published: (2024)
Multimodal Transformer With a Low-Computational-Cost Guarantee
by: Park, Sungjin, et al.
Published: (2024)
LinVT: Empower Your Image-level Large Language Model to Understand Videos
by: Gao, Lishuai, et al.
Published: (2024)
Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
by: Huang, Po-Hsuan, et al.
Published: (2024)
Discover Your Neighbors: Advanced Stable Test-Time Adaptation in Dynamic World
by: Jiang, Qinting, et al.
Published: (2024)
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
by: Chen, Yiming, et al.
Published: (2025)
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
by: Zhao, Shuai, et al.
Published: (2023)
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
by: Huang, Fanding, et al.
Published: (2025)
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
by: Deng, Ailin, et al.
Published: (2025)
3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
by: Schulz, Stefan, et al.
Published: (2026)
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
by: Jiang, Xun, et al.
Published: (2026)
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
by: Li, Jun, et al.
Published: (2026)
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
by: Chen, Junyi, et al.
Published: (2023)
Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation
by: Vinod, Gautham, et al.
Published: (2026)
Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
by: Chen, Yang, et al.
Published: (2024)
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
by: Han, Junlin, et al.
Published: (2025)
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
by: Williams-Lekuona, Mikel, et al.
Published: (2025)
GroMo: Plant Growth Modeling with Multiview Images
by: Bhatt, Ruchi, et al.
Published: (2025)
Do Vision-Language Models Really Understand Visual Language?
by: Hou, Yifan, et al.
Published: (2024)
Improving Long-Text Alignment for Text-to-Image Diffusion Models
by: Liu, Luping, et al.
Published: (2024)
Deep Video Codec Control for Vision Models
by: Reich, Christoph, et al.
Published: (2023)
Unveiling Encoder-Free Vision-Language Models
by: Diao, Haiwen, et al.
Published: (2024)
Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies
by: Smith, Megan, et al.
Published: (2026)
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
by: Qu, Leigang, et al.
Published: (2025)
Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
by: Deng, Ailin, et al.
Published: (2024)
PlanLLM: Video Procedure Planning with Refinable Large Language Models
by: Yang, Dejie, et al.
Published: (2024)