:: Library Catalog

Bibliographic Details
Main Authors:	Tong, Yijie; Hou, Yifan; Cui, Shaobo; Bosselut, Antoine; Sachan, Mrinmaya
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2605.30713

Similar Items

Unveiling the Visual Counting Bottleneck in Vision-Language Models
by: Pang, Xingzhou, et al.
Published: (2026)

Diversity-Guided MLP Reduction for Efficient Large Vision Transformers
by: Shen, Chengchao, et al.
Published: (2025)

Zero-shot image privacy classification with Vision-Language Models
by: Baia, Alina Elena, et al.
Published: (2025)

Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
by: Farina, Matteo, et al.
Published: (2025)

Language Models as Black-Box Optimizers for Vision-Language Models
by: Liu, Shihong, et al.
Published: (2023)

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
by: Sun, Zeyi, et al.
Published: (2024)

Detecting Content Rating Violations in Android Applications: A Vision-Language Approach
by: Denipitiyage, D., et al.
Published: (2025)

Test-Time Backdoor Attacks on Multimodal Large Language Models
by: Lu, Dong, et al.
Published: (2024)

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
by: Ghosh, Dhruba, et al.
Published: (2026)

RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models
by: Lin, Xiang, et al.
Published: (2025)

Reducing Hallucinations in Vision-Language Models via Latent Space Steering
by: Liu, Sheng, et al.
Published: (2024)

Cross-Modal Coordination Across a Diverse Set of Input Modalities
by: Sánchez, Jorge, et al.
Published: (2024)

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
by: Zhang, Yabin, et al.
Published: (2024)

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
by: Geng, Tiantian, et al.
Published: (2024)

Bridging Compressed Image Latents and Multimodal Large Language Models
by: Kao, Chia-Hao, et al.
Published: (2024)

Multimodal Transformer With a Low-Computational-Cost Guarantee
by: Park, Sungjin, et al.
Published: (2024)

LinVT: Empower Your Image-level Large Language Model to Understand Videos
by: Gao, Lishuai, et al.
Published: (2024)

Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis
by: Huang, Po-Hsuan, et al.
Published: (2024)

Discover Your Neighbors: Advanced Stable Test-Time Adaptation in Dynamic World
by: Jiang, Qinting, et al.
Published: (2024)

From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
by: Chen, Yiming, et al.
Published: (2025)

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
by: Zhao, Shuai, et al.
Published: (2023)

COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
by: Huang, Fanding, et al.
Published: (2025)

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
by: Deng, Ailin, et al.
Published: (2025)

3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis
by: Schulz, Stefan, et al.
Published: (2026)

Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
by: Jiang, Xun, et al.
Published: (2026)

Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
by: Li, Jun, et al.
Published: (2026)

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
by: Chen, Junyi, et al.
Published: (2023)

Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation
by: Vinod, Gautham, et al.
Published: (2026)

Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning
by: Chen, Yang, et al.
Published: (2024)

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
by: Han, Junlin, et al.
Published: (2025)

Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
by: Williams-Lekuona, Mikel, et al.
Published: (2025)

GroMo: Plant Growth Modeling with Multiview Images
by: Bhatt, Ruchi, et al.
Published: (2025)

Do Vision-Language Models Really Understand Visual Language?
by: Hou, Yifan, et al.
Published: (2024)

Improving Long-Text Alignment for Text-to-Image Diffusion Models
by: Liu, Luping, et al.
Published: (2024)

Deep Video Codec Control for Vision Models
by: Reich, Christoph, et al.
Published: (2023)

Unveiling Encoder-Free Vision-Language Models
by: Diao, Haiwen, et al.
Published: (2024)

Operationalizing Fairness in Text-to-Image Models: A Survey of Bias, Fairness Audits and Mitigation Strategies
by: Smith, Megan, et al.
Published: (2026)

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
by: Qu, Leigang, et al.
Published: (2025)

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
by: Deng, Ailin, et al.
Published: (2024)

PlanLLM: Video Procedure Planning with Refinable Large Language Models
by: Yang, Dejie, et al.
Published: (2024)