| Main Authors: | Fazli, Mehrdad, Wei, Bowen, Zhu, Ziwei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.05939 |
Similar Items
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
by: Fazli, Mehrdad, et al.
Published: (2025)
Long Context Transfer from Language to Vision
by: Zhang, Peiyuan, et al.
Published: (2024)
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
by: Kang, Jialiang, et al.
Published: (2025)
Context-Aware Autoregressive Models for Multi-Conditional Image Generation
by: Chen, Yixiao, et al.
Published: (2025)
On the Faithfulness of Vision Transformer Explanations
by: Wu, Junyi, et al.
Published: (2024)
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
by: Kulkarni, Yogesh, et al.
Published: (2024)
Vision-Language Binding in In-Context Image Generation
by: Ge, Chris, et al.
Published: (2026)
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
by: Zheng, Dian, et al.
Published: (2025)
Stencil: Subject-Driven Generation with Context Guidance
by: Chen, Gordon, et al.
Published: (2025)
FaithScore: Fine-grained Evaluations of Hallucinations in Large Vision-Language Models
by: Jing, Liqiang, et al.
Published: (2023)
Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning
by: Liu, Shih-Wen, et al.
Published: (2025)
Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark
by: Zhou, Rulin, et al.
Published: (2025)
ChartQA-X: Generating Explanations for Visual Chart Reasoning
by: Hegde, Shamanthak, et al.
Published: (2025)
Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
by: Liu, Ziwei, et al.
Published: (2025)
Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model
by: Nguyen, Duy M. H., et al.
Published: (2024)
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
by: Moll, Johannes, et al.
Published: (2025)
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
by: Qiu, Haoyi, et al.
Published: (2024)
Large Vision-Language Models as Emotion Recognizers in Context Awareness
by: Lei, Yuxuan, et al.
Published: (2024)
SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
by: Li, Zhaoxu, et al.
Published: (2026)
Context Diffusion: In-Context Aware Image Generation
by: Najdenkoska, Ivona, et al.
Published: (2023)
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
by: Wang, Zhaowei, et al.
Published: (2026)
Towards Multimodal In-Context Learning for Vision & Language Models
by: Doveh, Sivan, et al.
Published: (2024)
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
by: Yildirim, Selin, et al.
Published: (2026)
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
by: Li, Chaoyu, et al.
Published: (2024)
Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations
by: Ding, Sihao, et al.
Published: (2025)
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
by: Qi, Jianing, et al.
Published: (2025)
IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding
by: Zhu, Lanyun, et al.
Published: (2024)
MMSpec: Benchmarking Speculative Decoding for Vision-Language Models
by: Shen, Hui, et al.
Published: (2026)
Negation-Aware Test-Time Adaptation for Vision-Language Models
by: Han, Haochen, et al.
Published: (2025)
EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
by: Kulkarni, Yogesh, et al.
Published: (2025)
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
by: Kulkarni, Yogesh, et al.
Published: (2025)
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
by: Kulkarni, Yogesh, et al.
Published: (2025)
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
by: Xiang, Ziwei, et al.
Published: (2026)
Context-Aware Token Selection and Packing for Enhanced Vision Transformer
by: Zhang, Tianyi, et al.
Published: (2024)
HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
by: Ouyang, Shuyi, et al.
Published: (2024)
Step-Level Visual Grounding Faithfulness Predicts Out-of-Distribution Generalization in Long-Horizon Vision-Language Models
by: Rahman, Md Ashikur, et al.
Published: (2026)
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models
by: Zhang, Ce, et al.
Published: (2025)
TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
by: Yang, Fan, et al.
Published: (2026)
Optimizing Vision-Language Interactions Through Decoder-Only Models
by: Tanaka, Kaito, et al.
Published: (2024)
BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model
by: Li, Haosheng, et al.
Published: (2026)