| Main Authors: | Wang, Qidong; Hu, Junjie; Jiang, Ming |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.17941 |
Similar Items
VideoAVE: A Multi-Attribute Video-to-Text Attribute Value Extraction Dataset and Benchmark Models
by: Cheng, Ming, et al.
Published: (2025)
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
by: Jia, Mengzhao, et al.
Published: (2024)
Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning
by: Koleilat, Taha, et al.
Published: (2026)
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
by: Huang, Qidong, et al.
Published: (2024)
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
by: Wang, Qidong, et al.
Published: (2025)
Pseudo-Prompt Generating in Pre-trained Vision-Language Models for Multi-Label Medical Image Classification
by: Ye, Yaoqin, et al.
Published: (2024)
VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
by: Wang, Zihu, et al.
Published: (2025)
Can We Predict Performance of Large Models across Vision-Language Tasks?
by: Zhao, Qinyu, et al.
Published: (2024)
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
by: Chen, Qizhou, et al.
Published: (2024)
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
by: Tian, Yu, et al.
Published: (2024)
Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
by: Wu, Xuyang, et al.
Published: (2024)
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
by: Xing, Long, et al.
Published: (2024)
Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues
by: Zhang, Zory, et al.
Published: (2025)
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
by: Zhu, Zifeng, et al.
Published: (2024)
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
by: He, Jinghan, et al.
Published: (2024)
Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding
by: Yoon, Hee Suk, et al.
Published: (2026)
Conflict Adaptation in Vision-Language Models
by: Hu, Xiaoyang
Published: (2025)
The Abstraction Gap in Vision-Language Causal Reasoning
by: Hoang, Chinh, et al.
Published: (2026)
Prompting Large Vision-Language Models for Compositional Reasoning
by: Ossowski, Timothy, et al.
Published: (2024)
Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
by: Huemann, Zachary, et al.
Published: (2025)
Insight-A: Attribution-aware for Multimodal Misinformation Detection
by: Wu, Junjie, et al.
Published: (2025)
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
by: Ratzlaff, Neale, et al.
Published: (2024)
VISIONLOGIC: From Neuron Activations to Causally Grounded Concept Rules for Vision Models
by: Geng, Chuqin, et al.
Published: (2025)
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
by: Kojima, Noriyuki, et al.
Published: (2023)
Dynamic Token Reweighting for Robust Vision-Language Models
by: Jiang, Tanqiu, et al.
Published: (2025)
PM4Bench: Benchmarking Large Vision-Language Models with Parallel Multilingual Multi-Modal Multi-task Corpus
by: Gao, Junyuan, et al.
Published: (2025)
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
by: Zhang, Zhaoyang, et al.
Published: (2023)
Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation
by: Zha, Yuheng, et al.
Published: (2025)
GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
by: Jiang, Ziyan, et al.
Published: (2024)
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
by: Lim, Qi Zhi, et al.
Published: (2025)
Causal Graphical Models for Vision-Language Compositional Understanding
by: Parascandolo, Fiorenzo, et al.
Published: (2024)
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
by: Wu, Yuhang, et al.
Published: (2024)
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
by: Xiong, Guangzhi, et al.
Published: (2026)
RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models
by: Luo, Sha, et al.
Published: (2026)
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
by: Jiang, Lei, et al.
Published: (2025)
TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions
by: He, Xingwei, et al.
Published: (2024)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
by: Nguyen, Duy, et al.
Published: (2025)
DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding
by: Liu, Zixuan, et al.
Published: (2025)
Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)