| Main Authors: | Cheng, Wanyin, Ruan, Zanxi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.06010 |
Similar Items
Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
by: Ye, Zixuan, et al.
Published: (2025)
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
by: Ruan, Zanxi, et al.
Published: (2026)
LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
by: Huang, Jing, et al.
Published: (2025)
Long-Form Answers to Visual Questions from Blind and Low Vision People
by: Huh, Mina, et al.
Published: (2024)
Question-Aware Gaussian Experts for Audio-Visual Question Answering
by: Kim, Hongyeob, et al.
Published: (2025)
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
by: Fan, Lin, et al.
Published: (2026)
Visually Interpretable Subtask Reasoning for Visual Question Answering
by: Cheng, Yu, et al.
Published: (2025)
Analyzing the Sensitivity of Vision Language Models in Visual Question Answering
by: Shah, Monika, et al.
Published: (2025)
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
by: Khan, Zaid, et al.
Published: (2024)
VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness
by: Zhang, Rongyu, et al.
Published: (2024)
Is SAM3 ready for pathology segmentation?
by: Kong, Qiuyu, et al.
Published: (2026)
MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering
by: Peng, Jingwei, et al.
Published: (2025)
Latent Space Synergy: Text-Guided Data Augmentation for Direct Diffusion Biomedical Segmentation
by: Aqeel, Muhammad, et al.
Published: (2025)
Large Vision-Language Models for Remote Sensing Visual Question Answering
by: Siripong, Surasakdi, et al.
Published: (2024)
Selectively Answering Visual Questions
by: Eisenschlos, Julian Martin, et al.
Published: (2024)
Privacy-Aware Document Visual Question Answering
by: Tito, Rubèn, et al.
Published: (2023)
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
by: Qian, Yusu, et al.
Published: (2025)
Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
by: Lan, Jian, et al.
Published: (2025)
Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
by: Nikandrou, Malvina, et al.
Published: (2024)
Questioning the Stability of Visual Question Answering
by: Rosenfeld, Amir, et al.
Published: (2025)
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
by: Ma, Ziyu, et al.
Published: (2024)
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
by: Gao, Minghe, et al.
Published: (2025)
Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
by: Wieczorek, Tobias Jan, et al.
Published: (2025)
Location-Aware Pretraining for Medical Difference Visual Question Answering
by: Musinguzi, Denis, et al.
Published: (2026)
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
by: Zeng, Xiyin, et al.
Published: (2026)
Targeted Visual Prompting for Medical Visual Question Answering
by: Tascon-Morales, Sergio, et al.
Published: (2024)
Visual Robustness Benchmark for Visual Question Answering (VQA)
by: Ishmam, Md Farhan, et al.
Published: (2024)
Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering
by: Ha, Cuong Nhat, et al.
Published: (2024)
Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
by: Yeh, Yahsin, et al.
Published: (2025)
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
by: Wang, Lihong, et al.
Published: (2025)
Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
by: Shourya, Aditya, et al.
Published: (2025)
Evaluating Variance in Visual Question Answering Benchmarks
by: SR, Nikitha
Published: (2025)
Multimodal Rationales for Explainable Visual Question Answering
by: Li, Kun, et al.
Published: (2024)
LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
by: Girella, Federico, et al.
Published: (2025)
DocVXQA: Context-Aware Visual Explanations for Document Question Answering
by: Souibgui, Mohamed Ali, et al.
Published: (2025)
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
by: Romero, David, et al.
Published: (2024)
Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
by: Kapuriya, Janak, et al.
Published: (2025)
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models
by: Meng, Tian, et al.
Published: (2024)
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
by: Xu, Guowei, et al.
Published: (2024)
VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering
by: Nguyen, Hai-Dang, et al.
Published: (2025)