| Main Authors: | Lilienthal, Derek, Mukherjee, Manisha, Horawalavithana, Sameera |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.13993 |
Similar Items
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
by: Horawalavithana, Sameera, et al.
Published: (2026)
SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions
by: Horawalavithana, Sameera, et al.
Published: (2023)
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
by: Ranjan, Mukul, et al.
Published: (2026)
Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling
by: Chen, Qiyuan, et al.
Published: (2026)
How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
by: Yang, Luyu, et al.
Published: (2026)
SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
by: Li, Hongxing, et al.
Published: (2025)
Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
by: Xiao, Wenyi, et al.
Published: (2025)
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
by: Ding, Meidan, et al.
Published: (2025)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
by: Xiao, Wenyi, et al.
Published: (2026)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
by: Zhang, Jingyuan, et al.
Published: (2025)
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
by: Qu, Kevin, et al.
Published: (2026)
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
by: Xiong, Guangzhi, et al.
Published: (2026)
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
by: Chen, Qiguang, et al.
Published: (2026)
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
by: Pothiraj, Atin, et al.
Published: (2025)
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
by: Song, Xiujie, et al.
Published: (2024)
Think Visually, Reason Textually: Vision-Language Synergy in ARC
by: Zhang, Beichen, et al.
Published: (2025)
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
by: Liu, Shudong, et al.
Published: (2025)
CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design
by: Jeong, Daeheon, et al.
Published: (2025)
Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning
by: Yu, Haorui, et al.
Published: (2025)
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
by: Zou, Chengke, et al.
Published: (2024)
NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models
by: Pandya, Pranshu, et al.
Published: (2024)
Understanding the Effects of Distractors on Reasoning Vision-Language Models
by: Bae, Jiyun, et al.
Published: (2025)
ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
by: Schroeder, Philip, et al.
Published: (2025)
DocReward: A Document Reward Model for Structuring and Stylizing
by: Liu, Junpeng, et al.
Published: (2025)
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
by: Hu, Wenbo, et al.
Published: (2025)
CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
by: Liu, Tianhui, et al.
Published: (2025)
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
by: Wang, Shenzhi, et al.
Published: (2026)
MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains
by: Zhan, Zhaohuan, et al.
Published: (2024)
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
by: Zhou, Gengze, et al.
Published: (2024)
Survey on Vision-Language-Action Models
by: Adilkhanov, Adilzhan, et al.
Published: (2025)
Revisiting the Role of Language Priors in Vision-Language Models
by: Lin, Zhiqiu, et al.
Published: (2023)
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
by: Lu, Yujie, et al.
Published: (2024)
Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
by: Chen, Chao, et al.
Published: (2025)
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
by: Du, Yiyang, et al.
Published: (2026)
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
by: Villegas, Danae Sánchez, et al.
Published: (2026)
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
by: Nagar, Aishik, et al.
Published: (2024)
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
by: Prasad, Archiki, et al.
Published: (2023)