| Main Authors: | Wang, Kangkang; Jiang, Qinting; Zhang, Wanping; Ren, Bowen; Wen, Shengzhao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.03485 |
Similar Items
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
by: Zhao, Bingchen, et al.
Published: (2024)
Prompting Large Vision-Language Models for Compositional Reasoning
by: Ossowski, Timothy, et al.
Published: (2024)
MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm
by: Fan, Xiao, et al.
Published: (2025)
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
by: Wang, Ruofan, et al.
Published: (2024)
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
by: Wang, Sibo, et al.
Published: (2024)
Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities
by: Li, Zhiyuan, et al.
Published: (2024)
Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
by: Chen, Yan, et al.
Published: (2025)
A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
by: Jiang, Siyang, et al.
Published: (2025)
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
by: Zhang, Miaosen, et al.
Published: (2024)
MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
by: Liu, Chonghan, et al.
Published: (2025)
GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs
by: Zheng, Guanghao, et al.
Published: (2025)
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
by: Tian, Kexin, et al.
Published: (2025)
ReasonEdit: Editing Vision-Language Models using Human Reasoning
by: Qiu, Jiaxing, et al.
Published: (2026)
The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
by: Wang, Dingyu, et al.
Published: (2025)
VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation
by: Sajib, Rakib Hossain, et al.
Published: (2026)
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
by: Wang, Haozhe, et al.
Published: (2026)
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
by: Li, Guanzhen, et al.
Published: (2024)
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
by: Song, Xiujie, et al.
Published: (2024)
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
by: Zhong, Chen, et al.
Published: (2026)
SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
by: Perez, Alejandra, et al.
Published: (2026)
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
by: Luo, Zhiming, et al.
Published: (2026)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
by: Bao, Han, et al.
Published: (2024)
EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
by: Yuan, Botai, et al.
Published: (2025)
Feature-Based Instance Neighbor Discovery: Advanced Stable Test-Time Adaptation in Dynamic World
by: Jiang, Qinting, et al.
Published: (2025)
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
by: Gu, Zheyuan, et al.
Published: (2026)
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
by: Lee, Daeun, et al.
Published: (2026)
Discover Your Neighbors: Advanced Stable Test-Time Adaptation in Dynamic World
by: Jiang, Qinting, et al.
Published: (2024)
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
by: Li, Jingyao, et al.
Published: (2025)
Benchmarking and Mitigating Sycophancy in Medical Vision Language Models
by: Xu, Juangui, et al.
Published: (2025)
FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
by: Pyo, Jiyoon, et al.
Published: (2025)
Explaining Multi-modal Large Language Models by Analyzing their Vision Perception
by: Giulivi, Loris, et al.
Published: (2024)
Improve Vision Language Model Chain-of-thought Reasoning
by: Zhang, Ruohong, et al.
Published: (2024)
Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
by: Yan, Bei, et al.
Published: (2024)
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
by: Cai, Huanqia, et al.
Published: (2025)
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
by: Mushkani, Rashid
Published: (2025)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
by: Jia, Mengdi, et al.
Published: (2025)
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
by: Chen, Qiguang, et al.
Published: (2026)
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
by: Pan, Chenbin, et al.
Published: (2025)
CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models
by: Cai, Jie, et al.
Published: (2025)
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
by: Huang, Fanding, et al.
Published: (2025)