| Main Authors: | Shang, Fangxin, Xia, Yuan, Yang, Dalu, Wang, Yahui, Yang, Binglin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.16674 |
Similar Items
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
by: Ding, Meidan, et al.
Published: (2025)
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
by: Shi, Baorong, et al.
Published: (2026)
MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models
by: Yan, Qiao, et al.
Published: (2025)
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
by: Zuo, Yuxin, et al.
Published: (2025)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
by: Wang, Fei, et al.
Published: (2024)
MedGemma Technical Report
by: Sellergren, Andrew, et al.
Published: (2025)
Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing
by: Yang, Minglai, et al.
Published: (2026)
FCMBench: The First Large-scale Financial Credit Multimodal Benchmark for Real-world Applications
by: Yang, Yehui, et al.
Published: (2026)
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis
by: Chen, Lei, et al.
Published: (2024)
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
by: Qin, Lixiong, et al.
Published: (2025)
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
by: Gao, Xiangbo, et al.
Published: (2026)
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)
KorMedMCQA-V: A Multimodal Benchmark for Evaluating Vision-Language Models on the Korean Medical Licensing Examination
by: Choi, Byungjin, et al.
Published: (2026)
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
by: Zhou, Ziwei, et al.
Published: (2026)
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
by: Ghaboura, Sara, et al.
Published: (2024)
MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
by: Khan, Ufaq, et al.
Published: (2026)
SegICL: A Multimodal In-context Learning Framework for Enhanced Segmentation in Medical Imaging
by: Shen, Lingdong, et al.
Published: (2024)
Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias
by: Wan, Zhongwei, et al.
Published: (2023)
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
by: Kil, Jihyung, et al.
Published: (2024)
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
by: Liu, Ziqiang, et al.
Published: (2024)
VinaBench: Benchmark for Faithful and Consistent Visual Narratives
by: Gao, Silin, et al.
Published: (2025)
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
by: Zhang, Chenkai, et al.
Published: (2025)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
by: Cai, Mu, et al.
Published: (2024)
Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
by: Anand, Dhruv, et al.
Published: (2025)
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
by: Qi, Yukun, et al.
Published: (2025)
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
by: Lee, Segyu, et al.
Published: (2026)
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
by: Li, Shilong, et al.
Published: (2025)
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
by: Wang, Andong, et al.
Published: (2024)
OmniBench: Towards The Future of Universal Omni-Language Models
by: Li, Yizhi, et al.
Published: (2024)
EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
by: Yuan, Botai, et al.
Published: (2025)
Hallucination Benchmark in Medical Visual Question Answering
by: Wu, Jinge, et al.
Published: (2024)
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation
by: Lim, Hyeonseok, et al.
Published: (2024)
MileBench: Benchmarking MLLMs in Long Context
by: Song, Dingjie, et al.
Published: (2024)
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
by: Xue, Xiangyuan, et al.
Published: (2024)
Benchmarking Robustness of Contrastive Learning Models for Medical Image-Report Retrieval
by: Deanda, Demetrio, et al.
Published: (2025)
MultiMed: Massively Multimodal and Multitask Medical Understanding
by: Mo, Shentong, et al.
Published: (2024)
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
by: Kundu, Souvik, et al.
Published: (2025)
Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports
by: Guo, Guangyu, et al.
Published: (2024)
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
by: Liang, Yijun, et al.
Published: (2025)