| Main authors: | Liu, Yuxuan; Shi, Yuntian; Wang, Kun; Shen, Haoting; Yang, Kun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2602.03263 |
Similar items
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
by: Ouyang, Kun, et al.
Published: (2024)
Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
by: Shen, Yiting, et al.
Published: (2026)
MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
by: Zong, Xuanjun, et al.
Published: (2025)
ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
by: Xu, Qiang, et al.
Published: (2026)
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
by: Zeng, Zhen, et al.
Published: (2026)
Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs
by: Wen, Ming, et al.
Published: (2026)
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
by: Li, Caorui, et al.
Published: (2025)
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
by: Chen, Jingxuan, et al.
Published: (2024)
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories
by: Xiao, Yijia, et al.
Published: (2025)
From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities
by: Jiang, Shixin, et al.
Published: (2024)
MileBench: Benchmarking MLLMs in Long Context
by: Song, Dingjie, et al.
Published: (2024)
CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation
by: Zhu, Shibo, et al.
Published: (2026)
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)
EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models
by: Xiong, Wei, et al.
Published: (2025)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
by: Zhang, Wenjing, et al.
Published: (2024)
ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
by: Jin, Tengjun, et al.
Published: (2025)
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
by: Yang, Tianzhuo, et al.
Published: (2026)
Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
by: Anand, Dhruv, et al.
Published: (2025)
OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
by: Wen, Ming, et al.
Published: (2026)
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
by: Choi, Dasol, et al.
Published: (2026)
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
by: Jiang, Feng, et al.
Published: (2026)
BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments
by: Li, Yuxuan, et al.
Published: (2026)
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
by: Levy, Ido, et al.
Published: (2024)
MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror
by: Guo, Shengyu, et al.
Published: (2026)
Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment
by: Wang, Kun, et al.
Published: (2026)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models
by: Fang, Junfeng, et al.
Published: (2025)
GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic
by: Zhang, Tianyuan, et al.
Published: (2026)
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
by: Xie, Sixiong, et al.
Published: (2026)
Affordance Benchmark for MLLMs
by: Wang, Junying, et al.
Published: (2025)
EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs
by: Dai, Yang, et al.
Published: (2026)
KidsArtBench: Multi-Dimensional Children's Art Evaluation with Attribute-Aware MLLMs
by: Ye, Mingrui, et al.
Published: (2025)
UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
by: Chen, Chen, et al.
Published: (2025)
Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
by: Ye, Hengwei, et al.
Published: (2026)
MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation
by: Wu, Linhan, et al.
Published: (2026)
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)
StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs
by: Luo, Yang, et al.
Published: (2026)
SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
by: Yang, Zonglin, et al.
Published: (2026)
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model
by: Wang, Siyin, et al.
Published: (2024)