| Main Authors: | Wang, Xiang; Zhang, Tingting; Wang, Sen; Wu, Ying; Meng, Heng; Zhou, Peng; Li, Peng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.28032 |
Similar Items
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
by: Zhou, Xiyuan, et al.
Published: (2025)
CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment
by: Feng, Rui, et al.
Published: (2025)
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
by: Xu, Peiran, et al.
Published: (2025)
EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
by: Yuan, Botai, et al.
Published: (2025)
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
by: Zhang, Junkai, et al.
Published: (2025)
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
by: Qiu, Jielin, et al.
Published: (2025)
OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
by: Zhou, Chengfeng, et al.
Published: (2025)
SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
by: Xiao, Peiyao, et al.
Published: (2026)
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
by: Wang, Yuhang, et al.
Published: (2026)
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
by: Wang, Zhensheng, et al.
Published: (2026)
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation
by: Yin, Wenjing, et al.
Published: (2025)
SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
by: Xing, Wenpeng, et al.
Published: (2025)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
by: Li, Yinsheng, et al.
Published: (2025)
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
by: Zhou, Pengfei, et al.
Published: (2025)
RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)
MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
by: Wang, Han, et al.
Published: (2026)
Genshin: General Shield for Natural Language Processing with Large Language Models
by: Peng, Xiao, et al.
Published: (2024)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
by: Li, Yizhi, et al.
Published: (2024)
QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models
by: Wu, Yao, et al.
Published: (2026)
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
by: Zhu, Fengbin, et al.
Published: (2024)
ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing
by: Wang, Jinzhi, et al.
Published: (2025)
ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
by: Wang, Tianlong, et al.
Published: (2026)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
by: Bao, Han, et al.
Published: (2024)
MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models
by: Ding, Peng, et al.
Published: (2024)
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
by: Cao, Hongye, et al.
Published: (2025)
CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering
by: Hu, Ruida, et al.
Published: (2024)
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
by: Xiong, Zixin, et al.
Published: (2026)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
by: Tang, Zecheng, et al.
Published: (2026)
Does Unification Come at a Cost? Uni-SafeBench: A Safety Benchmark for Unified Multimodal Large Models
by: Peng, Zixiang, et al.
Published: (2026)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models
by: Zhuang, Wanru, et al.
Published: (2025)
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
by: Hu, He, et al.
Published: (2025)
Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition
by: Zheng, Yushuo, et al.
Published: (2026)
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
by: Zhang, Wenjing, et al.
Published: (2024)
Mamba-MOC: A Multicategory Remote Object Counting via State Space Model
by: Liu, Peng, et al.
Published: (2025)
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
by: Jiang, Longteng, et al.
Published: (2026)
CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
by: Wang, Weida, et al.
Published: (2025)
AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models
by: Zhou, Yutong, et al.
Published: (2024)