| Main authors: | Wang, Tianlong; Wang, Pinqiao; Shi, Weili; Li, Sheng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2603.19515 |
Similar items
Multi-Agent Debate: A Unified Agentic Framework for Tabular Anomaly Detection
by: Wang, Pinqiao, et al.
Published: (2026)
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
by: Xu, Peiran, et al.
Published: (2025)
CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment
by: Feng, Rui, et al.
Published: (2025)
Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification
by: Shi, Weili, et al.
Published: (2026)
ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models
by: Zhang, Lingfeng, et al.
Published: (2024)
PetroBench: A Benchmark for Large Language Models in Petroleum Engineering
by: Wang, Xiang, et al.
Published: (2026)
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
by: Zheng, Yu, et al.
Published: (2025)
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
by: Dong, Nguyen Tien, et al.
Published: (2025)
SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
by: Xiao, Peiyao, et al.
Published: (2026)
Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting
by: Jiang, Jiyue, et al.
Published: (2025)
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
by: Wang, Yuhang, et al.
Published: (2026)
FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning
by: Shen, Xu, et al.
Published: (2025)
ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research
by: Shen, Hao, et al.
Published: (2026)
MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages
by: Han, Wenhan, et al.
Published: (2025)
CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
by: Xu, Chengliang, et al.
Published: (2026)
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
by: Zhang, Junkai, et al.
Published: (2025)
OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
by: Zhou, Chengfeng, et al.
Published: (2025)
OR-Bench: An Over-Refusal Benchmark for Large Language Models
by: Cui, Justin, et al.
Published: (2024)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
by: Qiu, Lu, et al.
Published: (2024)
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
by: Wang, Zhensheng, et al.
Published: (2026)
SConU: Selective Conformal Uncertainty in Large Language Models
by: Wang, Zhiyuan, et al.
Published: (2025)
MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
by: Wang, Han, et al.
Published: (2026)
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
by: Pandit, Shrey, et al.
Published: (2025)
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
by: Zhou, Xiyuan, et al.
Published: (2025)
Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition
by: Zheng, Yushuo, et al.
Published: (2026)
MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis
by: Zhou, Yingjie, et al.
Published: (2024)
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
by: Hu, He, et al.
Published: (2025)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
by: Bao, Han, et al.
Published: (2024)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
by: Zhang, Mian, et al.
Published: (2024)
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
by: Xiong, Zixin, et al.
Published: (2026)
SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
by: Monti, Sebastiano, et al.
Published: (2026)
LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
by: Li, Hao, et al.
Published: (2026)
MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
by: Tang, Zecheng, et al.
Published: (2026)
EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
by: Yuan, Botai, et al.
Published: (2025)
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
by: Zhang, Wenjing, et al.
Published: (2024)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)
ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models
by: Zhou, Xiyuan, et al.
Published: (2024)
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
by: Li, Xiangyi, et al.
Published: (2026)