:: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Longyuan; Hua, Hairan; Miao, Linlin; Zhao, Bing
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.11674

Similar Items

TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent
by: Sui, Xingyu, et al.
Published: (2026)

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
by: Akhtar, Mubashara, et al.
Published: (2026)

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
by: Wang, Zeyu, et al.
Published: (2026)

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
by: Xu, Chengliang, et al.
Published: (2026)

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
by: Zhu, Hongda, et al.
Published: (2025)

Benchmarking Overton Pluralism in LLMs
by: Poole-Dayan, Elinor, et al.
Published: (2025)

PHM-Bench: A Domain-Specific Benchmarking Framework for Systematic Evaluation of Large Models in Prognostics and Health Management
by: Yang, Puyu, et al.
Published: (2025)

VCBench: Benchmarking LLMs in Venture Capital
by: Chen, Rick, et al.
Published: (2025)

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
by: Wang, Ben, et al.
Published: (2026)

LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs
by: Hao, Bing, et al.
Published: (2025)

Fluidity Index: Next-Generation Super-intelligence Benchmarks
by: Ngoiya, Eric, et al.
Published: (2025)

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
by: Zhu, Jiace, et al.
Published: (2026)

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
by: Xu, Zhao, et al.
Published: (2024)

EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks
by: Yang, Xiao, et al.
Published: (2025)

Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
by: Zhao, Jiaqi, et al.
Published: (2025)

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
by: Wang, Haochuan, et al.
Published: (2024)

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
by: Fabbri, Alexander R., et al.
Published: (2025)

Rethinking Metrics and Benchmarks of Video Anomaly Detection
by: Liu, Zihao, et al.
Published: (2025)

Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
by: Tie, Guiyao, et al.
Published: (2025)

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
by: Li, Xinghang, et al.
Published: (2025)

Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
by: Shao, Jiaqi, et al.
Published: (2025)

INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems
by: Tang, Bintao, et al.
Published: (2025)

Design and Report Benchmarks for Knowledge Work
by: Hua, Yining, et al.
Published: (2026)

PepBenchmark: A Standardized Benchmark for Peptide Machine Learning
by: Zhang, Jiahui, et al.
Published: (2026)

GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
by: Luo, Shixian, et al.
Published: (2025)

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
by: Hua, Tianyu, et al.
Published: (2025)

SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese
by: Xu, Liang, et al.
Published: (2024)

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs
by: Li, Yige, et al.
Published: (2026)

MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
by: Zhang, Mengyuan, et al.
Published: (2024)

SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
by: Liu, Zhiqiang, et al.
Published: (2025)

SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
by: Liu, Chaoqun, et al.
Published: (2025)

PFMBench: Protein Foundation Model Benchmark
by: Gao, Zhangyang, et al.
Published: (2025)

VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs
by: Gu, Zixuan, et al.
Published: (2025)

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
by: Zhou, Ruiwen, et al.
Published: (2024)

Benchmarking LLMs for Predictive Applications in the Intensive Care Units
by: Malhotra, Chehak, et al.
Published: (2025)

MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
by: Zhao, Chenchen, et al.
Published: (2025)

LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
by: Toker, Gilat, et al.
Published: (2026)

SmartPlay: A Benchmark for LLMs as Intelligent Agents
by: Wu, Yue, et al.
Published: (2023)

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
by: Zhu, Yuke, et al.
Published: (2020)

Benchmarking Concept-Spilling Across Languages in LLMs
by: Badanin, Ilia, et al.
Published: (2026)