| Main authors: | Zhu, Longyuan; Hua, Hairan; Miao, Linlin; Zhao, Bing |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2602.11674 |
Similar items
TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent
by: Sui, Xingyu, et al.
Published: (2026)
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
by: Akhtar, Mubashara, et al.
Published: (2026)
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
by: Wang, Zeyu, et al.
Published: (2026)
CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
by: Xu, Chengliang, et al.
Published: (2026)
FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
by: Zhu, Hongda, et al.
Published: (2025)
Benchmarking Overton Pluralism in LLMs
by: Poole-Dayan, Elinor, et al.
Published: (2025)
PHM-Bench: A Domain-Specific Benchmarking Framework for Systematic Evaluation of Large Models in Prognostics and Health Management
by: Yang, Puyu, et al.
Published: (2025)
VCBench: Benchmarking LLMs in Venture Capital
by: Chen, Rick, et al.
Published: (2025)
BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
by: Wang, Ben, et al.
Published: (2026)
LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs
by: Hao, Bing, et al.
Published: (2025)
Fluidity Index: Next-Generation Super-intelligence Benchmarks
by: Ngoiya, Eric, et al.
Published: (2025)
CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
by: Zhu, Jiace, et al.
Published: (2026)
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
by: Xu, Zhao, et al.
Published: (2024)
EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks
by: Yang, Xiao, et al.
Published: (2025)
Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
by: Zhao, Jiaqi, et al.
Published: (2025)
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
by: Wang, Haochuan, et al.
Published: (2024)
MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
by: Fabbri, Alexander R., et al.
Published: (2025)
Rethinking Metrics and Benchmarks of Video Anomaly Detection
by: Liu, Zihao, et al.
Published: (2025)
Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
by: Tie, Guiyao, et al.
Published: (2025)
SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
by: Li, Xinghang, et al.
Published: (2025)
Do LLM Agents Know How to Ground, Recover, and Assess? A Benchmark for Epistemic Competence in Information-Seeking Agents
by: Shao, Jiaqi, et al.
Published: (2025)
INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems
by: Tang, Bintao, et al.
Published: (2025)
Design and Report Benchmarks for Knowledge Work
by: Hua, Yining, et al.
Published: (2026)
PepBenchmark: A Standardized Benchmark for Peptide Machine Learning
by: Zhang, Jiahui, et al.
Published: (2026)
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
by: Luo, Shixian, et al.
Published: (2025)
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
by: Hua, Tianyu, et al.
Published: (2025)
SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese
by: Xu, Liang, et al.
Published: (2024)
Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs
by: Li, Yige, et al.
Published: (2026)
MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
by: Zhang, Mengyuan, et al.
Published: (2024)
SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
by: Liu, Zhiqiang, et al.
Published: (2025)
SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
by: Liu, Chaoqun, et al.
Published: (2025)
PFMBench: Protein Foundation Model Benchmark
by: Gao, Zhangyang, et al.
Published: (2025)
VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs
by: Gu, Zixuan, et al.
Published: (2025)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
by: Zhou, Ruiwen, et al.
Published: (2024)
Benchmarking LLMs for Predictive Applications in the Intensive Care Units
by: Malhotra, Chehak, et al.
Published: (2025)
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
by: Zhao, Chenchen, et al.
Published: (2025)
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
by: Toker, Gilat, et al.
Published: (2026)
SmartPlay: A Benchmark for LLMs as Intelligent Agents
by: Wu, Yue, et al.
Published: (2023)
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
by: Zhu, Yuke, et al.
Published: (2020)
Benchmarking Concept-Spilling Across Languages in LLMs
by: Badanin, Ilia, et al.
Published: (2026)