| Main Authors: | Luo, Xiaotian, Jiang, Xun, Wu, Jiangcheng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.06846 |
Similar Items
ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
by: Nguyen, Bang, et al.
Published: (2026)
MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models
by: Zuo, Kaiwen, et al.
Published: (2024)
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
by: Tang, Xiangru, et al.
Published: (2025)
QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models
by: Wu, Yao, et al.
Published: (2026)
The 2nd FutureDial Challenge: Dialog Systems with Retrieval Augmented Generation (FutureDial-RAG)
by: Cai, Yucheng, et al.
Published: (2024)
Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs
by: Kim, Taejin, et al.
Published: (2025)
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
by: Cao, Hongye, et al.
Published: (2025)
MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation
by: Shang, Fangxin, et al.
Published: (2025)
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
by: Yang, Wang, et al.
Published: (2025)
Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
by: Ding, Meidan, et al.
Published: (2025)
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol
by: Zhu, Jie, et al.
Published: (2026)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)
NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)
SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
by: Huang, Haiduo, et al.
Published: (2025)
MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
by: Hsu, Hsin-Ling, et al.
Published: (2026)
MedExQA: Medical Question Answering Benchmark with Multiple Explanations
by: Kim, Yunsoo, et al.
Published: (2024)
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
by: Deng, Shihan, et al.
Published: (2024)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
by: Khandekar, Nikhil, et al.
Published: (2024)
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
by: Cantini, Riccardo, et al.
Published: (2025)
GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
by: Zhao, Junjie, et al.
Published: (2026)
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
by: Li, Weiyue, et al.
Published: (2026)
Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning
by: Zhang, Xiaotian, et al.
Published: (2025)
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
by: Yang, Wang, et al.
Published: (2026)
MDD-5k: A New Diagnostic Conversation Dataset for Mental Disorders Synthesized via Neuro-Symbolic LLM Agents
by: Yin, Congchi, et al.
Published: (2024)
Dial-MAE: ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems
by: Su, Zhenpeng, et al.
Published: (2023)
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
by: Xiao, Jianfei, et al.
Published: (2026)
HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
by: Jiang, Zhuohang, et al.
Published: (2025)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
by: Wang, Xuwu, et al.
Published: (2024)
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
by: Galisai, Marcello, et al.
Published: (2026)
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)
WritingBench: A Comprehensive Benchmark for Generative Writing
by: Wu, Yuning, et al.
Published: (2025)
MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills
by: Yao, Zonghai, et al.
Published: (2024)
Benchmarking and Improving LLM Robustness for Personalized Generation
by: Okite, Chimaobi, et al.
Published: (2025)
oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
by: Xu, Ruiling, et al.
Published: (2025)
VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
by: He, Wei, et al.
Published: (2025)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
by: Sheshadri, Abhay, et al.
Published: (2024)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
by: Zhao, Haochen, et al.
Published: (2024)