| Main Authors: | Hou, Ruihui, Chen, Shencheng, Fan, Yongqi, Yu, Guangya, Zhu, Lifeng, Sun, Jing, Liu, Jingping, Ruan, Tong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2408.10039 |
Similar Items
CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
by: Yu, Guangya, et al.
Published: (2025)
KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration
by: Wang, Nan, et al.
Published: (2025)
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
by: Fan, Yongqi, et al.
Published: (2025)
MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens
by: Fan, Yongqi, et al.
Published: (2024)
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models
by: Li, Xiaomin, et al.
Published: (2025)
CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation
by: Hou, Ruihui, et al.
Published: (2026)
Can Multimodal Large Language Models Understand Spatial Relations?
by: Liu, Jingping, et al.
Published: (2025)
Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models
by: Huang, Zhongzhen, et al.
Published: (2024)
TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models
by: Zhang, Yiran, et al.
Published: (2025)
AF Adapter: Continual Pretraining for Building Chinese Biomedical Language Model
by: Yan, Yongyu, et al.
Published: (2022)
MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
by: Zhang, Mengyuan, et al.
Published: (2024)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)
M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
by: Kwan, Wai-Chung, et al.
Published: (2023)
MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models
by: Kwan, Wai-Chung, et al.
Published: (2024)
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark
by: Li, Zheqing, et al.
Published: (2025)
EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing
by: Gao, Fan, et al.
Published: (2025)
Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
by: Gu, Xiaojie, et al.
Published: (2026)
CLIMB: A Benchmark of Clinical Bias in Large Language Models
by: Zhang, Yubo, et al.
Published: (2024)
Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis
by: Fan, Xiaoyang, et al.
Published: (2026)
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation
by: Yin, Xunjian, et al.
Published: (2024)
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research
by: Sun, Liangtai, et al.
Published: (2023)
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
by: Yu, Linhao, et al.
Published: (2024)
Distributionally Robust Chance-Constrained Flexibility Planning for Integrated Energy System
by: Zhan, Sen, et al.
Published: (2021)
PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models
by: Liu, Yingen, et al.
Published: (2024)
Large Language Models for Causal Discovery: Current Landscape and Future Directions
by: Wan, Guangya, et al.
Published: (2024)
Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models
by: Yu, Wenhan, et al.
Published: (2025)
Generalizing Fair Top-$k$ Selection: An Integrative Approach
by: Cai, Guangya
Published: (2026)
Finding a Fair Scoring Function for Top-$k$ Selection: From Hardness to Practice
by: Cai, Guangya
Published: (2025)
Evaluating Proactive Risk Awareness of Large Language Models
by: Luo, Xuan, et al.
Published: (2026)
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
by: Zhu, Yakun, et al.
Published: (2025)
COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models
by: Ren, Yuchen, et al.
Published: (2024)
MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications
by: He, Qing, et al.
Published: (2026)
Benchmarking Reasoning Robustness in Large Language Models
by: Yu, Tong, et al.
Published: (2025)
Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks
by: Hong, Jindong, et al.
Published: (2025)
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
by: Chen, Zehui, et al.
Published: (2023)
ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning
by: Tang, Yuqi, et al.
Published: (2025)
Large Language Models for Multi-Robot Systems: A Survey
by: Li, Peihan, et al.
Published: (2025)
Systematic Outliers in Large Language Models
by: An, Yongqi, et al.
Published: (2025)
LLM-Flock: Decentralized Multi-Robot Flocking via Large Language Models and Influence-Based Consensus
by: Li, Peihan, et al.
Published: (2025)
Self-Pluralising Culture Alignment for Large Language Models
by: Xu, Shaoyang, et al.
Published: (2024)