| Main Authors: | Yu, Yongan, Hu, Qingchen, Du, Xianda, Wang, Jiayin, Mo, Fengran, Sieber, Renee |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.20249 |
Similar Items
WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
by: Yu, Yongan, et al.
Published: (2025)
MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues
by: Liu, Zheyuan, et al.
Published: (2026)
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models
by: Wang, Jiayin, et al.
Published: (2024)
Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models
by: Zhang, Shuo, et al.
Published: (2025)
FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models
by: Li, Chuan, et al.
Published: (2025)
Multilingual Collaborative Defense for Large Language Models
by: Li, Hongliang, et al.
Published: (2025)
THiNK: Can Large Language Models Think-aloud?
by: Yu, Yongan, et al.
Published: (2025)
Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
by: Tian, Yuxing, et al.
Published: (2026)
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
by: Liu, Ziqiang, et al.
Published: (2024)
SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding
by: Zhang, Yazhou, et al.
Published: (2024)
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
by: Toraman, Çağrı, et al.
Published: (2026)
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
by: Du, Mengfei, et al.
Published: (2024)
SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
by: Xing, Wenpeng, et al.
Published: (2025)
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
by: Zheng, Yu, et al.
Published: (2025)
QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models
by: Wu, Yao, et al.
Published: (2026)
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
by: Hu, He, et al.
Published: (2025)
OR-Bench: An Over-Refusal Benchmark for Large Language Models
by: Cui, Justin, et al.
Published: (2024)
OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology
by: Zhou, Chengfeng, et al.
Published: (2025)
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
by: Kundu, Souvik, et al.
Published: (2025)
AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
by: Lee, Jaeho, et al.
Published: (2025)
InFoBench: Evaluating Instruction Following Ability in Large Language Models
by: Qin, Yiwei, et al.
Published: (2024)
DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models
by: Chung, Tsz Ting, et al.
Published: (2025)
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
by: Zhao, Yimin, et al.
Published: (2026)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
by: LI, Yizhi, et al.
Published: (2024)
TaskBench: Benchmarking Large Language Models for Task Automation
by: Shen, Yongliang, et al.
Published: (2023)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
by: Liu, Mianxin, et al.
Published: (2024)
Can Large Language Models Understand Preferences in Personalized Recommendation?
by: Tan, Zhaoxuan, et al.
Published: (2025)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
by: Li, Lijun, et al.
Published: (2024)
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
by: Tu, Shangqing, et al.
Published: (2023)
StakeBench: Evaluating Language Understanding Grounded in Market Commitment
by: Pei, Yunhua, et al.
Published: (2026)
CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
by: Feng, Jie, et al.
Published: (2024)
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
by: Lai, Peichao, et al.
Published: (2025)
RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)
TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health
by: Xiong, Zixin, et al.
Published: (2026)
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset
by: Zhu, Jie, et al.
Published: (2024)
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
by: Xu, Xin, et al.
Published: (2025)
DarkBench: Benchmarking Dark Patterns in Large Language Models
by: Kran, Esben, et al.
Published: (2025)
AlignBench: Benchmarking Chinese Alignment of Large Language Models
by: Liu, Xiao, et al.
Published: (2023)
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
by: Adelani, David Ifeoluwa, et al.
Published: (2024)
MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
by: Bai, Ge, et al.
Published: (2024)