| Main Authors: | Liu, Jin, Li, Qingquan, Du, Wenlong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.07531 |
Similar Items
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
by: Guo, Qianhong, et al.
Published: (2025)
Ethical Considerations of Large Language Models in Game Playing
by: Zhang, Qingquan, et al.
Published: (2025)
OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models
by: Wang, Shuai, et al.
Published: (2024)
LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models
by: Liu, Chuang, et al.
Published: (2024)
FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
by: Liu, Shu, et al.
Published: (2024)
From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation
by: Sibaee, Serry, et al.
Published: (2025)
JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
by: Liu, Junyu, et al.
Published: (2026)
A Paradigm Shift: The Future of Machine Translation Lies with Large Language Models
by: Lyu, Chenyang, et al.
Published: (2023)
Exploring Accuracy-Fairness Trade-off in Large Language Models
by: Zhang, Qingquan, et al.
Published: (2024)
UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
by: Zheng, Yu, et al.
Published: (2025)
A Comprehensive Evaluation of Quantization Strategies for Large Language Models
by: Jin, Renren, et al.
Published: (2024)
WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models
by: Zhao, Wenlong, et al.
Published: (2024)
The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models
by: Liu, Yilun, et al.
Published: (2026)
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models
by: Liu, Yan, et al.
Published: (2024)
Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
by: Xu, Liuchang, et al.
Published: (2024)
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
by: Xu, Haoran, et al.
Published: (2023)
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
by: Cui, Tianyu, et al.
Published: (2024)
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
by: Li, Zongxia, et al.
Published: (2025)
LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models
by: Li, Haitao, et al.
Published: (2024)
EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models
by: Chen, Yuyan, et al.
Published: (2024)
Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models
by: Wu, Haotian, et al.
Published: (2025)
CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
by: Yu, Guangya, et al.
Published: (2025)
McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
by: Lan, Tian, et al.
Published: (2025)
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models
by: Liu, Yang, et al.
Published: (2024)
A Novel Paradigm Boosting Translation Capabilities of Large Language Models
by: Guo, Jiaxin, et al.
Published: (2024)
Beyond Instrumental and Substitutive Paradigms: Introducing Machine Culture as an Emergent Phenomenon in Large Language Models
by: Hu, Yueqing, et al.
Published: (2026)
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
by: Fu, Lingyue, et al.
Published: (2023)
Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework
by: Jin, Jiandong, et al.
Published: (2024)
Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
by: Au, Steven, et al.
Published: (2026)
RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
by: Shen, Tianhao, et al.
Published: (2023)
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
by: Zhu, Qin, et al.
Published: (2024)
MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models
by: Kwan, Wai-Chung, et al.
Published: (2024)
New Evaluation Paradigm for Lexical Simplification
by: Qiang, Jipeng, et al.
Published: (2025)
MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications
by: He, Qing, et al.
Published: (2026)
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
by: Xuan, Weihao, et al.
Published: (2025)
Evaluating the Performance of Large Language Models on GAOKAO Benchmark
by: Zhang, Xiaotian, et al.
Published: (2023)
WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models
by: Yu, Yongan, et al.
Published: (2025)
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models
by: Castillo-Bolado, David, et al.
Published: (2024)
JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
by: Liu, Shuyi, et al.
Published: (2025)
NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates
by: Deng, Hexuan, et al.
Published: (2024)