| Main Authors: | Liu, Xinyu; Jin, Ke |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2408.10921 |
Similar Items
SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition
by: Osman, Mohamed, et al.
Published: (2024)
FinReflectKG -- EvalBench: Benchmarking Financial KG with Multi-Dimensional Evaluation
by: Dimino, Fabrizio, et al.
Published: (2025)
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
by: Yu, Linhao, et al.
Published: (2024)
KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding
by: Hwang, Bokwang, et al.
Published: (2025)
EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
by: Ma, Guoqing, et al.
Published: (2025)
PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor
by: Pan, Qianjun, et al.
Published: (2026)
Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study
by: Yao, Xuan, et al.
Published: (2025)
MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection
by: Li, Yupeng, et al.
Published: (2024)
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
by: Ye, Fangda, et al.
Published: (2026)
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
by: Zhang, Jiaxin, et al.
Published: (2024)
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset
by: Zhu, Jie, et al.
Published: (2024)
LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
by: Jubair, Sheikh, et al.
Published: (2025)
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
by: Chen, Qian, et al.
Published: (2026)
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
by: Pombal, José, et al.
Published: (2025)
SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
by: Li, Zhaohui, et al.
Published: (2026)
AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation
by: Zhang, Tanghaoran, et al.
Published: (2026)
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
by: Li, Zhong-Zhi, et al.
Published: (2024)
ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts
by: Liu, Shuang, et al.
Published: (2025)
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges
by: Wang, Clinton J., et al.
Published: (2025)
WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
by: Qian, Yaoyao, et al.
Published: (2025)
MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
by: Einarsson, Hafsteinn
Published: (2025)
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)
EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
by: Wang, Shengbo, et al.
Published: (2025)
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
by: Zhao, Chenchen, et al.
Published: (2025)
Are Large Language Models Good In-context Learners for Financial Sentiment Analysis?
by: Wei, Xinyu, et al.
Published: (2025)
Understanding Financial Reasoning in AI: A Multimodal Benchmark and Error Learning Approach
by: Deng, Shuangyan, et al.
Published: (2025)
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models
by: Liu, Yan, et al.
Published: (2024)
SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
by: Jiang, Sihang, et al.
Published: (2026)
LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
by: Gao, Hengjian, et al.
Published: (2026)
MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
by: Zhang, Mengyuan, et al.
Published: (2024)
ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation
by: Paul, Debalina Ghosh, et al.
Published: (2024)
UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos
by: Yang, Zhi, et al.
Published: (2026)
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
by: Mikhailov, Vladislav, et al.
Published: (2025)
CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding
by: Luo, Hanjun, et al.
Published: (2025)
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
by: Li, Chenxin, et al.
Published: (2026)
InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents
by: Zhu, Zhenghao, et al.
Published: (2025)
Multi-IaC-Eval: Benchmarking Cloud Infrastructure as Code Across Multiple Formats
by: Davidson, Sam, et al.
Published: (2025)
Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024
by: Chandra, Nuria Alina, et al.
Published: (2025)
OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models
by: Liu, Yuhe, et al.
Published: (2023)
RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
by: Pan, Tianjun, et al.
Published: (2026)