Bibliographic Details
Main Authors: Lei, Zhikai, Liang, Tianyi, Hu, Hanglei, Zhang, Jin, Zhou, Yunhua, Shao, Yunfan, Li, Linyang, Li, Chenchui, Wang, Changbo, Yan, Hang, Guo, Qipeng
Format: Preprint
Published: 2024
Subjects: Computation and Language, Artificial Intelligence
Online Access: https://arxiv.org/abs/2412.10056
author Lei, Zhikai
Liang, Tianyi
Hu, Hanglei
Zhang, Jin
Zhou, Yunhua
Shao, Yunfan
Li, Linyang
Li, Chenchui
Wang, Changbo
Yan, Hang
Guo, Qipeng
contents Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations of representative models released prior to the Gaokao. Contrary to the prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers, as well as recurring mistake patterns. We find that these phenomena are well-grounded in the motivations behind OpenAI o1, and that o1's reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
format Preprint
id arxiv_https___arxiv_org_abs_2412_10056
institution arXiv
publishDate 2024
record_format arxiv
title GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2412.10056