Staff View: F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods :: Library Catalog

Bibliographic Details
Main Authors:	Sun, Yu; Chen, Keyu; Wang, Shujie; Li, Peiji; Guo, Qipeng; Yan, Hang; Qiu, Xipeng; Huang, Xuanjing; Lin, Dahua
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2401.14869

_version_	1866911995105640448
author	Sun, Yu; Chen, Keyu; Wang, Shujie; Li, Peiji; Guo, Qipeng; Yan, Hang; Qiu, Xipeng; Huang, Xuanjing; Lin, Dahua
author_facet	Sun, Yu; Chen, Keyu; Wang, Shujie; Li, Peiji; Guo, Qipeng; Yan, Hang; Qiu, Xipeng; Huang, Xuanjing; Lin, Dahua
contents	Large language models (LLMs) have garnered significant attention for their unprecedented performance, leading to a growing number of studies evaluating LLMs. However, these evaluation benchmarks are limited to assessing instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models. However, in the absence of references, large models have shown limited ability to discern subtle differences. To bridge the gap, we propose F-Eval, a bilingual evaluation benchmark for fundamental abilities, including expression, commonsense, and logic. The tasks in F-Eval include multi-choice objective tasks, open-ended objective tasks, reference-based subjective tasks, and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods that serve as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods achieve higher correlation coefficients and larger distinction than other evaluators. Additionally, we discuss the influence of different model sizes, dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_14869
institution	arXiv
publishDate	2024
record_format	arxiv
title	F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
topic	Computation and Language
url	https://arxiv.org/abs/2401.14869

Similar Items