Staff View: Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Jin; Li, Qingquan; Du, Wenlong
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2407.07531

_version_	1866913424775053312
author	Liu, Jin; Li, Qingquan; Du, Wenlong
author_facet	Liu, Jin; Li, Qingquan; Du, Wenlong
contents	In current benchmarks for evaluating large language models (LLMs), there are issues such as evaluation content restriction, untimely updates, and lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". Through conducting a "physical examination" on LLMs, it utilizes specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendations for optimization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_07531
institution	arXiv
publishDate	2024
record_format	arxiv
title	Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2407.07531

Similar Items