Bibliographic Details
Main Authors: Liu, Jin; Li, Qingquan; Du, Wenlong
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2407.07531
author Liu, Jin
Li, Qingquan
Du, Wenlong
contents Current benchmarks for evaluating large language models (LLMs) suffer from restricted evaluation content, untimely updates, and a lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". By conducting a "physical examination" of LLMs, it uses specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendations for optimization.
format Preprint
id arxiv_https___arxiv_org_abs_2407_07531
institution arXiv
publishDate 2024
record_format arxiv
title Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
topic Computation and Language
url https://arxiv.org/abs/2407.07531