Staff View: Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations :: Library Catalog

Bibliographic Details
Main Author:	Miller, Evan
Format:	Preprint
Published:	2024
Subjects:	Applications; Computation and Language
Online Access:	https://arxiv.org/abs/2411.00640

_version_	1866913569748025344
author	Miller, Evan
author_facet	Miller, Evan
contents	Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_00640
institution	arXiv
publishDate	2024
record_format	arxiv
title	Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
topic	Applications; Computation and Language
url	https://arxiv.org/abs/2411.00640

Similar Items