Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Li, Xiang Lisa; Kaiyom, Farzaan; Liu, Evan Zheran; Mai, Yifan; Liang, Percy; Hashimoto, Tatsunori
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2407.08351

_version_	1866909515886100480
author	Li, Xiang Lisa; Kaiyom, Farzaan; Liu, Evan Zheran; Mai, Yifan; Liang, Percy; Hashimoto, Tatsunori
author_facet	Li, Xiang Lisa; Kaiyom, Farzaan; Liu, Evan Zheran; Mai, Yifan; Liang, Percy; Hashimoto, Tatsunori
contents	We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty end, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism, while GPT-4o fails to decline harmful requests about cryptocurrency scams.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_08351
institution	arXiv
publishDate	2024
record_format	arxiv
title	AutoBencher: Towards Declarative Benchmark Construction
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2407.08351

Similar Items