Saved in:
Bibliographic Details
Main Authors: Li, Xiang Lisa, Kaiyom, Farzaan, Liu, Evan Zheran, Mai, Yifan, Liang, Percy, Hashimoto, Tatsunori
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2407.08351
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866909515886100480
author Li, Xiang Lisa
Kaiyom, Farzaan
Liu, Evan Zheran
Mai, Yifan
Liang, Percy
Hashimoto, Tatsunori
author_facet Li, Xiang Lisa
Kaiyom, Farzaan
Liu, Evan Zheran
Mai, Yifan
Liang, Percy
Hashimoto, Tatsunori
contents We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty ends, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism while GPT-4o fails to decline harmful requests about cryptocurrency scams.
format Preprint
id arxiv_https___arxiv_org_abs_2407_08351
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle AutoBencher: Towards Declarative Benchmark Construction
Li, Xiang Lisa
Kaiyom, Farzaan
Liu, Evan Zheran
Mai, Yifan
Liang, Percy
Hashimoto, Tatsunori
Computation and Language
Machine Learning
We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty ends, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism while GPT-4o fails to decline harmful requests about cryptocurrency scams.
title AutoBencher: Towards Declarative Benchmark Construction
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2407.08351