Staff View: BLADE: Benchmarking Language Model Agents for Data-Driven Science :: Library Catalog

Bibliographic Details
Main Authors:	Gu, Ken; Shang, Ruoxi; Jiang, Ruien; Kuang, Keying; Lin, Richard-John; Lyu, Donghe; Mao, Yue; Pan, Youran; Wu, Teng; Yu, Jiaqian; Zhang, Yikun; Zhang, Tianmai M.; Zhu, Lanyi; Merrill, Mike A.; Heer, Jeffrey; Althoff, Tim
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2408.09667

_version_	1866908639298584576
author	Gu, Ken; Shang, Ruoxi; Jiang, Ruien; Kuang, Keying; Lin, Richard-John; Lyu, Donghe; Mao, Yue; Pan, Youran; Wu, Teng; Yu, Jiaqian; Zhang, Yikun; Zhang, Tianmai M.; Zhu, Lanyi; Merrill, Mike A.; Heer, Jeffrey; Althoff, Tim
author_facet	Gu, Ken; Shang, Ruoxi; Jiang, Ruien; Kuang, Keying; Lin, Richard-John; Lyu, Donghe; Mao, Yue; Pan, Youran; Wu, Teng; Yu, Jiaqian; Zhang, Yikun; Zhang, Tianmai M.; Zhu, Lanyi; Merrill, Mike A.; Heer, Jeffrey; Althoff, Tim
contents	Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_09667
institution	arXiv
publishDate	2024
record_format	arxiv
title	BLADE: Benchmarking Language Model Agents for Data-Driven Science
topic	Computation and Language
url	https://arxiv.org/abs/2408.09667
