Bibliographic Details
Main Authors: Gu, Ken, Shang, Ruoxi, Jiang, Ruien, Kuang, Keying, Lin, Richard-John, Lyu, Donghe, Mao, Yue, Pan, Youran, Wu, Teng, Yu, Jiaqian, Zhang, Yikun, Zhang, Tianmai M., Zhu, Lanyi, Merrill, Mike A., Heer, Jeffrey, Althoff, Tim
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2408.09667
contents Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.
format Preprint
id arxiv_https___arxiv_org_abs_2408_09667
institution arXiv
publishDate 2024
record_format arxiv
title BLADE: Benchmarking Language Model Agents for Data-Driven Science
topic Computation and Language
url https://arxiv.org/abs/2408.09667