Bibliographic Details
Main Authors: Yang, Junlin; Zhang, Dylan; Song, Xiangchen; Dai, Qirun; Liu, Xiao; Chen, Yuen; Vashishtha, Aniket; Shi, Jing; Tan, Chenhao; Peng, Hao
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2605.26029
author Yang, Junlin
Zhang, Dylan
Song, Xiangchen
Dai, Qirun
Liu, Xiao
Chen, Yuen
Vashishtha, Aniket
Shi, Jing
Tan, Chenhao
Peng, Hao
contents We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
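The abstract contrasts task accuracy with an edge-level $F_1$ over the recovered causal graph. The sketch below is illustrative only and is not taken from the paper: it assumes a toy noise-free linear SCM with three hypothetical variables (`A`, `B`, `C`) and made-up coefficients, and it assumes that edge $F_1$ means the standard harmonic mean of precision and recall over directed edges. The record does not give CausaLab's exact definition of "all-edge $F_1$", so the metric here may differ from the authors'.

```python
from typing import Dict, Optional, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def edge_f1(true_edges: Set[Edge], predicted_edges: Set[Edge]) -> float:
    """F1 over directed edges; returns 0.0 when no predicted edge is correct."""
    true_positives = len(true_edges & predicted_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


def simulate(a_value: float, do_b: Optional[float] = None) -> Dict[str, float]:
    """Toy SCM with edges A -> B, B -> C, A -> C (coefficients are assumptions)."""
    a = a_value
    # An intervention do(B = do_b) overrides B's structural equation,
    # which removes the influence of A on B.
    b = 2.0 * a if do_b is None else do_b
    c = 3.0 * b - 1.0 * a
    return {"A": a, "B": b, "C": c}


true_graph = {("A", "B"), ("B", "C"), ("A", "C")}
guess = {("A", "B"), ("B", "C"), ("C", "A")}  # one edge reversed

score = edge_f1(true_graph, guess)
observed = simulate(1.0)
intervened = simulate(1.0, do_b=0.0)
```

In this toy setting the reversed edge counts as both a false positive and a false negative. The example shows one way an agent can get most of the structure right and still score well below 1.0 on the edge metric.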
format Preprint
id arxiv_https___arxiv_org_abs_2605_26029
institution arXiv
publishDate 2026
record_format arxiv
title CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
topic Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2605.26029