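| Main Authors: | Yang, Junlin; Zhang, Dylan; Song, Xiangchen; Dai, Qirun; Liu, Xiao; Chen, Yuen; Vashishtha, Aniket; Shi, Jing; Tan, Chenhao; Peng, Hao |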
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2605.26029 |
| author | Yang, Junlin; Zhang, Dylan; Song, Xiangchen; Dai, Qirun; Liu, Xiao; Chen, Yuen; Vashishtha, Aniket; Shi, Jing; Tan, Chenhao; Peng, Hao |
| contents | We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2605_26029 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists |
| topic | Artificial Intelligence; Computation and Language |
| url | https://arxiv.org/abs/2605.26029 |
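
The abstract reports structural fidelity as an all-edge $F_1$ score but this record does not define it. As a rough illustration only, the sketch below assumes the conventional directed-edge definition: precision and recall over the set of (cause, effect) pairs, compared between the hidden graph and the graph an agent reports. The function name `edge_f1`, the node labels, and the example graphs are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # (cause, effect); direction matters


def edge_f1(true_edges: Iterable[Edge], predicted_edges: Iterable[Edge]) -> float:
    """Directed-edge F1 between a ground-truth graph and a predicted graph.

    Assumption: an edge counts as correct only if both endpoints and the
    direction match. The paper's exact metric may differ.
    """
    true_set = set(true_edges)
    pred_set = set(predicted_edges)

    # Two empty graphs agree perfectly.
    if not true_set and not pred_set:
        return 1.0

    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 4-node example: the agent finds two of three true edges,
# reverses one, and adds one spurious edge.
true_graph = [("A", "B"), ("B", "C"), ("C", "D")]
agent_graph = [("A", "B"), ("C", "B"), ("C", "D"), ("A", "D")]

score = edge_f1(true_graph, agent_graph)
print(round(score, 3))  # 0.571
```

Under this definition a reversed edge is penalized twice, once as a missed true edge and once as a spurious one. That is one way an agent could predict the target variable well while still scoring low on structure, which is the kind of gap the abstract describes.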