Staff View: DMCD: Semantic-Statistical Framework for Causal Discovery :: Library Catalog

Bibliographic Details
Main Authors:	KaPatel, Samarth; Nikiforova, Sofia; Saggese, Giacinto Paolo; Smith, Paul
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.20333

_version_	1866908849669144576
author	KaPatel, Samarth; Nikiforova, Sofia; Saggese, Giacinto Paolo; Smith, Paul
author_facet	KaPatel, Samarth; Nikiforova, Sofia; Saggese, Giacinto Paolo; Smith, Paul
contents	We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs. Overall, our results demonstrate that combining semantic priors with principled statistical verification yields a high-performing and practically effective approach to causal structure learning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_20333
institution	arXiv
publishDate	2026
record_format	arxiv
title	DMCD: Semantic-Statistical Framework for Causal Discovery
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.20333
