Staff View: Position: AI for Science Should Treat Measurement-to-Dataset Pipelines as Inference Components :: Library Catalog

Bibliographic Details
Main Authors:	Zhan, Ling; Yu, Xiaoyao; Jia, Tao
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.24558

_version_	1866917527946264576
author	Zhan, Ling; Yu, Xiaoyao; Jia, Tao
author_facet	Zhan, Ling; Yu, Xiaoyao; Jia, Tao
contents	AI for Science (AI4Science) workflows often treat the released dataset as a fixed interface to the underlying system. However, in domains relying on indirect observation, the learner observes a derivative representation produced by multi-stage measurement, reconstruction, and preprocessing pipelines. We argue that these measurement-to-dataset pipelines are inference components: treating their outputs as "given data" freezes an observation model and obscures uncertainty over feasible pipeline choices. We identify three failure modes arising from this "frozen lens": (C1) hidden hypothesis space, where the released dataset does not specify the pipeline configuration or its validity conditions; (C2) uncertified transportability, where a pipeline may be documented but its regime of validity is untested, so failures under distribution shift cannot be adjudicated; (C3) ungoverned multiplicity, where many defensible pipelines exist and dispersion is real but not propagated into uncertainty-aware evidence. We stress-test these claims with a large-scale neuroscience empirical audit, finding a survival rate of approximately 0.0004% under a cross-dataset stability criterion. We call on the AI4Science community to make pipelines computable inference objects via domain-specific Computable Observation Frameworks. This shift enables quantifying pipeline adequacy and stability, converting implicit implementation choices into auditable, reproducible, and cumulative scientific evidence.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_24558
institution	arXiv
publishDate	2026
record_format	arxiv
title	Position: AI for Science Should Treat Measurement-to-Dataset Pipelines as Inference Components
topic	Machine Learning
url	https://arxiv.org/abs/2605.24558
