Bibliographic Details
Main Authors: Zhang, Ivy; Rothenhäusler, Dominik
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2504.06570
Abstract:
  • Researchers often face choices between multiple data sources that differ in quality, cost, and representativeness. Which sources will most improve predictive performance? We study this data prioritization problem under a random distribution shift model, where candidate sources arise from random perturbations to a target population. We propose the Data Usefulness Coefficient (DUC), which predicts the reduction in prediction error from adding a dataset to training, using only covariate summary statistics and no outcome data. We prove that under random shifts, covariate differences between sources are informative about outcome prediction quality. Through theory and experiments on synthetic and real data, we demonstrate that DUC-based selection outperforms alternative strategies, allowing more efficient resource allocation across heterogeneous data sources. The method provides interpretable rankings between candidate datasets and works for any data modality, including ordinal, categorical, and continuous data.