Bibliographic Details
Main Authors: Qu, Yuanke, Xu, Xiaoya, Zhang, Hengtao
Format: Preprint
Published: 2026
Subjects: Methodology
Online Access: https://arxiv.org/abs/2605.05772
author Qu, Yuanke
Xu, Xiaoya
Zhang, Hengtao
contents Double machine learning (DML) delivers valid inference on low-dimensional causal parameters while permitting flexible nuisance estimation, but its computational cost becomes prohibitive once cross-fitted learners must be trained on massive observational data. Applying DML to a uniformly drawn subsample alleviates this burden, yet such a reduction disregards the geometry of the covariate space and can exacerbate treated-control imbalance as well as overlap deficiency. We propose Uniform Design Double Machine Learning (UD-DML), a design-based subsampling strategy for average treatment effect (ATE) estimation. UD-DML first constructs a low-discrepancy skeleton in a PCA-rotated covariate space under the mixture-discrepancy criterion, and then assigns, to each skeleton point, the nearest treated and control units via KD-tree search. The resulting matched subsample is, by construction, both representative of the full covariate distribution and balanced across treatment arms; cross-fitted DML is subsequently applied to it. We establish discrepancy-based guarantees for representativeness and balance, and prove that the UD-DML estimator is $\sqrt{r}$-asymptotically normal under mild conditions, where the selected subsample size $r \ll n$. The dominant nuisance-fitting cost is thereby reduced from the $n$-scale to the $r$-scale. Monte Carlo experiments show that UD-DML attains lower RMSE, narrower confidence intervals and more reliable coverage than uniform subsampling, with the largest gains in low-overlap and misspecified regimes. An application to a large observational dataset further demonstrates its practical feasibility.
format Preprint
id arxiv_https___arxiv_org_abs_2605_05772
institution arXiv
publishDate 2026
record_format arxiv
title UD-DML: Uniform Design Subsampling for Double Machine Learning over Massive Data
topic Methodology
url https://arxiv.org/abs/2605.05772
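
The subsampling step described in the abstract (rotate the covariates with PCA, lay down a low-discrepancy skeleton, then attach the nearest treated and nearest control unit to each skeleton point with a KD-tree) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `ud_subsample`, the synthetic data, the use of a scrambled Halton sequence in place of a mixture-discrepancy uniform design, and the mapping of skeleton points through empirical marginal quantiles are all assumptions made for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import qmc


def ud_subsample(X, t, r, seed=0):
    """Hypothetical helper: return (treated_idx, control_idx), each of length r.

    Each of the r skeleton points is matched to its nearest treated unit and
    its nearest control unit in PCA-rotated covariate space.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t).astype(bool)

    # PCA rotation: centre the covariates, project onto right singular vectors.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ vt.T

    # Low-discrepancy points in [0, 1)^d. ASSUMPTION: a scrambled Halton
    # sequence stands in for the paper's mixture-discrepancy uniform design.
    u = qmc.Halton(d=Z.shape[1], scramble=True, seed=seed).random(r)

    # ASSUMPTION: map each coordinate through the empirical quantile function
    # so the skeleton follows the marginals of the rotated covariates.
    skeleton = np.column_stack(
        [np.quantile(Z[:, j], u[:, j]) for j in range(Z.shape[1])]
    )

    # One KD-tree per treatment arm; nearest-neighbour query per skeleton point.
    # The same unit may be selected for more than one skeleton point.
    idx_t = np.flatnonzero(t)
    idx_c = np.flatnonzero(~t)
    _, nn_t = cKDTree(Z[idx_t]).query(skeleton, k=1)
    _, nn_c = cKDTree(Z[idx_c]).query(skeleton, k=1)
    return idx_t[nn_t], idx_c[nn_c]


# Synthetic example: treatment probability depends on the first covariate.
rng = np.random.default_rng(0)
n, d, r = 5000, 3, 200
X = rng.normal(size=(n, d))
t = rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))

treated_idx, control_idx = ud_subsample(X, t, r)
```

The two index arrays define a matched subsample with r treated and r control units, on which a cross-fitted DML estimator of the ATE would then be fitted. The sketch stops at the subsample and leaves out the estimation step and any reweighting it might need.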