Bibliographic Details
Main Authors: Li, Zhenghan, Wang, Tianying
Format: Preprint
Published: 2026
Subjects: Methodology
Online Access: https://arxiv.org/abs/2605.15469
author Li, Zhenghan
Wang, Tianying
contents High-dimensional compositional covariates, often derived from count data, are subject to measurement error and are frequently analyzed after aggregation along a prespecified tree to improve interpretability in applications such as microbiome studies. Existing approaches typically handle either tree-guided compositional regression or errors-in-variables correction, but they do not account for the hierarchical contamination induced by their interaction. We show that tree aggregation turns leaf-level measurement error into level-dependent, correlated contamination across aggregated nodes, which inflates bias, weakens concentration rates for corrected estimating quantities, and leads to unstable variable selection for naive approaches. We propose Tree-Aggregated Regression with Correction for Observation Error (TARCO), which integrates bias-corrected estimating quantities with a tree-aware positive semidefinite stabilization and sparse regularization, with tuning selected by cross-validation based on the corrected objective. The resulting convex program can be solved with scalable algorithms. We establish finite-sample bounds for prediction and estimation errors and prove sign consistency under conditions that explicitly reflect tree heterogeneity. The guarantees persist when the measurement-error covariance is replaced by a consistent estimator. Simulations across multiple tree depths and a microbiome application demonstrate improved estimation accuracy, support recovery, and aggregation-level interpretability compared with methods that ignore the interaction between tree aggregation and measurement error.
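The abstract's claim that tree aggregation turns leaf-level measurement error into level-dependent, correlated contamination can be illustrated with a small numerical sketch. The following is not the authors' TARCO implementation. It assumes a simple additive error model W = X + U at the leaves, sum aggregation along a hypothetical four-leaf tree, and a known leaf error covariance. The helper names (`aggregation_matrix`, `aggregated_error_cov`, `corrected_gram`, `psd_project`) are invented for illustration, and the eigenvalue clipping is a generic positive semidefinite projection standing in for the tree-aware stabilization the abstract describes.

```python
import numpy as np


def aggregation_matrix(node_leaves, n_leaves):
    """Row k has ones at the leaves that descend from node k."""
    A = np.zeros((len(node_leaves), n_leaves))
    for k, leaves in enumerate(node_leaves):
        A[k, list(leaves)] = 1.0
    return A


def aggregated_error_cov(A, sigma_leaf):
    """Covariance of the aggregated error A @ U when Cov(U) = sigma_leaf."""
    return A @ sigma_leaf @ A.T


def corrected_gram(W_agg, sigma_agg):
    """Bias-corrected Gram matrix: subtract the error covariance."""
    n = W_agg.shape[0]
    return W_agg.T @ W_agg / n - sigma_agg


def psd_project(M, floor=0.0):
    """Generic PSD projection: symmetrize, then clip eigenvalues at `floor`."""
    M = (M + M.T) / 2.0
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, floor)) @ vecs.T


# Hypothetical tree: two internal nodes over leaves {0,1} and {2,3}, plus a root.
node_leaves = [(0, 1), (2, 3), (0, 1, 2, 3)]
A = aggregation_matrix(node_leaves, n_leaves=4)

# Assumed independent leaf errors with common variance 0.5.
sigma_leaf = 0.5 * np.eye(4)
sigma_agg = aggregated_error_cov(A, sigma_leaf)
# The diagonal grows with the number of leaves under a node (level dependence),
# and nested nodes share leaves, so their errors are correlated.

# Simulate contaminated leaf covariates and aggregate them along the tree.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                        # unobserved true leaf covariates
U = rng.normal(scale=np.sqrt(0.5), size=(n, 4))    # leaf-level measurement error
W_agg = (X + U) @ A.T                              # observed aggregated covariates

gram_corrected = corrected_gram(W_agg, sigma_agg)  # may be indefinite
gram_psd = psd_project(gram_corrected)             # stabilized for convex fitting
```

In this toy setup the aggregated error covariance is `A @ sigma_leaf @ A.T`, so the root node carries twice the error variance of its children and is correlated with both. Subtracting that matrix from the naive Gram matrix can leave negative eigenvalues, which is why some stabilization step is needed before the corrected quantity enters a convex objective.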
format Preprint
id arxiv_https___arxiv_org_abs_2605_15469
institution arXiv
publishDate 2026
record_format arxiv
title Tree-aggregated regression for compositional data with measurement errors
topic Methodology
url https://arxiv.org/abs/2605.15469