Bibliographic Details
Main Authors: Zhang, Shuangjie, Mallick, Bani K., Ni, Yang
Format: Preprint
Published: 2026
Subjects: Methodology
Online Access: https://arxiv.org/abs/2605.03178
_version_ 1866911654349897728
author Zhang, Shuangjie
Mallick, Bani K.
Ni, Yang
author_facet Zhang, Shuangjie
Mallick, Bani K.
Ni, Yang
contents Compositional data, which are vectors of proportions constrained to the probability simplex, arise frequently in modern scientific applications, including microbiome relative abundances across body sites and cell-type mixture weights derived from single-cell genomics. While regression methods for compositional data are well developed, no existing graphical model framework addresses the problem of learning conditional dependence structures among multiple compositional vectors. This paper introduces a novel framework for directed tree structure learning over compositional nodes. We employ the Kullback-Leibler divergence as the scoring function and model the conditional expectation of each child composition as a mixture of a baseline composition and a parent-driven component parameterized by a column-stochastic transition matrix. This formulation respects the simplex geometry, handles zero-inflated compositions gracefully, and, combined with a non-degeneracy condition on the transition matrix, ensures identifiability of edge directions from observational data. We prove consistency of structure recovery and derive finite-sample guarantees that characterize the required sample size in terms of the signal gap, node dimension, and penalty level. The efficacy of our approach is demonstrated through simulations and applications to multi-site microbiome data and single-cell data, yielding interpretable directed structures that align with known biological mechanisms.
format Preprint
id arxiv_https___arxiv_org_abs_2605_03178
institution arXiv
publishDate 2026
record_format arxiv
title Structure Learning for Directed Trees with Zero-Inflated Compositional Nodes
topic Methodology
url https://arxiv.org/abs/2605.03178