Bibliographic Details
Main Authors: Bartlett, Thomas E.; Chandna, Swati; Roy, Sandipan
Format: Preprint
Published: 2023
Subjects: Methodology
Online Access: https://arxiv.org/abs/2303.02498
author Bartlett, Thomas E.
Chandna, Swati
Roy, Sandipan
contents Important tasks in the study of genomic data include identifying groups of similar cells (for example, by clustering) and visualising data summaries (for example, by dimensionality reduction). In this paper, we develop a novel approach to these tasks in the context of single-cell genomic data. We model the observed genomic count matrix $\mathbf{X}\in\mathbb{Z}_{\geq0}^{p\times n}$ by representing its measurements as a bipartite network with multi-edges. Using this first-principles network model of the raw data, we cluster single cells in a suitably identified $d$-dimensional Laplacian eigenspace (LE) via a Gaussian mixture model (GMM-LE), and employ UMAP to project the LE non-linearly to two dimensions for visualisation (UMAP-LE). This LE representation of the data points estimates transformed latent positions (of genes and cells) under a latent position statistical model of nodes in a bipartite stochastic network. We demonstrate how transformations of these estimated latent positions enable fine-grained clustering and visualisation of single-cell genomic data, by application to data from three recent genomics studies in different biological contexts. In each application, the clusters of cells learned by our proposed methodology correspond to cells expressing specific marker genes that were independently defined by domain experts. In this validation setting, our proposed clustering methodology outperforms the industry standard for these data. Furthermore, we validate components of the LE decomposition of the data by contrasting healthy cells from normal and at-risk groups in a machine-learning model, thereby identifying an LE cancer biomarker that significantly predicts long-term patient survival outcome in two independent validation cohorts with data from 1904 and 1091 individuals.
format Preprint
id arxiv_https___arxiv_org_abs_2303_02498
institution arXiv
publishDate 2023
record_format arxiv
title A stochastic network approach to clustering and visualising single-cell genomic count data
topic Methodology
url https://arxiv.org/abs/2303.02498
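The GMM-LE step described in the contents field (count matrix, then bipartite network, then Laplacian eigenspace, then Gaussian mixture clustering) might be sketched as follows. This is a minimal illustration and not the authors' implementation: the degree-normalised form of the Laplacian, the optional regulariser `tau`, the use of the leading right singular vectors as cell coordinates, the function names `laplacian_eigenspace` and `gmm_le`, and the simulated Poisson counts are all assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def laplacian_eigenspace(X, d, tau=0.0):
    """Embed the n cells (columns of the p x n count matrix X) in d dimensions.

    Assumption (not taken from the paper): a degree-normalised bipartite
    Laplacian L = D_gene^{-1/2} X D_cell^{-1/2}, with optional regulariser tau.
    """
    X = np.asarray(X, dtype=float)
    gene_deg = X.sum(axis=1) + tau  # multi-edge degree of each gene node
    cell_deg = X.sum(axis=0) + tau  # multi-edge degree of each cell node
    if np.any(gene_deg <= 0) or np.any(cell_deg <= 0):
        raise ValueError("drop all-zero genes/cells or set tau > 0")
    L = X / np.sqrt(gene_deg)[:, None] / np.sqrt(cell_deg)[None, :]
    # Right singular vectors give cell coordinates; rows of Vt are ordered
    # by decreasing singular value, so the first d rows span the eigenspace.
    _, _, Vt = np.linalg.svd(L, full_matrices=False)
    return Vt[:d].T  # shape (n, d)


def gmm_le(X, d, k, seed=0):
    """Cluster cells with a k-component Gaussian mixture in the d-dim LE."""
    Z = laplacian_eigenspace(X, d)
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=seed)
    return gmm.fit_predict(Z), Z


# Simulated toy data (hypothetical): 40 genes, 60 cells, two cell groups,
# each group over-expressing its own block of 20 genes.
rng = np.random.default_rng(0)
p, n = 40, 60
truth = np.repeat([0, 1], n // 2)
rates = np.full((p, n), 0.5)
rates[:20, truth == 0] = 5.0
rates[20:, truth == 1] = 5.0
X = rng.poisson(rates)

labels, Z = gmm_le(X, d=2, k=2)
```

For the UMAP-LE visualisation, one option would be to pass `Z` to `umap.UMAP(n_components=2).fit_transform(Z)` from the separate umap-learn package. How $d$ and the number of mixture components are chosen is not stated in this record, so both are fixed by hand here.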