Staff View: A stochastic network approach to clustering and visualising single-cell genomic count data :: Library Catalog

Bibliographic Details
Main Authors:	Bartlett, Thomas E.; Chandna, Swati; Roy, Sandipan
Format:	Preprint
Published:	2023
Subjects:	Methodology
Online Access:	https://arxiv.org/abs/2303.02498

_version_	1866916434047664128
author	Bartlett, Thomas E.; Chandna, Swati; Roy, Sandipan
author_facet	Bartlett, Thomas E.; Chandna, Swati; Roy, Sandipan
contents	Important tasks in the study of genomic data include the identification of groups of similar cells (for example by clustering), and visualisation of data summaries (for example by dimensional reduction). In this paper, we develop a novel approach to these tasks in the context of single-cell genomic data. To do so, we propose to model the observed genomic data count matrix $\mathbf{X}\in\mathbb{Z}_{\geq0}^{p\times n}$, by representing these measurements as a bipartite network with multi-edges. Utilising this first-principles network model of the raw data, we cluster single cells in a suitably identified $d$-dimensional Laplacian Eigenspace (LE) via a Gaussian mixture model (GMM-LE), and employ UMAP to non-linearly project the LE to two dimensions for visualisation (UMAP-LE). This LE representation of the data-points estimates transformed latent positions (of genes and cells), under a latent position statistical model of nodes in a bipartite stochastic network. We demonstrate how transformations of these estimated latent positions can enable fine-grained clustering and visualisation of single-cell genomic data, by application to data from three recent genomics studies in different biological contexts. In each data application, clusters of cells independently learned by our proposed methodology are found to correspond to cells expressing specific marker genes that were independently defined by domain experts. In this validation setting, our proposed clustering methodology outperforms the industry-standard for these data. Furthermore, we validate components of the LE decomposition of the data by contrasting healthy cells from normal and at-risk groups in a machine-learning model, thereby identifying an LE cancer biomarker that significantly predicts long-term patient survival outcome in two independent validation cohorts with data from 1904 and 1091 individuals.
format	Preprint
id	arxiv_https___arxiv_org_abs_2303_02498
institution	arXiv
publishDate	2023
record_format	arxiv
title	A stochastic network approach to clustering and visualising single-cell genomic count data
topic	Methodology
url	https://arxiv.org/abs/2303.02498

Similar Items