Don’t fear the zeros: Identifying transcriptional states and cellular populations in sparse single-cell RNA-seq data with Bayesian hierarchical modeling
Sean Corbett1, Zichun Liu1, Tianwen Huan1, Iris Yang1, Grant Duclos1, Jennifer Beane1, W. Evan Johnson1, Paola Sebastiani1, Masanao Yajima1, Joshua D. Campbell1,2
1Boston University, Boston, MA, USA; 2Broad Institute of MIT and Harvard, Cambridge, MA, USA
Background: Biology is divided into hierarchies: Complex tissues are composed of different cellular populations; each cell from each subpopulation contains a unique mixture of transcriptional states; and each transcriptional state is composed of groups of co-expressed genes. Single-cell RNA-seq can be used to explore these hierarchies by identifying all cellular populations within a sample and determining the unique combination of transcriptional states that define each subpopulation. However, single-cell RNA-seq data is noisy and contains many zeros due to the challenges inherent in amplifying small amounts of RNA.
Methods: With these goals and challenges in mind, we implemented Bayesian hierarchical models that reflect the hierarchies observed in biological systems. These models can be used to cluster co-expressed genes into transcriptional states, to cluster cells into subpopulations, and to quantify the proportion of each cellular subpopulation within independent samples. Importantly, these models can handle sparse count-based data without additional normalization.
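As an illustration only, the hierarchy described above (populations, then transcriptional states, then genes) can be sketched as a toy generative process for sparse counts. This is not the authors' model: the Dirichlet priors, the multinomial sampling, the hard assignment of each gene to a single state, and all names (`simulate_counts`, `dirichlet`, `depth`) are assumptions made for this example.

```python
import random


def dirichlet(rng, alpha, n):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension n."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_counts(n_cells=30, n_genes=12, n_pops=3, n_states=4,
                    depth=50, seed=0):
    """Simulate a cell-by-gene count matrix from a toy three-level hierarchy.

    Assumes n_genes >= n_states so that every state owns at least one gene.
    """
    rng = random.Random(seed)

    # Level 3: each gene belongs to exactly one transcriptional state.
    gene_state = [g % n_states for g in range(n_genes)]

    # Relative weight of each gene within its own state (sums to 1 per state).
    within_state = [0.0] * n_genes
    for s in range(n_states):
        members = [g for g in range(n_genes) if gene_state[g] == s]
        for g, w in zip(members, dirichlet(rng, 1.0, len(members))):
            within_state[g] = w

    # Level 2: each population has its own mixture over transcriptional states.
    pop_state_probs = [dirichlet(rng, 0.5, n_states) for _ in range(n_pops)]

    counts, labels = [], []
    for _ in range(n_cells):
        # Level 1: assign the cell to a population.
        k = rng.randrange(n_pops)
        # Gene probabilities = state weight in this population * gene weight in state.
        probs = [pop_state_probs[k][gene_state[g]] * within_state[g]
                 for g in range(n_genes)]
        # Draw `depth` transcripts; low depth leaves many genes at zero.
        row = [0] * n_genes
        for g in rng.choices(range(n_genes), weights=probs, k=depth):
            row[g] += 1
        counts.append(row)
        labels.append(k)
    return counts, labels, gene_state
```

Calling `simulate_counts(depth=20)` returns raw integer counts together with the hidden population label of each cell and the state of each gene. An inference procedure of the kind described above would aim to recover those hidden labels directly from the counts, without a separate normalization step.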
Results: We applied these models to single-cell RNA-seq data generated from airway epithelial cells from smokers and non-smokers. Cell-type-specific gene-expression alterations were induced in the airway epithelium of smokers. Additionally, a novel subpopulation was discovered in the airway of smokers that did not express known cell-type markers and likely represents cells transitioning from a progenitor state to a secretory state.
Conclusions: Overall, these models represent novel approaches to characterizing cellular and transcriptional heterogeneity in biological samples using single-cell RNA-seq data.