Mutual Nearest Neighbours Merging of scRNA-Seq Data Sets
Laleh Haghverdi1, Michael Morgan2, Aaron Lun3, John Marioni1,3*
1EMBL-European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK; 2Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK; 3Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge CB2 0RE, UK
The Human Cell Atlas and similar projects propose to apply single-cell RNA sequencing to study the transcriptomes of millions of cells. Inevitably, such projects will generate data in multiple laboratories, each of which might use a different experimental technology as well as distinct cell dissociation and handling protocols. A key challenge, therefore, is to combine these different data sets robustly. The ability to merge measurements from different laboratories could avoid repeating experiments across multiple sites and enhance the ability to compare and contrast a wide variety of cell types. However, merging is confounded by unwanted variation in the noisy high-throughput single-cell measurements, which can differ from day to day, laboratory to laboratory, individual to individual and platform to platform. Disentangling the signal of interest (when it is not known a priori) from the unwanted variation is generally difficult for such high-dimensional and noisy data; nonetheless, it is tractable in a few specific scenarios. We study the scenario in which each data set shares one or more subpopulations of cells with at least one other data set. We propose a new method that identifies the shared subpopulations by searching for cells that are mutual nearest neighbours. We then use those shared subpopulations to learn and correct the unwanted variation among the data sets.
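To make the idea of a mutual nearest neighbours search concrete, the following is a minimal sketch in Python, not the implementation used in this work. It assumes two data sets given as NumPy arrays of cells by features, Euclidean distance, and a brute-force search. The function names `mutual_nearest_neighbours` and `shift_to_reference`, the parameter `k`, and the use of a single global shift vector for the correction are all illustrative assumptions.

```python
import numpy as np


def mutual_nearest_neighbours(a, b, k=1):
    """Return sorted (i, j) pairs where a[i] and b[j] are each among the
    other's k nearest neighbours (brute-force, Euclidean distance)."""
    # Pairwise squared distances, shape (len(a), len(b)).
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each cell in a: k nearest cells in b
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each cell in b: k nearest cells in a
    in_ab = np.zeros(d.shape, dtype=bool)
    in_ab[np.arange(len(a))[:, None], nn_ab] = True
    in_ba = np.zeros(d.shape, dtype=bool)
    in_ba[nn_ba, np.arange(len(b))[:, None]] = True
    # Keep only the pairs that are neighbours in both directions.
    return [(int(i), int(j)) for i, j in np.argwhere(in_ab & in_ba)]


def shift_to_reference(a, b, pairs):
    """Assumption for illustration: estimate the unwanted variation as one
    global shift, the mean of a[i] - b[j] over the pairs, and add it to b."""
    i, j = zip(*pairs)
    shift = (a[list(i)] - b[list(j)]).mean(axis=0)
    return b + shift
```

In this sketch, a cell in one data set that belongs to a subpopulation absent from the other still has a nearest neighbour there, but that neighbour's own nearest neighbour will typically lie elsewhere, so the pair is not mutual and is excluded from the estimate of the shift.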