scPipe: a pipeline for single cell RNA-seq data processing
Luyi Tian1, Shian Su1, Shalin Naik1, Matt Ritchie1*
1Molecular Medicine Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
Single-cell RNA sequencing (scRNA-seq) is increasingly used to profile the transcriptomes of many single cells and provides biological resolution that cannot be achieved with regular bulk RNA-seq. Many protocols have recently been developed that incorporate a cellular barcode and a molecular barcode in the poly(dT) primer to increase the number of cells that can be assessed and to lower the cost. The cellular barcode can be designed or random and is used to label individual cells in the pooled library. The molecular barcode, also called a unique molecular identifier (UMI), is a random sequence that can be used to remove PCR duplicates. Apart from barcoding, mRNA spike-in controls are frequently used for normalization and quality control, which also adds to the complexity of the data.
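To illustrate how a UMI removes PCR duplicates, the following is a minimal Python sketch and not the scPipe implementation. It assumes each read has already been assigned a cellular barcode, a gene and a UMI, and it treats reads that share all three as copies of one original molecule. The function name `count_umis` and the toy reads are hypothetical.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs for every (cell barcode, gene) pair.

    `reads` is an iterable of (cell_barcode, gene, umi) tuples. Reads with
    the same cell barcode, gene and UMI are assumed to be PCR duplicates,
    so they contribute a single count.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Hypothetical toy input: five reads from two cells.
reads = [
    ("ACGT", "GeneA", "TTAGC"),
    ("ACGT", "GeneA", "TTAGC"),  # same cell, gene and UMI: PCR duplicate
    ("ACGT", "GeneA", "GGCAT"),
    ("TGCA", "GeneA", "TTAGC"),  # same UMI, but a different cell
    ("TGCA", "GeneB", "CCATG"),
]
counts = count_umis(reads)
```

This sketch compares UMIs by exact match only; a sequencing error within a UMI would be counted as an extra molecule unless similar UMIs are merged first.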
As scRNA-seq data increase in complexity, new computational challenges arise related to data processing, dealing with technical noise and methods of analysis. Here we present a computational workflow called scPipe. scPipe is written in R and C++ to be both user-friendly and fast. It aims to bridge the gap between the raw FASTQ files obtained from the sequencer and a summarized matrix of counts that has undergone quality control. It includes five key aspects: i) data pre-processing, including de-multiplexing of molecular and cellular barcodes and error correction; ii) a novel mapping strategy specifically designed for 3' end protocols, leading to an informative gene count matrix for individual cells. During data pre-processing, scPipe stores detailed summary statistics that are useful for QC, including mapping rates, mRNA capture efficiency and gene counts; iii) novel QC methods in the scPipe R package that utilize the spike-in information and UMI deduplication results, together with multiple normalization methods that cover most of the popular approaches designed for scRNA-seq data; iv) downstream analysis for data exploration and sub-population identification using a novel approach for data filtering and high-dimensional analysis; v) a user-friendly app designed for better visualization and interpretation of the high-dimensional data. We applied scPipe to our own data and to public scRNA-seq data generated by different protocols. We show that scPipe can provide accurate results that are comparable to published results.
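As a rough illustration of barcode de-multiplexing with error correction, the Python sketch below assigns an observed cellular barcode to a list of known (designed) barcodes and tolerates a limited number of mismatches. It is an assumption-laden example, not the scPipe algorithm: the Hamming-distance rule, the function names `hamming` and `assign_barcode`, the `max_mismatch` parameter and the example barcodes are all hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, known_barcodes, max_mismatch=1):
    """Return the known barcode matching `observed`, or None.

    An exact match is returned directly. Otherwise the observed barcode is
    corrected only if exactly one known barcode lies within `max_mismatch`
    substitutions; ambiguous or distant barcodes are left unassigned.
    """
    if observed in known_barcodes:
        return observed
    candidates = [
        bc
        for bc in known_barcodes
        if len(bc) == len(observed) and hamming(observed, bc) <= max_mismatch
    ]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical designed barcodes for three cells.
known = ["AAAA", "CCCC", "AATT"]
exact = assign_barcode("AAAA", known)      # exact match
corrected = assign_barcode("AAAC", known)  # one mismatch from "AAAA" only
ambiguous = assign_barcode("AAAT", known)  # one mismatch from two barcodes
unmatched = assign_barcode("GGGG", known)  # too far from every barcode
```

Rejecting ambiguous matches is a conservative design choice: it discards some reads rather than risk assigning them to the wrong cell.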
In summary, scPipe is an easy-to-use automated pipeline that aims to minimize the effort needed to generate biologically meaningful results from scRNA-seq barcoding protocols. The pipeline handles the complex barcodes in scRNA-seq data, counts UMIs and corrects possible sequencing errors. It contains a standard downstream workflow that generates quality control statistics and removes low-quality cells, followed by normalization and clustering using third-party tools. The data object in scPipe can also be easily converted to other formats such as the SCESet used by Scater and Scran.