Multicohort Analysis of Bronchial Epithelial Cell Gene Expression Classifies Asthma from Healthy
Authors: Ian Lee1,2,3, Ananthkrishnan Ganesan1,2, Purvesh Khatri1,2,*
1Institute for Immunity, Transplantation and Infection, School of Medicine, Stanford University, CA 94305
2 Center for Biomedical Informatics Research, Department of Medicine, School of Medicine, Stanford University, CA 94305
3 Stanford Pediatric Pulmonary Medicine, Stanford University, CA 94303
Asthma is a heterogeneous disease with variable clinical manifestations including wheezing, shortness of breath, cough, and airflow limitation varying over time. Previous transcriptome studies of airway epithelial cells in asthma compared to controls have identified hundreds of differentially expressed genes and characterized “T-helper cell 2 (Th2)-high” and “Th2-low” endotypes of asthma inflammation.
We hypothesized that integrating gene expression profiles of bronchial epithelial cells from patients with asthma across multiple studies would identify a robust gene signature that represents real-world biological and clinical heterogeneity in patients with asthma. We identified six data sets containing 486 whole transcriptome profiles of bronchial epithelial cells (BECs) from healthy controls (HCs) and patients with asthma of varying severity from at least four clinical centers in two countries. We arbitrarily chose four data sets comprising 223 samples (HC=97, mild/moderate asthma=82, severe asthma=44) as the discovery data sets, and the remaining two, comprising 263 samples (HC=47, mild/moderate asthma=122, severe asthma=97), as the validation data sets. We calculated a Hedges’ g effect size (ES) for each gene in each data set. We applied leave-one-study-out analysis to limit the influence of any single data set, and used random effects inverse variance-based meta-analysis to integrate the ES for each gene across the discovery data sets into a summary ES.
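The per-gene effect size and its pooling across data sets can be illustrated with a minimal sketch. The abstract does not name the between-study variance estimator, so the DerSimonian-Laird estimator used here is an assumption, as are the function names `hedges_g` and `random_effects_summary`. The leave-one-study-out step is omitted.

```python
import math
from statistics import mean, variance


def hedges_g(cases, controls):
    """Return (g, var_g) for one gene in one data set.

    g is the standardized mean difference (cases minus controls)
    multiplied by the small-sample correction factor J.
    """
    n1, n2 = len(cases), len(controls)
    # Pooled standard deviation from the two sample variances.
    pooled_sd = math.sqrt(
        ((n1 - 1) * variance(cases) + (n2 - 1) * variance(controls))
        / (n1 + n2 - 2)
    )
    d = (mean(cases) - mean(controls)) / pooled_sd
    # Small-sample correction: J = 1 - 3 / (4 * df - 1), df = n1 + n2 - 2.
    j = 1 - 3 / (4 * (n1 + n2) - 9)
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return j * d, j ** 2 * var_d


def random_effects_summary(effects, variances):
    """Pool per-study effect sizes with inverse-variance weights.

    Assumption: between-study variance tau^2 is estimated with the
    DerSimonian-Laird method. Returns (summary_es, standard_error, tau2).
    """
    weights = [1 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * g for w, g in zip(weights, effects)) / sum_w
    # Cochran's Q measures dispersion of study effects around the fixed estimate.
    q = sum(w * (g - fixed) ** 2 for w, g in zip(weights, effects))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    re_weights = [1 / (v + tau2) for v in variances]
    summary = sum(w * g for w, g in zip(re_weights, effects)) / sum(re_weights)
    return summary, math.sqrt(1 / sum(re_weights)), tau2
```

In use, `hedges_g` would be called once per gene per discovery data set, and the resulting effect sizes and variances for a gene passed together to `random_effects_summary`.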
Using stringent selection criteria (FDR < 1%, absolute effect size > 0.6), we found 10 genes that were significantly differentially expressed between patients with asthma and healthy controls, with no between-data-set heterogeneity: 5 over-expressed (POSTN, SERPINB2, TPRXL, CLCA1, CEACAM5) and 5 under-expressed (ACKR3, SCGB3A1, GMNN, CYP2A13, CNTD1).
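The selection thresholds above can be sketched as a simple filter over per-gene summary statistics. The abstract does not state how the FDR was computed, so the Benjamini-Hochberg adjustment below is an assumption. The function names and the input layout (gene mapped to a summary ES and a p-value) are hypothetical, and the heterogeneity filter is not shown.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_genes(summary, fdr_cutoff=0.01, es_cutoff=0.6):
    """Split genes passing both thresholds into over- and under-expressed.

    `summary` maps gene name -> (summary effect size, p-value).
    """
    genes = list(summary)
    fdr = benjamini_hochberg([summary[g][1] for g in genes])
    over, under = [], []
    for gene, q in zip(genes, fdr):
        es = summary[gene][0]
        if q < fdr_cutoff and abs(es) > es_cutoff:
            (over if es > 0 else under).append(gene)
    return over, under
```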
We defined the asthma score of a sample as the difference between the geometric mean of the expression of the over-expressed genes and that of the under-expressed genes. This simple classifier distinguished patients with asthma from healthy controls with an average area under the receiver operating characteristic curve (AUROC) of 0.87 (range: 0.79-0.91). Next, we validated this signature in the two independent data sets, where the asthma score distinguished patients with asthma from healthy controls with AUROCs of 0.79 and 0.81. Across all discovery and validation cohorts, the asthma score increased with severity of asthma and was significantly positively correlated with it (Jonckheere-Terpstra trend test p-value < 0.05). We also observed a bimodal distribution of the asthma score in some data sets, which suggests the existence of asthma endotypes that are not well classified by this gene signature and should be investigated further.
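The score definition and its evaluation can be sketched as follows. The gene lists come from the text above. The function names and the input layout (a per-sample mapping of gene to expression value) are hypothetical, and the sketch assumes strictly positive expression values, which the geometric mean requires. The AUROC is computed by pairwise comparison, which is equivalent to the normalized Mann-Whitney U statistic.

```python
from statistics import geometric_mean

OVER_EXPRESSED = ["POSTN", "SERPINB2", "TPRXL", "CLCA1", "CEACAM5"]
UNDER_EXPRESSED = ["ACKR3", "SCGB3A1", "GMNN", "CYP2A13", "CNTD1"]


def asthma_score(expression):
    """Geometric mean of over-expressed genes minus that of under-expressed genes.

    `expression` maps gene name -> expression value for one sample.
    Assumption: all values are strictly positive.
    """
    up = geometric_mean([expression[g] for g in OVER_EXPRESSED])
    down = geometric_mean([expression[g] for g in UNDER_EXPRESSED])
    return up - down


def auroc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control.

    Ties count as half a win.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

Scoring every sample in a data set with `asthma_score` and passing the asthma and healthy control scores to `auroc` gives one AUROC per data set.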
Our analysis identified a parsimonious gene set that distinguishes patients with asthma from a heterogeneous group of controls, including individuals with allergic rhinitis and former smokers, with high accuracy, and that is positively correlated with severity of asthma. This suggests that a shared pathway may exist despite the heterogeneity observed in patients with asthma.