Interactive scRNA-seq cell-type classification via clustifyr identifies widespread omission of cell-level annotations in public data repositories

Sidhant Puntambekar1,2, Jay R. Hesselberth1,3, Kent A. Riemondy1*, Rui Fu1*

1 RNA Bioscience Initiative, University of Colorado School of Medicine, Aurora, CO, 80045, USA.
2 University of Colorado, Boulder, CO, 80309, USA.
3 Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, 80045, USA.
* Co-corresponding authors. Contact: firstname.lastname@example.org

Thousands of scRNA-seq datasets have been generated, providing a wealth of biological data on the diversity of cell types across different organisms, developmental stages, and disease states. However, many of these datasets are challenging to reuse because reporting guidelines for single-cell data are vague. In an effort to curate large-scale atlases of cell types from public single-cell datasets, we programmatically examined records in the Gene Expression Omnibus (GEO) and ArrayExpress, and estimate that only a minority (< 25%) of studies provide sufficient information to enable direct reuse of their data in further studies without retracing and reproducing the analyses from the original publications. This problem is common across journals, data repositories, and publication dates. The lack of appropriate cell-level metadata not only hinders exploration and knowledge transfer of reported data, but also makes reproducing the original study prohibitively difficult and/or time-consuming. To facilitate exploration of public single-cell datasets, we developed an interactive Shiny app version of clustifyr, a Bioconductor package for single-cell cluster identity classification. The clustifyr app can be accessed online at https://raysinensis.shinyapps.io/clustifyr-web-app/ or launched locally in R. It allows quick examination of the processed data present in a GEO record and comparison of reference single-cell datasets to user-provided query data. Our tool seeks to alleviate the challenges of selection, reanalysis, and reuse of published single-cell datasets.