Ken Chen, Hamim Zafar, Yong Wang, Luay Nakhleh and Nicholas Navin
The University of Texas MD Anderson Cancer Center, USA
Posters & Accepted Abstracts: J Biom Biostat
Sequence data produced by recently developed single-cell sequencing (SCS) technologies can resolve the cancer genome at single-cell resolution and characterize genomic alterations that differ from one cell to another. Unresolved issues in cancer research that stem from the admixture signal in bulk-tissue data produced by next-generation sequencing (NGS) methods can potentially be addressed through proper analysis of SCS data. However, errors inherent to SCS data, such as amplification errors and non-uniform coverage, make bioinformatics analysis of such data challenging. Existing SNV calling methods developed for NGS data tend to produce a large number of false-positive calls when applied to SCS data. Here, we present Monovar, a novel statistical method for discovering and genotyping SNVs from SCS data. Monovar accounts for the errors native to SCS data in order to distinguish true variants from sequencing artifacts. As a multi-sample SNV calling method, it leverages data from multiple single cells to compensate for the non-uniform distribution of coverage across cells. The underlying probabilistic model for SNV calling accounts for allelic drop-out, deamination and other amplification errors associated with SCS data. A candidate site is called as an SNV based on the posterior probability that the site is an SNV, calculated using Bayes' rule together with a population genetic prior. The genotyping method also leverages data from the other single cells to quantify the posterior probability of each genotype, which is calculated via a dynamic programming algorithm. We validated the sensitivity and specificity of Monovar using data from normal female skin fibroblast cells, on which it outperformed GATK and samtools, two state-of-the-art SNV calling methods for NGS data. We also applied Monovar to SCS data from triple-negative breast cancer cells. For two different cell lines in this dataset, Monovar dramatically outperformed GATK in precision without reducing detection efficiency. Finally, Monovar also performed better than GATK on three public datasets, indicating that it is a versatile method applicable to SCS data generated using different technologies. A manuscript describing Monovar is currently in press at Nature Methods.
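The multi-cell posterior calculation described above can be illustrated with a minimal Python sketch. This is not Monovar's implementation: the binomial read model, the error rate `err`, the allelic drop-out rate `ado`, the prior parameter `theta` and all function names are assumptions chosen for illustration. The sketch uses a dynamic program over cells to obtain the likelihood of the data for each possible total count of alternate alleles, then applies Bayes' rule with a simple population genetic prior to get the posterior probability that no cell carries a variant.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, err=0.01, ado=0.2):
    """P(reads | g) for g = 0, 1, 2 alternate alleles in one cell.

    Assumed model: binomial sampling of reads, a flat error rate `err`,
    and allelic drop-out with probability `ado` at heterozygous sites.
    """
    n = ref_reads + alt_reads

    def binom(p_alt):
        return comb(n, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads

    hom_ref = binom(err)
    hom_alt = binom(1 - err)
    # With drop-out, a heterozygous cell looks like either homozygote.
    het = (1 - ado) * binom(0.5) + ado * 0.5 * (hom_ref + hom_alt)
    return [hom_ref, het, hom_alt]


def alt_count_likelihoods(cell_liks):
    """P(data | l) for l = 0..2m alternate alleles among m diploid cells.

    Dynamic program over cells: z[l] accumulates the weighted likelihood
    of all genotype assignments whose alternate-allele counts sum to l.
    """
    m = len(cell_liks)
    z = [1.0] + [0.0] * (2 * m)
    for j, lik in enumerate(cell_liks, start=1):
        new = [0.0] * (2 * m + 1)
        for l in range(2 * j + 1):
            for g in (0, 1, 2):
                if 0 <= l - g <= 2 * (j - 1):
                    new[l] += z[l - g] * comb(2, g) * lik[g]
        z = new
    return [z[l] / comb(2 * m, l) for l in range(2 * m + 1)]


def alt_count_prior(m, theta=0.001):
    """Assumed prior: P(l) = theta / l for l >= 1, remainder on l = 0."""
    prior = [0.0] + [theta / l for l in range(1, 2 * m + 1)]
    prior[0] = 1.0 - sum(prior)
    return prior


def prob_no_variant(cells, err=0.01, ado=0.2, theta=0.001):
    """Posterior P(l = 0 | data) for one site, via Bayes' rule.

    `cells` is a list of (ref_reads, alt_reads) pairs, one per cell.
    """
    liks = [genotype_likelihoods(r, a, err, ado) for r, a in cells]
    by_count = alt_count_likelihoods(liks)
    prior = alt_count_prior(len(cells), theta)
    joint = [p * q for p, q in zip(by_count, prior)]
    return joint[0] / sum(joint)


if __name__ == "__main__":
    reference_site = [(20, 0)] * 5
    variant_site = [(10, 10), (12, 8), (20, 0), (0, 15), (18, 1)]
    print(prob_no_variant(reference_site))
    print(prob_no_variant(variant_site))
```

In this sketch a site would be reported as an SNV when `prob_no_variant` falls below a chosen threshold (for example 0.05, itself an assumption). Pooling evidence across cells is what lets a cell with low or missing coverage borrow strength from the others.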
Email: kchen3@mdanderson.org