Shunpu Zhang
University of Central Florida, USA
Posters & Accepted Abstracts: J Biom Biostat
Clustering is a common technique used by molecular biologists to group homologous sequences and identify co-expressed genes. Open issues remain, such as how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of a cluster. We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate the certainty of the clusters, and showed how 3D visualization can be used to examine clusters effectively. These methods were applied to the clade assignment of influenza viral hemagglutinin (HA) sequences. For the highly pathogenic avian influenza (HPAI) A (H5N1) HA sequences, nine clusters were obtained using the model-based method, which agrees with previous findings; the certainties for sequences assigned to a cluster were all 1.0, and the certainties for clusters were also very high (0.92-1.0), with an overall clustering certainty of 0.95. For influenza A (H7) HA sequences, 10 HA clusters were assigned, and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99; the certainties for clusters, however, varied from 0.40 to 0.98. We suspect that this variation is attributable to differing degrees of sequence homogeneity within clusters. In both cases, the certainty values estimated using the subset bootstrap method were all higher than those calculated using the standard bootstrap method, suggesting that our bootstrap scheme is more robust for estimating clustering certainty.
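The abstract does not spell out the subset bootstrap scheme or the clustering model, so the following is only a generic sketch of the underlying idea: re-cluster random subsets of the sequences and measure how often members of a reference cluster stay together. It is not the authors' method. Single-linkage clustering on Hamming distance stands in for the model-based method, and every name (`hamming`, `cluster`, `subset_certainty`, `threshold`, `frac`) and every toy sequence is a hypothetical chosen for illustration, not real HA data.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster(seqs, threshold):
    """Stand-in clusterer (assumption): single linkage via union-find.

    Sequences whose Hamming distance is <= threshold are linked; each
    connected component is one cluster. Returns one label per sequence.
    """
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def subset_certainty(seqs, threshold, n_rep=200, frac=0.8, seed=0):
    """Per-cluster certainty from repeated subsampling (illustrative only).

    For each replicate, draw a subset without replacement, re-cluster it,
    and count how often pairs from the same reference cluster are still
    placed together. Certainty = co-clustered pairs / observed pairs.
    """
    rng = random.Random(seed)
    ref = cluster(seqs, threshold)          # clustering of the full data set
    counts = {}                             # ref label -> (hits, total)
    m = max(2, int(frac * len(seqs)))       # subset size
    for _ in range(n_rep):
        idx = rng.sample(range(len(seqs)), m)
        sub = cluster([seqs[i] for i in idx], threshold)
        for (a, ia), (b, ib) in combinations(enumerate(idx), 2):
            if ref[ia] == ref[ib]:
                hits, total = counts.get(ref[ia], (0, 0))
                counts[ref[ia]] = (hits + (sub[a] == sub[b]), total + 1)
    return {label: hits / total for label, (hits, total) in counts.items()}


# Toy data (invented): a tight cluster and a chain-like cluster whose two
# ends are linked only through the middle sequence.
seqs = [
    "CCCCCCCC", "CCCCCCCG", "CCCCCCCT",   # every pair differs by <= 1
    "GGGGGGGG", "GGGGGGGT", "GGGGGGTT",   # ends differ by 2, joined via middle
]
ref_labels = cluster(seqs, threshold=1)
certainty = subset_certainty(seqs, threshold=1)
print("tight cluster:", certainty[ref_labels[0]])
print("chain cluster:", certainty[ref_labels[3]])
```

In this toy setting the tight cluster keeps full certainty, while the chain-like cluster loses certainty whenever a subset omits its bridging sequence. This loosely mirrors the abstract's suggestion that less homogeneous clusters receive lower certainty values.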
Email: shunpu.zhang@ucf.edu