Biology Sequence Clustering Via Phylogenetic Trees

Authors

  • ANSHUMAN MISHRA

Keywords:

Biology, Sequence, Clustering, Phylogenetic Trees

Abstract

Many bioinformatics tasks require grouping similar sequences together. Sequences cluster because of the evolutionary relationships among them. Yet despite this observation, and despite the natural ways in which a phylogenetic tree can define groups, most sequence clustering tools rely on pairwise sequence distances instead. We contend that, given advances in large-scale phylogenetic inference, tree-based clustering is underused. We describe a family of optimization problems that, for any given tree, return the fewest possible clusters that satisfy specified heterogeneity constraints. We focus on three constraints, which limit (1) the size of each cluster, (2) the total length of its branches, or (3) the length of chains of pairwise distances. These problems can be solved in time that grows linearly with the size of the tree, and for two of the three constraints the algorithms have long been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster and evaluate it on three applications: OTU clustering of microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster produces more internally consistent clusters than competing methods and improves the efficiency of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
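
To make the idea of cutting a tree into the fewest clusters under a heterogeneity limit more concrete, the following is a minimal Python sketch of one such bottom-up pass. It is not TreeCluster's code. It assumes that constraint (1) is read as a bound on cluster diameter, meaning the largest leaf-to-leaf path length inside a cluster, and the function name, the dictionary-based tree encoding, and the example tree are all hypothetical choices made for illustration. The greedy rule visits each node after its children and, whenever the two longest downward paths would together exceed the threshold, cuts the edge above the longer one.

```python
def cluster_by_diameter(children, root, threshold):
    """Greedy bottom-up clustering sketch (illustrative, not TreeCluster itself).

    children:  dict mapping a node to a list of (child, branch_length) pairs;
               leaves are absent from the dict or map to an empty list.
    threshold: assumed limit on the largest leaf-to-leaf distance in a cluster.
    Returns a sorted list of clusters, each a sorted list of leaf names.
    """
    cut = set()  # child nodes whose edge to their parent has been removed

    def farthest_leaf(node):
        # Distance from `node` to the farthest leaf still connected below it.
        kids = children.get(node, [])
        if not kids:
            return 0.0
        reach = sorted(
            ((farthest_leaf(child) + length, child) for child, length in kids),
            key=lambda pair: pair[0],
            reverse=True,
        )
        i = 0
        # While the two longest remaining paths violate the limit,
        # cut the edge leading to the longer one.
        while i + 1 < len(reach) and reach[i][0] + reach[i + 1][0] > threshold:
            cut.add(reach[i][1])
            i += 1
        return reach[i][0]

    farthest_leaf(root)  # recursive; very deep trees would need an iterative version

    # Label the connected components that remain after the cuts.
    clusters = {}
    next_id = 1
    stack = [(root, 0)]
    while stack:
        node, cluster_id = stack.pop()
        kids = children.get(node, [])
        if not kids:
            clusters.setdefault(cluster_id, []).append(node)
        for child, _ in kids:
            if child in cut:
                stack.append((child, next_id))
                next_id += 1
            else:
                stack.append((child, cluster_id))
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical four-leaf tree with one long branch leading to leaf "d".
tree = {
    "root": [("X", 0.1), ("Y", 0.1)],
    "X": [("a", 0.1), ("b", 0.1)],
    "Y": [("c", 0.1), ("d", 0.5)],
}
clusters = cluster_by_diameter(tree, "root", 0.45)
```

With a threshold of 0.45, the sketch separates the long-branch leaf "d" and keeps "a", "b", and "c" together, since their largest pairwise path length is 0.4. On a binary tree each node does constant work, which is consistent with the linear running time mentioned above. The sort used here for nodes with many children is a simplification.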

Published

31-07-2021

How to Cite

Biology Sequence Clustering Via Phylogenetic Trees. (2021). International Journal of Information Technology and Computer Engineering, 9(3), 108-124. https://ijitce.org/index.php/ijitce/article/view/247