skip to main content
10.1145/2649387.2649402acmconferencesArticle/Chapter ViewAbstractPublication PagesbcbConference Proceedingsconference-collections
research-article

Fast dendrogram-based OTU clustering using sequence embedding

Published: 20 September 2014 Publication History

Abstract

Biodiversity assessment is an important step in a metagenomic processing pipeline. The biodiversity of a microbial metagenome is often estimated by grouping its 16S rRNA reads into operational taxonomic units or OTUs. These metagenomic datasets are typically large and hence require effective yet accurate computational methods for processing.
In this paper, we introduce a new hierarchical clustering method called CRiSPy-Embed which aims to produce high-quality clustering results at a low computational cost. We tackle two computational issues of the current OTU hierarchical clustering approach: (1) the compute-intensive sequence alignment operation for building the distance matrix and (2) the quadratic memory requirement of the clustering procedure.
Our performance evaluation shows that CRiSPy-Embed achieves higher efficiency in terms of both runtime and memory consumption in comparison to existing dendrogram-based approaches. Furthermore, to obtain the final OTU grouping, CRiSPy-Embed dynamically determines a natural cutoff of the dendrogram. With this strategy, CRiSPy-Embed achieves better and more robust clustering outcomes compared to other notable OTU clustering pipelines.

References

[1]
Ronaghi, M. Uhlén, and P. Nyrén. "A sequencing method based on real-time pyrophosphate". Science, volume 281, no. 5375, 1998.
[2]
M. L. Sogin, H. G. Morrison, J. A. Huber, et al. "Microbial diversity in the deep sea and the underexplored 'rare biosphere'". Proceedings of the National Academy of Sciences, volume 103, no. 32, pp. 12115--12120, 2006.
[3]
P. J. Turnbaugh, M. Hamady, T. Yatsunenko, et al. "A core gut microbiome in obese and lean twins". Nature, volume 457, no. 7228, pp. 480--4, 2009.
[4]
P. J. Turnbaugh, V. K. Ridaura, J. J. Faith, et al. "The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice". Science Translational Medicine, volume 1, no. 6, p. 6ra14, 2009.
[5]
P. D. Schloss, S. L. Westcott, T. Ryabin, et al. "Introducing mothur: Open-Source Platform-Independent Community Supported Software for Describing and Comparing Microbial Communities". Applied and Environmental Microbiology, volume 75, no. 23, pp. 7537--7541, 2009.
[6]
Y. Sun, Y. Cai, L. Liu, et al. "ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences." Nucleic Acids Research, volume 37, no. 10, p. e76, 2009.
[7]
Y. Cai and Y. Sun. "ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time". Nucleic Acids Research, volume 39, no. 14, p. e95, 2011.
[8]
J. G. Caporaso, J. Kuczynski, J. Stombaugh, et al. "QIIME allows analysis of high-throughput community sequencing data." Nature Methods, volume 7, no. 5, pp. 335--6, 2010.
[9]
R. C. Edgar. "Search and clustering orders of magnitude faster than BLAST". Bioinformatics, volume 26, no. 19, pp. 2460--2461, 2010.
[10]
W. Li, L. Fu, B. Niu, et al. "Ultrafast clustering algorithms for metagenomic sequence analysis". Briefings in Bioinformatics, 2012.
[11]
Z. Zheng, S. Kramer, and B. Schmidt. "DySC: Software for Greedy Clustering of 16S rRNA Reads". Bioinformatics, 2012.
[12]
R. C. Edgar. "UPARSE: highly accurate OTU sequences from microbial amplicon reads". Nature Methods, volume 10, no. 10, pp. 996--8, 2013.
[13]
Y. Sun, Y. Cai, S. M. Huse, et al. "A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis". Briefings in Bioinformatics, 2011.
[14]
S. B. Needleman and C. D. Wunsch. "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology, volume 48, no. 3, pp. 443--453, 1970.
[15]
S. M. Huse, D. M. Welch, H. G. Morrison, et al. "Ironing out the wrinkles in the rare biosphere through improved OTU clustering". Environmental Microbiology, volume 12, no. 7, pp. 1889--1898, 2010.
[16]
Y. Loewenstein, E. Portugaly, M. Fromer, et al. "Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space". Bioinformatics, volume 24, no. 13, pp. i41--9, 2008.
[17]
G. Blackshields, F. Sievers, W. Shi, et al. "Sequence embedding for fast construction of guide trees for multiple sequence alignment". Algorithms for Molecular Biology, volume 5, p. 21, 2010.
[18]
F. Sievers, A. Wilm, D. Dineen, et al. "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega". Molecular Systems Biology, volume 7, p. 539, 2011.
[19]
M. J. Bonder, S. Abeln, E. Zaura, et al. "Comparing clustering and pre-processing in taxonomy analysis". Bioinformatics, volume 28, no. 22, pp. 2891--2897, 2012.
[20]
C. Quince, A. Lanzen, R. J. Davenport, et al. "Removing noise from pyrosequenced amplicons". BMC Bioinformatics, volume 12, no. 1, p. 38, 2011.
[21]
R. C. Edgar, B. J. Haas, J. C. Clemente, et al. "UCHIME improves sensitivity and speed of chimera detection". Bioinformatics, volume 27, no. 16, 2011.
[22]
N. Linial, E. London, and Y. Rabinovich. "The Geometry of Graphs and Some of Its Algorithmic Applications". Combinatorica, volume 15, pp. 577--591, 1994.
[23]
N. Bell and J. Hoberock. Thrust: A Productivity-Oriented Library for CUDA, pp. 359--371. Morgan Kaufmann Pub, jade edition, 2011.
[24]
D. Knuth. "External Sorting". D. Knuth, editor, The Art Of Computer Programming, Volume 3: Sorting and Searching, chapter 5, pp. 248--379. Addison-Wesley, 2nd edition, 1998.
[25]
"Agglomerative hierarchical cluster tree - MATLAB linkage".
[26]
D. Müllner. "fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python". Journal of Statistical Software, volume 53, no. 9, pp. 1--18, 2011.
[27]
X. Wang, J. Yao, Y. Sun, et al. "M-pick, a modularity-based method for OTU picking of 16S rRNA sequences". BMC Bioinformatics, volume 14, p. 43, 2013.
[28]
R. Killick, P. Fearnhead, and I. A. Eckley. "Optimal Detection of Changepoints With a Linear Computational Cost". Journal of the American Statistical Association, volume 107, no. 500, pp. 1590--1598, 2012.
[29]
N. Nethercote and J. Seward. "Valgrind". ACM SIGPLAN Notices, volume 42, no. 6, p. 89, 2007.
[30]
F. E. Angly, D. Willner, F. Rohwer, et al. "Grinder: a versatile amplicon and shotgun sequence simulator". Nucleic Acids Research, volume 40, no. 12, p. e94, 2012.
[31]
T. Z. DeSantis, P. Hugenholtz, N. Larsen, et al. "Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB". Applied and Environmental Microbiology, volume 72, no. 7, pp. 5069--72, 2006.
[32]
M. N. Ashby, J. Rine, E. F. Mongodin, et al. "Serial analysis of rRNA genes and the unexpected dominance of rare members of microbial communities". Applied and Environmental Microbiology, volume 73, no. 14, pp. 4532--42, 2007.
[33]
S. Balzer, K. Malde, A. Lanzén, et al. "Characteristics of 454 pyrosequencing data--enabling realistic simulation with flowsim". Bioinformatics, volume 26, no. 18, pp. i420--5, 2010.
[34]
A. N. Albatineh, M. Niewiadomska-Bugaj, and D. Mihalko. "On Similarity Indices and Correction for Chance Agreement". Journal of Classification, volume 23, no. 2, pp. 301--313, 2006.
[35]
N. X. Vinh, J. Epps, and J. Bailey. "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance". The Journal of Machine Learning Research, volume 11, pp. 2837--2854, 2010.
[36]
W. M. Rand. "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association, volume 66, no. 336, pp. 846--850, 1971.

Cited By

View all
  • (2017)CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read AlignmentAlgorithms for Next-Generation Sequencing Data10.1007/978-3-319-59826-0_10(203-233)Online publication date: 19-Sep-2017
  • (2015)Efficient agglomerative hierarchical clustering for biological sequence analysisTENCON 2015 - 2015 IEEE Region 10 Conference10.1109/TENCON.2015.7373194(1-3)Online publication date: Nov-2015
  • (2015)Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence AnalysisProceedings of the 2015 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2015.90(310-319)Online publication date: 14-Nov-2015

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2014
851 pages
ISBN:9781450328944
DOI:10.1145/2649387
  • General Chairs:
  • Pierre Baldi,
  • Wei Wang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 September 2014

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. hierarchical clustering
  2. sequence similarity measure

Qualifiers

  • Research-article

Conference

BCB '14
Sponsor:
BCB '14: ACM-BCB '14
September 20 - 23, 2014
California, Newport Beach

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2
  • Downloads (Last 6 weeks)0
Reflects downloads up to 07 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2017)CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read AlignmentAlgorithms for Next-Generation Sequencing Data10.1007/978-3-319-59826-0_10(203-233)Online publication date: 19-Sep-2017
  • (2015)Efficient agglomerative hierarchical clustering for biological sequence analysisTENCON 2015 - 2015 IEEE Region 10 Conference10.1109/TENCON.2015.7373194(1-3)Online publication date: Nov-2015
  • (2015)Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence AnalysisProceedings of the 2015 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2015.90(310-319)Online publication date: 14-Nov-2015

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media