research-article

Fast dendrogram-based OTU clustering using sequence embedding

Authors:

Thuy-Diem Nguyen,

Bertil Schmidt,

Chee-Keong KwohAuthors Info & Claims

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

Pages 63 - 72

https://doi.org/10.1145/2649387.2649402

Published: 20 September 2014 Publication History

Abstract

Biodiversity assessment is an important step in a metagenomic processing pipeline. The biodiversity of a microbial metagenome is often estimated by grouping its 16S rRNA reads into operational taxonomic units or OTUs. These metagenomic datasets are typically large and hence require effective yet accurate computational methods for processing.

In this paper, we introduce a new hierarchical clustering method called CRiSPy-Embed which aims to produce high-quality clustering results at a low computational cost. We tackle two computational issues of the current OTU hierarchical clustering approach: (1) the compute-intensive sequence alignment operation for building the distance matrix and (2) the quadratic memory requirement of the clustering procedure.

Our performance evaluation shows that CRiSPy-Embed achieves higher efficiency in terms of both runtime and memory consumption in comparison to existing dendrogram-based approaches. Furthermore, to obtain the final OTU grouping, CRiSPy-Embed dynamically determines a natural cutoff of the dendrogram. With this strategy, CRiSPy-Embed achieves better and more robust clustering outcomes compared to other notable OTU clustering pipelines.

References

[1]

Ronaghi, M. Uhlén, and P. Nyrén. "A sequencing method based on real-time pyrophosphate". Science, volume 281, no. 5375, 1998.

[2]

M. L. Sogin, H. G. Morrison, J. A. Huber, et al. "Microbial diversity in the deep sea and the underexplored 'rare biosphere'". Proceedings of the National Academy of Sciences, volume 103, no. 32, pp. 12115--12120, 2006.

[3]

P. J. Turnbaugh, M. Hamady, T. Yatsunenko, et al. "A core gut microbiome in obese and lean twins". Nature, volume 457, no. 7228, pp. 480--4, 2009.

[4]

P. J. Turnbaugh, V. K. Ridaura, J. J. Faith, et al. "The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice". Science Translational Medicine, volume 1, no. 6, p. 6ra14, 2009.

[5]

P. D. Schloss, S. L. Westcott, T. Ryabin, et al. "Introducing mothur: Open-Source Platform-Independent Community Supported Software for Describing and Comparing Microbial Communities". Applied and Environmental Microbiology, volume 75, no. 23, pp. 7537--7541, 2009.

[6]

Y. Sun, Y. Cai, L. Liu, et al. "ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences." Nucleic Acids Research, volume 37, no. 10, p. e76, 2009.

[7]

Y. Cai and Y. Sun. "ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time". Nucleic Acids Research, volume 39, no. 14, p. e95, 2011.

[8]

J. G. Caporaso, J. Kuczynski, J. Stombaugh, et al. "QIIME allows analysis of high-throughput community sequencing data." Nature Methods, volume 7, no. 5, pp. 335--6, 2010.

[9]

R. C. Edgar. "Search and clustering orders of magnitude faster than BLAST". Bioinformatics, volume 26, no. 19, pp. 2460--2461, 2010.

Digital Library

[10]

W. Li, L. Fu, B. Niu, et al. "Ultrafast clustering algorithms for metagenomic sequence analysis". Briefings in Bioinformatics, 2012.

[11]

Z. Zheng, S. Kramer, and B. Schmidt. "DySC: Software for Greedy Clustering of 16S rRNA Reads". Bioinformatics, 2012.

Digital Library

[12]

R. C. Edgar. "UPARSE: highly accurate OTU sequences from microbial amplicon reads". Nature Methods, volume 10, no. 10, pp. 996--8, 2013.

[13]

Y. Sun, Y. Cai, S. M. Huse, et al. "A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis". Briefings in Bioinformatics, 2011.

[14]

S. B. Needleman and C. D. Wunsch. "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology, volume 48, no. 3, pp. 443--453, 1970.

[15]

S. M. Huse, D. M. Welch, H. G. Morrison, et al. "Ironing out the wrinkles in the rare biosphere through improved OTU clustering". Environmental Microbiology, volume 12, no. 7, pp. 1889--1898, 2010.

[16]

Y. Loewenstein, E. Portugaly, M. Fromer, et al. "Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space". Bioinformatics, volume 24, no. 13, pp. i41--9, 2008.

Digital Library

[17]

G. Blackshields, F. Sievers, W. Shi, et al. "Sequence embedding for fast construction of guide trees for multiple sequence alignment". Algorithms for Molecular Biology, volume 5, p. 21, 2010.

[18]

F. Sievers, A. Wilm, D. Dineen, et al. "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega". Molecular Systems Biology, volume 7, p. 539, 2011.

[19]

M. J. Bonder, S. Abeln, E. Zaura, et al. "Comparing clustering and pre-processing in taxonomy analysis". Bioinformatics, volume 28, no. 22, pp. 2891--2897, 2012.

Digital Library

[20]

C. Quince, A. Lanzen, R. J. Davenport, et al. "Removing noise from pyrosequenced amplicons". BMC Bioinformatics, volume 12, no. 1, p. 38, 2011.

[21]

R. C. Edgar, B. J. Haas, J. C. Clemente, et al. "UCHIME improves sensitivity and speed of chimera detection". Bioinformatics, volume 27, no. 16, 2011.

Digital Library

[22]

N. Linial, E. London, and Y. Rabinovich. "The Geometry of Graphs and Some of Its Algorithmic Applications". Combinatorica, volume 15, pp. 577--591, 1994.

[23]

N. Bell and J. Hoberock. Thrust: A Productivity-Oriented Library for CUDA, pp. 359--371. Morgan Kaufmann Pub, jade edition, 2011.

[24]

D. Knuth. "External Sorting". D. Knuth, editor, The Art Of Computer Programming, Volume 3: Sorting and Searching, chapter 5, pp. 248--379. Addison-Wesley, 2nd edition, 1998.

[25]

"Agglomerative hierarchical cluster tree - MATLAB linkage".

[26]

D. Müllner. "fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python". Journal of Statistical Software, volume 53, no. 9, pp. 1--18, 2011.

[27]

X. Wang, J. Yao, Y. Sun, et al. "M-pick, a modularity-based method for OTU picking of 16S rRNA sequences". BMC Bioinformatics, volume 14, p. 43, 2013.

[28]

R. Killick, P. Fearnhead, and I. A. Eckley. "Optimal Detection of Changepoints With a Linear Computational Cost". Journal of the American Statistical Association, volume 107, no. 500, pp. 1590--1598, 2012.

[29]

N. Nethercote and J. Seward. "Valgrind". ACM SIGPLAN Notices, volume 42, no. 6, p. 89, 2007.

Digital Library

[30]

F. E. Angly, D. Willner, F. Rohwer, et al. "Grinder: a versatile amplicon and shotgun sequence simulator". Nucleic Acids Research, volume 40, no. 12, p. e94, 2012.

[31]

T. Z. DeSantis, P. Hugenholtz, N. Larsen, et al. "Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB". Applied and Environmental Microbiology, volume 72, no. 7, pp. 5069--72, 2006.

[32]

M. N. Ashby, J. Rine, E. F. Mongodin, et al. "Serial analysis of rRNA genes and the unexpected dominance of rare members of microbial communities". Applied and Environmental Microbiology, volume 73, no. 14, pp. 4532--42, 2007.

[33]

S. Balzer, K. Malde, A. Lanzén, et al. "Characteristics of 454 pyrosequencing data--enabling realistic simulation with flowsim". Bioinformatics, volume 26, no. 18, pp. i420--5, 2010.

Digital Library

[34]

A. N. Albatineh, M. Niewiadomska-Bugaj, and D. Mihalko. "On Similarity Indices and Correction for Chance Agreement". Journal of Classification, volume 23, no. 2, pp. 301--313, 2006.

[35]

N. X. Vinh, J. Epps, and J. Bailey. "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance". The Journal of Machine Learning Research, volume 11, pp. 2837--2854, 2010.

Digital Library

[36]

W. M. Rand. "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association, volume 66, no. 336, pp. 846--850, 1971.

Cited By

Liu YSchmidt B(2017)CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read AlignmentAlgorithms for Next-Generation Sequencing Data10.1007/978-3-319-59826-0_10(203-233)Online publication date: 19-Sep-2017
https://doi.org/10.1007/978-3-319-59826-0_10
Nguyen TKwoh C(2015)Efficient agglomerative hierarchical clustering for biological sequence analysisTENCON 2015 - 2015 IEEE Region 10 Conference10.1109/TENCON.2015.7373194(1-3)Online publication date: Nov-2015
https://doi.org/10.1109/TENCON.2015.7373194
Mao QZheng WWang LCai YMai VSun Y(2015)Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence AnalysisProceedings of the 2015 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2015.90(310-319)Online publication date: 14-Nov-2015
https://dl.acm.org/doi/10.1109/ICDM.2015.90

Index Terms

Fast dendrogram-based OTU clustering using sequence embedding
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Recommendations

DNA clustering and genome complexity

Early global measures of genome complexity (power spectra, the analysis of fluctuations in DNA walks or compositional segmentation) uncovered a high degree of complexity in eukaryotic genome sequences. The main evolutionary mechanisms leading to ...
Clustering 16S rRNA for OTU prediction

Motivation: With the advancements of next-generation sequencing technology, it is now possible to study samples directly obtained from the environment. Particularly, 16S rRNA gene sequences have been frequently used to profile the diversity of ...
Estimating sequence similarity from read sets for clustering next-generation sequencing data

Computing mutual similarity of biological sequences such as DNA molecules is essential for significant biological tasks such as hierarchical clustering of genomes. Current sequencing technologies do not provide the content of entire biological sequences;...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics

September 2014

851 pages

ISBN:9781450328944

DOI:10.1145/2649387

General Chairs:
Pierre Baldi
University of California, Irvine
,
Wei Wang
University of California, Los Angeles

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 September 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

BCB '14

Sponsor:

SIGBio

BCB '14: ACM-BCB '14

September 20 - 23, 2014

California, Newport Beach

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
147
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 07 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Liu YSchmidt B(2017)CUSHAW Suite: Parallel and Efficient Algorithms for NGS Read AlignmentAlgorithms for Next-Generation Sequencing Data10.1007/978-3-319-59826-0_10(203-233)Online publication date: 19-Sep-2017
https://doi.org/10.1007/978-3-319-59826-0_10
Nguyen TKwoh C(2015)Efficient agglomerative hierarchical clustering for biological sequence analysisTENCON 2015 - 2015 IEEE Region 10 Conference10.1109/TENCON.2015.7373194(1-3)Online publication date: Nov-2015
https://doi.org/10.1109/TENCON.2015.7373194
Mao QZheng WWang LCai YMai VSun Y(2015)Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence AnalysisProceedings of the 2015 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2015.90(310-319)Online publication date: 14-Nov-2015
https://dl.acm.org/doi/10.1109/ICDM.2015.90

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents