skip to main content
review-article
Open Access

Computational biology in the 21st century: scaling with compressive algorithms

Published:22 July 2016Publication History
Skip Abstract Section

Abstract

Algorithmic advances take advantage of the structure of massive biological data landscape.

References

  1. 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 7422 (2012), 56--65.Google ScholarGoogle ScholarCross RefCross Ref
  2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic local alignment search tool. Journal of Molecular Biology 215, 3 (1990), 403--410.Google ScholarGoogle ScholarCross RefCross Ref
  3. Berger, B., Peng, J. and Singh, M. Computational solutions for omics data. Nature Reviews Genetics 14, 5 (2013), 333--346.Google ScholarGoogle ScholarCross RefCross Ref
  4. Bonfield, J.K. and Mahoney, M.V. Compression of FASTQ and SAM format sequencing data. PLoS ONE 8, 3 (2013), e59190.Google ScholarGoogle ScholarCross RefCross Ref
  5. Bredel, M. and Jacoby, E. Chemogenomics: An emerging strategy for rapid target and drug discovery. Nature Reviews Genetics 5, 4 (2004), 262--275.Google ScholarGoogle ScholarCross RefCross Ref
  6. Bruijn, D.N. A combinatorial problem. In Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, Series A 49, 7 (1946), 758.Google ScholarGoogle Scholar
  7. Buchfink, B., Xie, C., and Huson, D.H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 1 (2015), 59--60.Google ScholarGoogle ScholarCross RefCross Ref
  8. Candes, E.J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory 51, 12 (2005), 4203--4215. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Cao, M., Zhang, H., Park, J., Daniels, N.M., Crovella, M.E., Cowen, L.J. and Hescott, B. Going the distance for protein function prediction: A new distance metric for protein interaction networks. PLoS ONE 8, 10 (2013).Google ScholarGoogle ScholarCross RefCross Ref
  10. Chindelevitch, L., Trigg, J., Regev, A. and Berger, B. An exact arithmetic toolbox for a consistent and reproducible structural analysis of metabolic network models. Nature Communications 5, (2014).Google ScholarGoogle Scholar
  11. Cho, H., Berger, B., and Peng, J. Diffusion component analysis: Unraveling functional topology in biological networks. Research in Computational Molecular Biology. Springer, 2015, 62--64.Google ScholarGoogle Scholar
  12. Daniels, N.M., Gallant, A., Peng, J., Cowen, L.J., Baym, M. and Berger, M. Compressive genomics for protein databases. Bioinformatics 29 (2013), i283--i290.Google ScholarGoogle ScholarCross RefCross Ref
  13. Dobzhansky, T. Nothing in biology makes sense except in the light of evolution (1973).Google ScholarGoogle Scholar
  14. Forsberg, K.J., Reyes, A., Wang, B., Selleck, E.M., Sommer, M.O. and Dantas, G. The shared antibiotic resistome of soil bacteria and human pathogens. Science 337, 6098 (2012), 1107--1111.Google ScholarGoogle ScholarCross RefCross Ref
  15. Hach, F., Hormozdiari, F., Alkan, C., Hormozdiari, F., Birol, I., Eichler, E.E. and Sahinalp, S.C. mrsFAST: A cache-oblivious algorithm for short-read mapping. Nature Methods 7, 8 (2010), 576--577.Google ScholarGoogle ScholarCross RefCross Ref
  16. Hach, F., Sarra, I. Hormozdiari, F., Alkan, C., Eichler, E.E. and Sahinalp, S.C. mrsFAST-Ultra: a compact, SNP-aware mapper for high-performance sequencing applications. Nucleic Acids Research (2014), gku370.Google ScholarGoogle Scholar
  17. Hart, Y., Sheftel, H., Hausser, J., Szekely, P., Ben-Moshe, N.B., Korem, Y., Tendler, A., Mayo, A.E. and Alon, U. Inferring biological tasks using Pareto analysis of high-dimensional data. Nature Methods 12, 3 (2015), 233--235.Google ScholarGoogle ScholarCross RefCross Ref
  18. Indyk, P. and Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing. ACM, 1998, 604--613. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Janda, J.M. and Abbott, S.L. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J. Clinical Microbiology 45, 9 (2007), 2761--2764.Google ScholarGoogle ScholarCross RefCross Ref
  20. Jardine, N. and van Rijsbergen, C.J. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval 7, 5 (1971) 217--240.Google ScholarGoogle ScholarCross RefCross Ref
  21. Liao, C.-S., Lu, K., Baym, M., Singh, R. and Berger, B. IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics 12 (2009), i253--i258. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Loh, P.-R., Baym, M., and Berger, B. Compressive genomics. Nature Biotechnology 30, 7 (2012), 627--630.Google ScholarGoogle ScholarCross RefCross Ref
  23. MacFabe, D.F. Short-chain fatty acid fermentation products of the gut microbiome: Implications in autism spectrum disorders. Microbial Ecology in Health and Disease 23 (2012).Google ScholarGoogle Scholar
  24. Marco-Sola, S., Sammeth, M., Guigó, R. and Ribeca, P. The gem mapper: Fast, accurate and versatile alignment by filtration. Nature Methods 9, 12 (2012), 1185--1188.Google ScholarGoogle ScholarCross RefCross Ref
  25. Marx, V. Biology: The big challenges of big data. Nature 498, 7453 (2013), 255--260.Google ScholarGoogle ScholarCross RefCross Ref
  26. Ochoa, I., Asnani, H., Bharadia, D., Chowdhury, M., Weissman, T. and Yona, G. QualComp: A new lossy compressor for quality scores based on rate distortion theory. BMC bioinformatics 14, 1 (2013), 187.Google ScholarGoogle Scholar
  27. Patro, R. and Kingsford, C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics (2015).Google ScholarGoogle Scholar
  28. Prat, Y., Fromer, M., Linial, N. and Linial, M. Recovering key biological constituents through sparse representation of gene expression. Bioinformatics 5 (2011), 655--661. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Rahman, S.A., Bashton, M., Holliday, G.L., Schrader, R. and Thornton, J.M. Small molecule subgraph detector (SMSD) toolkit. J. Cheminformatics 1, 1 (2009), 1--13.Google ScholarGoogle ScholarCross RefCross Ref
  30. Rubinfeld, R. and Shapira, A. Sublinear time algorithms. SIAM J. Discrete Mathematics 25, 4 (2011), 1562--1588. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Schatz, M.C., Langmead, B. and Salzberg, S.L. Cloud computing and the DNA data race. Nature Biotechnology 28, 7 (2010), 691--693.Google ScholarGoogle ScholarCross RefCross Ref
  32. Singh, R., Xu, J. and Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology detection. In Proceedings of the National Academy of Sciences 105, 35 (2008), 12763--12768.Google ScholarGoogle ScholarCross RefCross Ref
  33. Siragusa, E., Weese, D. and Reinert, K. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Research 41, 7 (2013), e78.Google ScholarGoogle ScholarCross RefCross Ref
  34. Stephens, Z.D. et al. Big data: Astronomical or genomical? PLoS Biol. 13, 7 (2015), e1002195.Google ScholarGoogle ScholarCross RefCross Ref
  35. Uhlmann, J.K. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters 40, 4 (1991), 175--179.Google ScholarGoogle ScholarCross RefCross Ref
  36. Weinstein, J.N. et al. The cancer genome atlas pan-cancer analysis project. Nature Genetics 45, 10 (2013), 1113--1120.Google ScholarGoogle ScholarCross RefCross Ref
  37. Yorukoglu, D., Yu, Y.W., Peng, J. and Berger, B. Compressive mapping for next-generation sequencing Nature Biotechnology 4 (2016), 374--376.Google ScholarGoogle Scholar
  38. Yu, Y.W., Daniels, N., Danko, D.C. and Berger, B. Entropy-scaling search of massive biological data. Cell Systems 1, 2 (2015), 130--140.Google ScholarGoogle ScholarCross RefCross Ref
  39. Yu, Y.W., Yorukoglu, D., Peng, J. and Berger, B. Quality score compression improves genotyping accuracy. Nature Biotechnology 33, 3 (2015), 240--243.Google ScholarGoogle ScholarCross RefCross Ref
  40. Zhao, Y., Tang, H. and Ye, Y. RAPSearch2: A fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 28, 1 (2012), 125--126. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Computational biology in the 21st century: scaling with compressive algorithms

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Communications of the ACM
          Communications of the ACM  Volume 59, Issue 8
          August 2016
          94 pages
          ISSN:0001-0782
          EISSN:1557-7317
          DOI:10.1145/2975594
          • Editor:
          • Moshe Y. Vardi
          Issue’s Table of Contents

          Copyright © 2016 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 July 2016

          Check for updates

          Qualifiers

          • review-article
          • Popular
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format .

        View HTML Format