skip to main content
article

Compression of Annotated Nucleotide Sequences

Published:01 July 2007Publication History
Skip Abstract Section

Abstract

This article introduces an algorithm for the lossless compression of DNA files, which contain annotation text besides the nucleotide sequence. First a grammar is specifically designed to capture the regularities of the annotation text. A revertible transformation uses the grammar rules in order to equivalently represent the original file as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The output size of the decision-making process of the grammar is optimized by extending the states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement when compared to the general-purpose methods currently used for DNA files.

References

  1. A. Apostolico and A.S. Fraenkel, “Robust Transmission of Unbounded Strings Using Fibonacci Representations,” IEEE Trans. Information Theory, vol. 33, no. 2, pp. 238-245, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Bookstein and S.T. Klein, “Compression, Information Theory and Grammars: A Unified Approach,” ACM Trans. Information Systems, vol. 8, no. 1, pp. 27-49, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R.D. Cameron, “Source Encoding Using Syntactic Information Source Models,” IEEE Trans. Information Theory, vol. 34, no. 4, pp.843-850, 1988.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. X. Chen, S. Kwong, and M. Li, “A Compression Algorithm for DNA Sequences,” IEEE Eng. in Medicine and Biology, pp. 61-66, July/Aug. 2001.Google ScholarGoogle ScholarCross RefCross Ref
  5. X. Chen, M. Li, B. Ma, and J. Tromp, “DNACompress: Fast and Effective DNA Sequence Compression,” Bioinformatics, vol. 18, pp.1696-1698, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  6. J. Cheney, “Compressing XML with Multiplexed Hierarchical PPM Models,” Proc. Data Compression Conf. '01, pp. 163-172, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J.G. Cleary and I.H. Witten, “Data Compression Using Adaptive Coding and Partial String Matching,” IEEE Trans. Comm., vol. 32, no. 4, pp. 396-402, 1984.Google ScholarGoogle ScholarCross RefCross Ref
  8. G.V. Cormack and R.N. Horspool, “Data Compression Using Dynamic Markov Modelling,” Computer J., vol. 30, no. 6, pp. 541-550, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Grumbach and F. Tahi, “Compression of DNA Sequences,” Proc. Data Compression Conf. '93, pp. 340-350, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  10. S. Grumbach and F. Tahi, “A New Challenge for Compression Algorithms: Genetic Sequences,” J. Information Processing and Management, vol. 30, no. 6, pp. 875-886, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J.C. Kieffer and E.-H. Yang, “Grammar-Based Codes: A New Class of Universal Lossless Source Codes,” IEEE Trans. Information Theory, vol. 46, no. 3, pp. 737-754, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. Korodi and I. Tabus, “An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression,” ACM Trans. Information Systems, vol. 23, no. 1, pp. 3-34, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J.M. Lake, “Prediction by Grammatical Match,” Proc. Data Compression Conf. '00, pp. 153-162, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. K. Lanctot, M. Li, and E. Yang, “Estimating DNA Sequence Entropy,” Proc. 11th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 409-418, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. H. Liefke and D. Suciu, “XMill: An Efficient Compressor for XML Data,” Proc. Special Interest Group on Management of Data '00, pp.153-164, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Loewenstern and P. Yianilos, “Significantly Lower Entropy Estimates for Natural DNA Sequences,” Proc. Data Compression Conf. '97, pp. 151-160, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. E. Marsh and N. Sager, “Analysis and Processing of Compact Text,” Proc. Ninth Conf. Computational Linguistics, J. Horecký, ed., vol. 1, pp. 201-206, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. T. Matsumoto, K. Sadakane, and H. Imai, “Biological Sequence Compression Algorithms,” Genome Informatics, vol. 11, pp. 43-52, 2000.Google ScholarGoogle Scholar
  19. A. Moffat, “Implementing the PPM Data Compression Scheme,” IEEE Trans. Comm., vol. 38, no. 11, pp. 1917-1921, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  20. C.G. Nevill-Manning and I.H. Witten, “Compression and Explanation Using Hierarchical Grammars,” Computer J., vol. 40, nos.2/3, pp. 103-113, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  21. J. Rissanen and G. Langdon, “Arithmetic Coding,” IBM J. Research and Development, vol. 23, no. 2, pp. 149-162, 1979.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. É. Rivals, J.P. Delahaye, M. Dauchet, and O. Delgrange, “A Guaranteed Compression Scheme for Repetitive DNA Sequences,” Technical Report IT-285, LIFL Lille I Univ., 1995.Google ScholarGoogle Scholar
  23. D. Shkarin, “PPM: One Step to Practicality,” Proc. IEEE Data Compression Conf. '02, pp. 202-211, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. L. Stein, “Genome Annotation: From Sequence to Biology,” Nature Reviews Genetics, vol. 2, no. 7, pp. 493-503, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  25. I. Tabus, G. Korodi, and J. Rissanen, “DNA Sequence Compression Using the Normalized Maximum Likelihood Model for Discrete Regression,” Proc. Data Compression Conf. '03, pp. 253-262, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Tarhio, “On Compression of Parse Trees,” Proc. Eighth Symp. String Processing and Information Retrieval, pp. 205-211, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  27. H.E. Williams and J. Zobel, “Compression of Nucleotide Databases for Fast Searching,” Computer Applications in the Biosciences, vol. 13, no. 5, pp. 549-554, 1997.Google ScholarGoogle Scholar
  28. Nat'l Center for Biotechnology Information, http://www. ncbi.nlm.nih.gov/, 2007.Google ScholarGoogle Scholar
  29. The DDBJ/EMBL/GenBank Feature Table: Definition, http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html, 2007.Google ScholarGoogle Scholar

Index Terms

  1. Compression of Annotated Nucleotide Sequences

                    Recommendations

                    Comments

                    Login options

                    Check if you have access through your login credentials or your institution to get full access on this article.

                    Sign in

                    Full Access

                    PDF Format

                    View or Download as a PDF file.

                    PDF

                    eReader

                    View online with eReader.

                    eReader