Abstract
This article introduces an algorithm for the lossless compression of DNA files, which contain annotation text besides the nucleotide sequence. First a grammar is specifically designed to capture the regularities of the annotation text. A revertible transformation uses the grammar rules in order to equivalently represent the original file as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The output size of the decision-making process of the grammar is optimized by extending the states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement when compared to the general-purpose methods currently used for DNA files.
- A. Apostolico and A.S. Fraenkel, “Robust Transmission of Unbounded Strings Using Fibonacci Representations,” IEEE Trans. Information Theory, vol. 33, no. 2, pp. 238-245, 1987. Google ScholarDigital Library
- A. Bookstein and S.T. Klein, “Compression, Information Theory and Grammars: A Unified Approach,” ACM Trans. Information Systems, vol. 8, no. 1, pp. 27-49, 1990. Google ScholarDigital Library
- R.D. Cameron, “Source Encoding Using Syntactic Information Source Models,” IEEE Trans. Information Theory, vol. 34, no. 4, pp.843-850, 1988.Google ScholarDigital Library
- X. Chen, S. Kwong, and M. Li, “A Compression Algorithm for DNA Sequences,” IEEE Eng. in Medicine and Biology, pp. 61-66, July/Aug. 2001.Google ScholarCross Ref
- X. Chen, M. Li, B. Ma, and J. Tromp, “DNACompress: Fast and Effective DNA Sequence Compression,” Bioinformatics, vol. 18, pp.1696-1698, 2002.Google ScholarCross Ref
- J. Cheney, “Compressing XML with Multiplexed Hierarchical PPM Models,” Proc. Data Compression Conf. '01, pp. 163-172, 2001. Google ScholarDigital Library
- J.G. Cleary and I.H. Witten, “Data Compression Using Adaptive Coding and Partial String Matching,” IEEE Trans. Comm., vol. 32, no. 4, pp. 396-402, 1984.Google ScholarCross Ref
- G.V. Cormack and R.N. Horspool, “Data Compression Using Dynamic Markov Modelling,” Computer J., vol. 30, no. 6, pp. 541-550, 1987. Google ScholarDigital Library
- S. Grumbach and F. Tahi, “Compression of DNA Sequences,” Proc. Data Compression Conf. '93, pp. 340-350, 1993.Google ScholarCross Ref
- S. Grumbach and F. Tahi, “A New Challenge for Compression Algorithms: Genetic Sequences,” J. Information Processing and Management, vol. 30, no. 6, pp. 875-886, 1994. Google ScholarDigital Library
- J.C. Kieffer and E.-H. Yang, “Grammar-Based Codes: A New Class of Universal Lossless Source Codes,” IEEE Trans. Information Theory, vol. 46, no. 3, pp. 737-754, 2000. Google ScholarDigital Library
- G. Korodi and I. Tabus, “An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression,” ACM Trans. Information Systems, vol. 23, no. 1, pp. 3-34, 2005. Google ScholarDigital Library
- J.M. Lake, “Prediction by Grammatical Match,” Proc. Data Compression Conf. '00, pp. 153-162, 2000. Google ScholarDigital Library
- K. Lanctot, M. Li, and E. Yang, “Estimating DNA Sequence Entropy,” Proc. 11th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 409-418, 2000. Google ScholarDigital Library
- H. Liefke and D. Suciu, “XMill: An Efficient Compressor for XML Data,” Proc. Special Interest Group on Management of Data '00, pp.153-164, 2000. Google ScholarDigital Library
- D. Loewenstern and P. Yianilos, “Significantly Lower Entropy Estimates for Natural DNA Sequences,” Proc. Data Compression Conf. '97, pp. 151-160, 1997. Google ScholarDigital Library
- E. Marsh and N. Sager, “Analysis and Processing of Compact Text,” Proc. Ninth Conf. Computational Linguistics, J. Horecký, ed., vol. 1, pp. 201-206, 1982. Google ScholarDigital Library
- T. Matsumoto, K. Sadakane, and H. Imai, “Biological Sequence Compression Algorithms,” Genome Informatics, vol. 11, pp. 43-52, 2000.Google Scholar
- A. Moffat, “Implementing the PPM Data Compression Scheme,” IEEE Trans. Comm., vol. 38, no. 11, pp. 1917-1921, 1990.Google ScholarCross Ref
- C.G. Nevill-Manning and I.H. Witten, “Compression and Explanation Using Hierarchical Grammars,” Computer J., vol. 40, nos.2/3, pp. 103-113, 1997.Google ScholarCross Ref
- J. Rissanen and G. Langdon, “Arithmetic Coding,” IBM J. Research and Development, vol. 23, no. 2, pp. 149-162, 1979.Google ScholarDigital Library
- É. Rivals, J.P. Delahaye, M. Dauchet, and O. Delgrange, “A Guaranteed Compression Scheme for Repetitive DNA Sequences,” Technical Report IT-285, LIFL Lille I Univ., 1995.Google Scholar
- D. Shkarin, “PPM: One Step to Practicality,” Proc. IEEE Data Compression Conf. '02, pp. 202-211, 2002. Google ScholarDigital Library
- L. Stein, “Genome Annotation: From Sequence to Biology,” Nature Reviews Genetics, vol. 2, no. 7, pp. 493-503, 2001.Google ScholarCross Ref
- I. Tabus, G. Korodi, and J. Rissanen, “DNA Sequence Compression Using the Normalized Maximum Likelihood Model for Discrete Regression,” Proc. Data Compression Conf. '03, pp. 253-262, 2003. Google ScholarDigital Library
- J. Tarhio, “On Compression of Parse Trees,” Proc. Eighth Symp. String Processing and Information Retrieval, pp. 205-211, 2001.Google ScholarCross Ref
- H.E. Williams and J. Zobel, “Compression of Nucleotide Databases for Fast Searching,” Computer Applications in the Biosciences, vol. 13, no. 5, pp. 549-554, 1997.Google Scholar
- Nat'l Center for Biotechnology Information, http://www. ncbi.nlm.nih.gov/, 2007.Google Scholar
- The DDBJ/EMBL/GenBank Feature Table: Definition, http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html, 2007.Google Scholar
Index Terms
- Compression of Annotated Nucleotide Sequences
Recommendations
Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for ...
Classifying Nucleotide Sequences and their Positions of Influenza A Viruses through Several Kernels
ICPRAM 2015: Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1In this paper, we classify nucleotide sequences and their positions of influenza A viruses by using both nucleotide sequence kernels and phylogenetic tree kernels. In the nucleotide sequence kernel, we regard a nucleotide sequence as a vector, a ...
Generalized LR Parsing Algorithm for Grammars with One-Sided Contexts
The Generalized LR parsing algorithm for context-free grammars is notable for having a decent worst-case running time (cubic in the length of the input string, if implemented efficiently), as well as much better performance on "good" grammars. This ...
Comments