Compression of Annotated Nucleotide Sequences

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 3, pp. 447–457, https://doi.org/10.1109/tcbb.2007.1017

Published: 01 July 2007

Abstract

This article introduces an algorithm for the lossless compression of DNA files that contain annotation text in addition to the nucleotide sequence. First, a grammar is designed specifically to capture the regularities of the annotation text. A reversible transformation then uses the grammar rules to represent the original file equivalently as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The cost of encoding the parser's decision sequence is reduced by extending the grammar states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement over the general-purpose methods currently used for DNA files.
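
The decomposition described in the abstract can be illustrated with a minimal Python sketch. The three-rule line grammar, the function names `decompose`, `reconstruct`, and `conditional_entropy_bits`, and the sample record are all assumptions made for illustration; the article's actual grammar and encoders are not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy grammar (an assumption, not the article's grammar).
# Each line is matched by exactly one rule:
#   "S": sequence line, only the letters a/c/g/t
#   "A": annotation line, KEY + one space + free text
#   "T": any other line (no space, possibly empty)

def decompose(text):
    """Split a file into per-rule segment streams plus the parser decisions."""
    decisions = []
    segments = {"key": [], "value": [], "seq": [], "text": []}
    for line in text.split("\n"):
        if line and set(line) <= set("acgt"):
            decisions.append("S")
            segments["seq"].append(line)
        elif " " in line:
            key, _, value = line.partition(" ")
            decisions.append("A")
            segments["key"].append(key)
            segments["value"].append(value)
        else:
            decisions.append("T")
            segments["text"].append(line)
    return decisions, segments

def reconstruct(decisions, segments):
    """Invert decompose: replay the decisions and consume the streams in order."""
    streams = {name: iter(items) for name, items in segments.items()}
    lines = []
    for d in decisions:
        if d == "S":
            lines.append(next(streams["seq"]))
        elif d == "A":
            lines.append(next(streams["key"]) + " " + next(streams["value"]))
        else:
            lines.append(next(streams["text"]))
    return "\n".join(lines)

def conditional_entropy_bits(decisions, order):
    """Idealized cost in bits of the decision stream when each decision is
    coded from the empirical statistics of its `order` preceding decisions.
    The cost of conveying those statistics is ignored."""
    counts = defaultdict(Counter)
    for i, d in enumerate(decisions):
        context = tuple(decisions[max(0, i - order):i])
        counts[context][d] += 1
    bits = 0.0
    for ctx_counts in counts.values():
        n = sum(ctx_counts.values())
        for c in ctx_counts.values():
            bits -= c * math.log2(c / n)
    return bits

# Made-up record, loosely shaped like a flat-file entry.
sample = "LOCUS X1\nDEFINITION toy record\nORIGIN\nacgtacgt\nttga\n//"
decisions, segments = decompose(sample)
assert reconstruct(decisions, segments) == sample  # the transform is reversible
print(decisions)
print(conditional_entropy_bits(decisions, 0), conditional_entropy_bits(decisions, 1))
```

In this sketch each segment stream could be passed to an encoder suited to its content (for example, a text model for the annotation values and a DNA-specific model for the sequence lines), while the decision stream is coded in the context of the preceding decisions. Raising `order` corresponds, loosely, to extending the parser states so that they carry higher-order Markovian dependencies.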

Published in

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 4, Issue 3
July 2007
192 pages
ISSN:1545-5963
Publisher
IEEE Computer Society Press
Washington, DC, United States
Author Tags
E.4 [Data]: Coding and Information Theory | Data compaction and compression
Annotation
Compression
F.4 [Theory of Computation]: Mathematical Logic and Formal Languages | Formal languages
Formal Grammars
G.3 [Mathematics of Computing]: Probability and Statistics | Markov processes
J.3 [Computer Applications]: Life and Medical Sciences | Biology and genetics
Nucleotide sequences