Abstract
We address the problem of adaptive compression of natural language text, considering the case where the receiver is much less powerful than the sender, as in mobile applications. Our techniques achieve compression ratios around 32% and require very little effort from the receiver. Furthermore, the receiver is not only lighter, but it can also search the compressed text with less work than that necessary to decompress it. This is a novelty in two senses: it breaks the usual compressor/decompressor symmetry typical of adaptive schemes, and it contradicts the long-standing assumption that only semistatic codes could be searched more efficiently than the uncompressed text. Our novel compression methods are preferable in several aspects over the existing adaptive and semistatic compressors for natural language texts.
- Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley Longman, New York, NY. Google ScholarDigital Library
- Bentley, J. L., Sleator, D. D., Tarjan, R. E., and Wei, V. K. 1986. A locally adaptive data compression scheme. Comm. ACM 29, 4, 320--330. Google ScholarDigital Library
- Boyer, R. S. and Moore, J. S. 1977. A fast string searching algorithm. Comm. ACM 20, 10, 762--772. Google ScholarDigital Library
- Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2005a. Compressing dynamic text collections via phrase-based coding. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'05). Lecture Notes in Computer Science, vol. 3652. Springer-Verlag, 462--474. Google ScholarDigital Library
- Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2005b. Efficiently decodable and searchable natural language adaptive compression. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05). ACM Press, 234--241. Google ScholarDigital Library
- Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2007. Lightweight natural language text compression. Inform. Retriev. 10, 1, 1--33. Google ScholarDigital Library
- Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2008. New adaptive compressors for natural language text. Softw. Pract. Exper. 38, 1429--1450. Google ScholarDigital Library
- Burrows, M. and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. rep. 124, Digital Equipment Corporation, Palo Alto, CA.Google Scholar
- Cleary, J. G. and Witten, I. H. 1984. Data compression using adaptive coding and partial string matching. IEEE Trans. Comm. 32, 4, 396--402.Google ScholarCross Ref
- Culpepper, J. and Moffat, A. 2005. Enhanced byte codes with restricted prefix properties. In Proceedings of the 12th International Symposium on String Processing and Information Retrieval (SPIRE'05). Lecture Notes in Computer Science, vol. 3772. Springer-Verlag, 1--12. Google ScholarDigital Library
- Culpepper, J. S. and Moffat, A. 2006. Phrase-based pattern matching in compressed text. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE'06). Lecture Notes in Computer Science, vol. 4209. Springer-Verlag, 337--345. Google ScholarDigital Library
- Fariña, A. 2005. New compression codes for text databases. Ph.D. thesis, Database Laboratory, University of A Coruña, Spain. http://coba.dc.fi.udc.es/~fari/phd/.Google Scholar
- Heaps, H. S. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, NY. Google ScholarDigital Library
- Horspool, R. N. 1980. Practical fast searching in strings. Softw. Pract. Exper. 10, 6, 501--506.Google ScholarCross Ref
- Moffat, A. 1989. Word-based text compression. Softw. Pract. Exper. 19, 2, 185--198. Google ScholarDigital Library
- Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 1998. Fast searching on compressed text allowing errors. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98). ACM Press, 298--306. Google ScholarDigital Library
- Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2, 113--139. Google ScholarDigital Library
- Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Inform. Retriev. 3, 1, 49--77. Google ScholarDigital Library
- Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings—Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK. Google ScholarDigital Library
- Navarro, G. and Tarhio, J. 2005. LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exper. 35, 12, 1107--1130. http://www.dcc.uchile.cl/~gnavarro/software/lzgrep.tar.gz. Google ScholarDigital Library
- Turpin, A. and Moffat, A. 1997a. Efficient approximate adaptive coding. In Proceedings of the Conference on Data Compression (DCC'97). IEEE Computer Society, 357--366. Google ScholarDigital Library
- Turpin, A. and Moffat, A. 1997b. Fast file search using text compression. In Proceedings of the 20th Australian Computer Science Conference (ACSC'97). 1--8.Google Scholar
- Williams, H. E. and Zobel, J. 1999. Compressing integers for fast file access. Comput. J. 42, 3, 193--201.Google ScholarCross Ref
- Wu, S. and Manber, U. 1992a. Agrep—a fast approximate pattern-matching tool. In Proceedings of the USENIX Winter Technical Conference. 153--162.Google Scholar
- Wu, S. and Manber, U. 1992b. Fast text searching allowing errors. Comm. ACM 35, 10, 83--91. Google ScholarDigital Library
- Wu, S. and Manber, U. 1994. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ.Google Scholar
- Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.Google Scholar
- Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23, 3, 337--343.Google ScholarDigital Library
- Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24, 5, 530--536.Google ScholarDigital Library
- Ziviani, N., Moura, E., Navarro, G., and Baeza-Yates, R. 2000. Compression: A key for next-generation text retrieval systems. IEEE Computer 33, 11, 37--44. Google ScholarDigital Library
Index Terms
- Dynamic lightweight text compression
Recommendations
A fast dynamic compression scheme for natural language texts
We adapted Word-based Tagged Code (WBTC) to obtain its dynamic version. The aim of designing a dynamic version of WBTC is to adapt it for real-time transmission. The problem in the semi-static technique is to perform two passes over the source text, and ...
Word-based text compression using the Burrows-Wheeler transform
Block-sorting is an innovative compression mechanism introduced in 1994 by Burrows and Wheeler. It involves three steps: permuting the input one block at a time through the use of the Burrows-Wheeler transform (bwt); applying a move-to-front (mtf) ...
Word-Based Statistical Compressors as Natural Language Compression Boosters
DCC '08: Proceedings of the Data Compression ConferenceSemistatic word-based byte-oriented compression codes are known to be attractive alternatives to compress natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on ...
Comments