skip to main content
research-article

Dynamic lightweight text compression

Published:02 July 2010Publication History
Skip Abstract Section

Abstract

We address the problem of adaptive compression of natural language text, considering the case where the receiver is much less powerful than the sender, as in mobile applications. Our techniques achieve compression ratios around 32% and require very little effort from the receiver. Furthermore, the receiver is not only lighter, but it can also search the compressed text with less work than that necessary to decompress it. This is a novelty in two senses: it breaks the usual compressor/decompressor symmetry typical of adaptive schemes, and it contradicts the long-standing assumption that only semistatic codes could be searched more efficiently than the uncompressed text. Our novel compression methods are preferable in several aspects over the existing adaptive and semistatic compressors for natural language texts.

References

  1. Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley Longman, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Bentley, J. L., Sleator, D. D., Tarjan, R. E., and Wei, V. K. 1986. A locally adaptive data compression scheme. Comm. ACM 29, 4, 320--330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Boyer, R. S. and Moore, J. S. 1977. A fast string searching algorithm. Comm. ACM 20, 10, 762--772. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2005a. Compressing dynamic text collections via phrase-based coding. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'05). Lecture Notes in Computer Science, vol. 3652. Springer-Verlag, 462--474. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2005b. Efficiently decodable and searchable natural language adaptive compression. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05). ACM Press, 234--241. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2007. Lightweight natural language text compression. Inform. Retriev. 10, 1, 1--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Brisaboa, N., Fariña, A., Navarro, G., and Paramá, J. 2008. New adaptive compressors for natural language text. Softw. Pract. Exper. 38, 1429--1450. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Burrows, M. and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. rep. 124, Digital Equipment Corporation, Palo Alto, CA.Google ScholarGoogle Scholar
  9. Cleary, J. G. and Witten, I. H. 1984. Data compression using adaptive coding and partial string matching. IEEE Trans. Comm. 32, 4, 396--402.Google ScholarGoogle ScholarCross RefCross Ref
  10. Culpepper, J. and Moffat, A. 2005. Enhanced byte codes with restricted prefix properties. In Proceedings of the 12th International Symposium on String Processing and Information Retrieval (SPIRE'05). Lecture Notes in Computer Science, vol. 3772. Springer-Verlag, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Culpepper, J. S. and Moffat, A. 2006. Phrase-based pattern matching in compressed text. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE'06). Lecture Notes in Computer Science, vol. 4209. Springer-Verlag, 337--345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Fariña, A. 2005. New compression codes for text databases. Ph.D. thesis, Database Laboratory, University of A Coruña, Spain. http://coba.dc.fi.udc.es/~fari/phd/.Google ScholarGoogle Scholar
  13. Heaps, H. S. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Horspool, R. N. 1980. Practical fast searching in strings. Softw. Pract. Exper. 10, 6, 501--506.Google ScholarGoogle ScholarCross RefCross Ref
  15. Moffat, A. 1989. Word-based text compression. Softw. Pract. Exper. 19, 2, 185--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 1998. Fast searching on compressed text allowing errors. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98). ACM Press, 298--306. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Moura, E., Navarro, G., Ziviani, N., and Baeza-Yates, R. 2000. Fast and flexible word searching on compressed text. ACM Trans. Inform. Syst. 18, 2, 113--139. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Navarro, G., Moura, E., Neubert, M., Ziviani, N., and Baeza-Yates, R. 2000. Adding compression to block addressing inverted indexes. Inform. Retriev. 3, 1, 49--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings—Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Navarro, G. and Tarhio, J. 2005. LZgrep: A Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exper. 35, 12, 1107--1130. http://www.dcc.uchile.cl/~gnavarro/software/lzgrep.tar.gz. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Turpin, A. and Moffat, A. 1997a. Efficient approximate adaptive coding. In Proceedings of the Conference on Data Compression (DCC'97). IEEE Computer Society, 357--366. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Turpin, A. and Moffat, A. 1997b. Fast file search using text compression. In Proceedings of the 20th Australian Computer Science Conference (ACSC'97). 1--8.Google ScholarGoogle Scholar
  23. Williams, H. E. and Zobel, J. 1999. Compressing integers for fast file access. Comput. J. 42, 3, 193--201.Google ScholarGoogle ScholarCross RefCross Ref
  24. Wu, S. and Manber, U. 1992a. Agrep—a fast approximate pattern-matching tool. In Proceedings of the USENIX Winter Technical Conference. 153--162.Google ScholarGoogle Scholar
  25. Wu, S. and Manber, U. 1992b. Fast text searching allowing errors. Comm. ACM 35, 10, 83--91. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Wu, S. and Manber, U. 1994. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ.Google ScholarGoogle Scholar
  27. Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.Google ScholarGoogle Scholar
  28. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23, 3, 337--343.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24, 5, 530--536.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ziviani, N., Moura, E., Navarro, G., and Baeza-Yates, R. 2000. Compression: A key for next-generation text retrieval systems. IEEE Computer 33, 11, 37--44. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Dynamic lightweight text compression

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Information Systems
        ACM Transactions on Information Systems  Volume 28, Issue 3
        June 2010
        231 pages
        ISSN:1046-8188
        EISSN:1558-2868
        DOI:10.1145/1777432
        Issue’s Table of Contents

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 July 2010
        • Accepted: 1 June 2009
        • Revised: 1 April 2009
        • Received: 1 September 2008
        Published in tois Volume 28, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader