skip to main content
article

Fast string sorting using order-preserving compression

Published:31 December 2005Publication History
Skip Abstract Section

Abstract

We give experimental evidence for the benefits of order-preserving compression in sorting algorithms. While, in general, any algorithm might benefit from compressed data because of reduced paging requirements, we identified two natural candidates that would further benefit from order-preserving compression, namely string-oriented sorting algorithms and word-RAM algorithms for keys of bounded length. The word-RAM model has some of the fastest known sorting algorithms in practice. These algorithms are designed for keys of bounded length, usually 32 or 64 bits, which limits their direct applicability for strings. One possibility is to use an order-preserving compression scheme, so that a bounded-key-length algorithm can be applied. For the case of standard algorithms, we took what is considered to be the among the fastest nonword RAM string sorting algorithms, Fast MKQSort, and measured its performance on compressed data. The Fast MKQSort algorithm of Bentley and Sedgewick is optimized to handle text strings. Our experiments show that order-compression techniques results in savings of approximately 15% over the same algorithm on noncompressed data. For the word-RAM, we modified Andersson's sorting algorithm to handle variable-length keys. The resulting algorithm is faster than the standard Unix sort by a factor of 1.5X. Last, we used an order-preserving scheme that is within a constant additive term of the optimal Hu--Tucker, but requires linear time rather than O(mlog m), where m = |Σ| is the size of the alphabet.

References

  1. Andersson, A. 1994. Faster deterministic sorting and searching in linear space. In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science (FOCS 1996). 135--141. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Andersson, A. and Nilsson, S. 1994. A new efficient radix sort. In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS).Google ScholarGoogle Scholar
  3. Andersson, A. and Nilsson, S. 1998. Implementing radixsort. ACM Journal of Experimental Algorithms 3, 7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Andersson, A., Hagerup, T., Nilsson, S., and Raman, R. 1995. Sorting in linear time? In STOC: ACM Symposium on Theory of Computing (STOC). Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Antoshenkov, G. 1997. Dictionary-based order-preserving string compression. VLDB Journal: Very Large Data Bases 6, 1 (Jan.), 26--39. (Electronic edition.) Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bayer, P. J. 1975. Improved bounds on the costs of optimal and balanced binary search trees. Master's thesis. Massachussets Institute of Technology (MIT), Cambridge, MA.Google ScholarGoogle Scholar
  7. Bell, T. C., Cleary, J. G., and Witten, I. H. 1990. Text compression. Prentice Hall, Englewood Cliffs, NJ. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bentley, J. L. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of 8th ACM-SIAM Symposium on Discrete Algorithms (SODA '97). 360--369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Farach, M. and Thorup, M. 1998. String matching in lempel-ziv compressed strings. Algorithmica 20, 4, 388--404.Google ScholarGoogle ScholarCross RefCross Ref
  10. Gilbert, E. N. and Moore, E. F. 1959. Variable-length binary encoding. Bell Systems Technical Journal 38, 933--968.Google ScholarGoogle ScholarCross RefCross Ref
  11. Hu, T. C. 1973. A new proof of the T-C algorithm. SIAM Journal on Applied Mathematics 25, 1 (July), 83--94.Google ScholarGoogle ScholarCross RefCross Ref
  12. Kärkkäinen, J. and Ukknonen, E. 1996. Lempel-ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). 141--155.Google ScholarGoogle Scholar
  13. Knuth, D. E. 1997. The art of computer programming: Fundamental algorithms, 3rd ed, vol. 1. Addison--Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Larmore, L. L. and Przytycka, T. M. 1998. The optimal alphabetic tree problem revisited. Journal of Algorithms 28, 1 (July), 1--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Moura, E., Navarro, G., and Ziviani, N. 1997. Indexing compressed text. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton University Press, Ottawa, Ontario. 95--111.Google ScholarGoogle Scholar
  16. Mumey, B. M. 1992. Some new results on constructing optimal alphabetic binary trees. Master's thesis, University of British Columbia, Vancouver, British Columbia.Google ScholarGoogle Scholar
  17. Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Technical Journal 27, 379--423, 623--656.Google ScholarGoogle ScholarCross RefCross Ref
  18. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, Vol.IT-24 5.Google ScholarGoogle Scholar

Index Terms

  1. Fast string sorting using order-preserving compression

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Journal of Experimental Algorithmics
            ACM Journal of Experimental Algorithmics  Volume 10, Issue
            2005
            291 pages
            ISSN:1084-6654
            EISSN:1084-6654
            DOI:10.1145/1064546
            Issue’s Table of Contents

            Copyright © 2005 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 31 December 2005
            Published in jea Volume 10, Issue

            Qualifiers

            • article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader