Abstract
We give experimental evidence for the benefits of order-preserving compression in sorting algorithms. While, in general, any algorithm might benefit from compressed data because of reduced paging requirements, we identified two natural candidates that would further benefit from order-preserving compression, namely string-oriented sorting algorithms and word-RAM algorithms for keys of bounded length. The word-RAM model has some of the fastest known sorting algorithms in practice. These algorithms are designed for keys of bounded length, usually 32 or 64 bits, which limits their direct applicability for strings. One possibility is to use an order-preserving compression scheme, so that a bounded-key-length algorithm can be applied. For the case of standard algorithms, we took what is considered to be the among the fastest nonword RAM string sorting algorithms, Fast MKQSort, and measured its performance on compressed data. The Fast MKQSort algorithm of Bentley and Sedgewick is optimized to handle text strings. Our experiments show that order-compression techniques results in savings of approximately 15% over the same algorithm on noncompressed data. For the word-RAM, we modified Andersson's sorting algorithm to handle variable-length keys. The resulting algorithm is faster than the standard Unix sort by a factor of 1.5X. Last, we used an order-preserving scheme that is within a constant additive term of the optimal Hu--Tucker, but requires linear time rather than O(mlog m), where m = |Σ| is the size of the alphabet.
- Andersson, A. 1994. Faster deterministic sorting and searching in linear space. In Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science (FOCS 1996). 135--141. Google ScholarDigital Library
- Andersson, A. and Nilsson, S. 1994. A new efficient radix sort. In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS).Google Scholar
- Andersson, A. and Nilsson, S. 1998. Implementing radixsort. ACM Journal of Experimental Algorithms 3, 7. Google ScholarDigital Library
- Andersson, A., Hagerup, T., Nilsson, S., and Raman, R. 1995. Sorting in linear time? In STOC: ACM Symposium on Theory of Computing (STOC). Google ScholarDigital Library
- Antoshenkov, G. 1997. Dictionary-based order-preserving string compression. VLDB Journal: Very Large Data Bases 6, 1 (Jan.), 26--39. (Electronic edition.) Google ScholarDigital Library
- Bayer, P. J. 1975. Improved bounds on the costs of optimal and balanced binary search trees. Master's thesis. Massachussets Institute of Technology (MIT), Cambridge, MA.Google Scholar
- Bell, T. C., Cleary, J. G., and Witten, I. H. 1990. Text compression. Prentice Hall, Englewood Cliffs, NJ. Google ScholarDigital Library
- Bentley, J. L. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of 8th ACM-SIAM Symposium on Discrete Algorithms (SODA '97). 360--369. Google ScholarDigital Library
- Farach, M. and Thorup, M. 1998. String matching in lempel-ziv compressed strings. Algorithmica 20, 4, 388--404.Google ScholarCross Ref
- Gilbert, E. N. and Moore, E. F. 1959. Variable-length binary encoding. Bell Systems Technical Journal 38, 933--968.Google ScholarCross Ref
- Hu, T. C. 1973. A new proof of the T-C algorithm. SIAM Journal on Applied Mathematics 25, 1 (July), 83--94.Google ScholarCross Ref
- Kärkkäinen, J. and Ukknonen, E. 1996. Lempel-ziv parsing and sublinear-size index structures for string matching. In Proceedings of the 3rd South American Workshop on String Processing (WSP '96). 141--155.Google Scholar
- Knuth, D. E. 1997. The art of computer programming: Fundamental algorithms, 3rd ed, vol. 1. Addison--Wesley, Reading, MA. Google ScholarDigital Library
- Larmore, L. L. and Przytycka, T. M. 1998. The optimal alphabetic tree problem revisited. Journal of Algorithms 28, 1 (July), 1--20. Google ScholarDigital Library
- Moura, E., Navarro, G., and Ziviani, N. 1997. Indexing compressed text. In Proceedings of the 4th South American Workshop on String Processing (WSP '97). Carleton University Press, Ottawa, Ontario. 95--111.Google Scholar
- Mumey, B. M. 1992. Some new results on constructing optimal alphabetic binary trees. Master's thesis, University of British Columbia, Vancouver, British Columbia.Google Scholar
- Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Technical Journal 27, 379--423, 623--656.Google ScholarCross Ref
- Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, Vol.IT-24 5.Google Scholar
Index Terms
- Fast string sorting using order-preserving compression
Recommendations
Tighter upper bound for sorting permutations with prefix transpositions
Permutations are sequences where each symbol in the given alphabet Σ appears exactly once. A transposition is an operation that exchanges two adjacent sublists in a permutation; if one of these sublists is restricted to be a prefix then one obtains a ...
Faster deterministic sorting and searching in linear space
FOCS '96: Proceedings of the 37th Annual Symposium on Foundations of Computer ScienceWe present a significant improvement on linear space deterministic sorting and searching. On a unit-cost RAM with word size w, an ordered set of n w-bit keys (viewed as binary strings or integers) can be maintained in O(min{[/spl radic/(logn)][logn/logw+...
Towards optimal packed string matching
In the packed string matching problem, it is assumed that each machine word can accommodate up to @a characters, thus an n-character string occupies n/@a memory words. (a) We extend the Crochemore-Perrin constant-space O(n)-time string-matching ...
Comments