| On searching compressed string collections cache-obliviously |
| Full text |
Pdf
(244 KB)
|
Source
|
Symposium on Principles of Database Systems
archive
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
table of contents
Vancouver, Canada
SESSION: Searching & clustering
table of contents
Pages 181-190
Year of Publication: 2008
ISBN:978-1-60558-108-8
|
|
Authors
|
|
Paolo Ferragina
|
Università di Pisa, Pisa, Italy
|
|
Roberto Grossi
|
Università di Pisa, Pisa, Italy
|
|
Ankur Gupta
|
Butler University, Indianapolis, IN, USA
|
|
Rahul Shah
|
Louisiana State University, Baton Rouge, LA, USA
|
|
Jeffrey Scott Vitter
|
Purdue University, West Lafayette, IN, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 13, Downloads (12 Months): 42, Citation Count: 0
|
|
|
ABSTRACT
Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1):11--26, 1977.
|
| |
2
|
M. Bender, M. Farach-Colton, and B. Kuszmaul. Cache-oblivious string b-trees. In Proc. ACM PODS, 233--242, 2006.
|
| |
3
|
D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 43:275--292, 2005.
|
| |
4
|
J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. ACM-SIAM SODA, 360--369, 1996.
|
| |
5
|
G. S. Brodal and R. Fagerberg. Cache-oblivious string dictionaries. In Proc. ACM-SIAM SODA, 581--590, 2006.
|
| |
6
|
V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan. A data structure for a sequence of string accesses in external memory. ACM Transactions on Algorithms, 3(1), 2007.
|
| |
7
|
P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236--280, 1999.
|
| |
8
|
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. IEEE FOCS, 184--193, 2005.
|
| |
9
|
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching xml data via two zips. In Proc. WWW, 751--760, 2006.
|
| |
10
|
P. Ferragina and R. Venturini. Compressed permuterm index. In Proc. ACM SIGIR, 535--542, 2007.
|
| |
11
|
M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE FOCS, 285--298, 1999.
|
| |
12
|
A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao. On the size of succinct indices. In Proc. ESA, LNCS 4698, 371--382, 2007.
|
| |
13
|
M. He, J. I. Munro, and S. S. Rao. Succinct ordinal trees based on tree covering. In Proc. ICALP, LNCS 4596, 509--520, 2007.
|
| |
14
|
G. Jacobson. Space-efficient static trees and graphs. In Proc. IEEE FOCS, 549--554, 1989.
|
| |
15
|
J. Jansson, K. Sadakane, and W. Sung. Ultra-succinct representation of ordered trees. In Proc. ACM-SIAM SODA, 575--584, 2007.
|
| |
16
|
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, second edition, 1998.
|
| |
17
|
P. Ko and S. Aluru. Optimal self-adjusting trees for dynamic string data in secondary storage. In Proc. SPIRE, LNCS 4726, 184--194, 2007.
|
| |
18
|
G. Manku, A. Jain, and A.-D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, 141--150, 2007.
|
| |
19
|
K. Mehlhorn and A. K. Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), 301--342, 1990.
|
| |
20
|
J. I. Munro. Succinct data structures. Electr. Notes Theor. Comput. Sci., 91(3), 2004.
|
| |
21
|
G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
|
| |
22
|
R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. ACM-SIAM SODA, 233--242, 2002.
|
| |
23
|
F. Ruskey. Combinatorial Generation, 2007. In preparation.
|
| |
24
|
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.
|
|