ACM Home Page
Please provide us with feedback. Feedback
On searching compressed string collections cache-obliviously
Full text pdf formatPdf (244 KB)
Source
Symposium on Principles of Database Systems archive
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Vancouver, Canada
SESSION: Searching & clustering table of contents
Pages 181-190  
Year of Publication: 2008
ISBN:978-1-60558-108-8
Authors
Paolo Ferragina  Università di Pisa, Pisa, Italy
Roberto Grossi  Università di Pisa, Pisa, Italy
Ankur Gupta  Butler University, Indianapolis, IN, USA
Rahul Shah  Louisiana State University, Baton Rouge, LA, USA
Jeffrey Scott Vitter  Purdue University, West Lafayette, IN, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 42,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376916.1376943
What is a DOI?

ABSTRACT

Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1):11--26, 1977.
 
2
M. Bender, M. Farach-Colton, and B. Kuszmaul. Cache-oblivious string b-trees. In Proc. ACM PODS, 233--242, 2006.
 
3
D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 43:275--292, 2005.
 
4
J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. ACM-SIAM SODA, 360--369, 1996.
 
5
G. S. Brodal and R. Fagerberg. Cache-oblivious string dictionaries. In Proc. ACM-SIAM SODA, 581--590, 2006.
 
6
V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan. A data structure for a sequence of string accesses in external memory. ACM Transactions on Algorithms, 3(1), 2007.
 
7
P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236--280, 1999.
 
8
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. IEEE FOCS, 184--193, 2005.
 
9
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching xml data via two zips. In Proc. WWW, 751--760, 2006.
 
10
P. Ferragina and R. Venturini. Compressed permuterm index. In Proc. ACM SIGIR, 535--542, 2007.
 
11
M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE FOCS, 285--298, 1999.
 
12
A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao. On the size of succinct indices. In Proc. ESA, LNCS 4698, 371--382, 2007.
 
13
M. He, J. I. Munro, and S. S. Rao. Succinct ordinal trees based on tree covering. In Proc. ICALP, LNCS 4596, 509--520, 2007.
 
14
G. Jacobson. Space-efficient static trees and graphs. In Proc. IEEE FOCS, 549--554, 1989.
 
15
J. Jansson, K. Sadakane, and W. Sung. Ultra-succinct representation of ordered trees. In Proc. ACM-SIAM SODA, 575--584, 2007.
 
16
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, second edition, 1998.
 
17
P. Ko and S. Aluru. Optimal self-adjusting trees for dynamic string data in secondary storage. In Proc. SPIRE, LNCS 4726, 184--194, 2007.
 
18
G. Manku, A. Jain, and A.-D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, 141--150, 2007.
 
19
K. Mehlhorn and A. K. Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), 301--342, 1990.
 
20
J. I. Munro. Succinct data structures. Electr. Notes Theor. Comput. Sci., 91(3), 2004.
 
21
G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
 
22
R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. ACM-SIAM SODA, 233--242, 2002.
 
23
F. Ruskey. Combinatorial Generation, 2007. In preparation.
 
24
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.

Collaborative Colleagues:
Paolo Ferragina: colleagues
Roberto Grossi: colleagues
Ankur Gupta: colleagues
Rahul Shah: colleagues
Jeffrey Scott Vitter: colleagues