DOI: 10.1145/2063576.2063646
research-article

Indexes for highly repetitive document collections

Published: 24 October 2011

ABSTRACT

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection.

We also include compressed self-indexes in the comparison. We show that these self-indexes can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
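
To make the contrast between gap-encoding and run-length compression of differential inverted lists concrete, here is a minimal sketch. It is not the paper's actual encoding, which the abstract does not specify. The function names (`to_gaps`, `rle_gaps`, `decode`) and the sample posting list are hypothetical, and document identifiers are assumed to be positive and strictly increasing. The intuition it illustrates is that when a term occurs in many consecutive versions of a document, its gap list contains long runs of 1s, which collapse into a single (gap, run length) pair.

```python
from itertools import groupby

def to_gaps(postings):
    """Differential form of a strictly increasing list of document ids.

    The first gap is taken relative to 0 (an assumption of this sketch).
    """
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def rle_gaps(gaps):
    """Run-length encode the gap list as (gap value, run length) pairs."""
    return [(gap, sum(1 for _ in run)) for gap, run in groupby(gaps)]

def decode(runs):
    """Rebuild the original posting list from the run-length encoded gaps."""
    postings = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings

# Hypothetical posting list: the term occurs in two blocks of consecutive versions.
postings = [3, 4, 5, 6, 7, 20, 21, 22, 23]
gaps = to_gaps(postings)
runs = rle_gaps(gaps)
assert decode(runs) == postings
```

In this toy list, nine gaps shrink to four pairs. A real index would still need an integer code for the pairs and some way to skip within a list during intersection, which this sketch leaves out.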

Published in

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 1,861 of 8,427 submissions (22%)
