skip to main content
10.1145/2433396.2433412acmconferencesArticle/Chapter ViewAbstractPublication PageswsdmConference Proceedingsconference-collections
research-article

Optimizing top-k document retrieval strategies for block-max indexes

Published:04 February 2013Publication History

ABSTRACT

Large web search engines use significant hardware and energy resources to process hundreds of millions of queries each day, and a lot of research has focused on how to improve query processing efficiency. One general class of optimizations called early termination techniques is used in all major engines, and essentially involves computing top results without an exhaustive traversal and scoring of all potentially relevant index entries. Recent work in [9,7] proposed several early termination algorithms for disjunctive top-k query processing, based on a new augmented index structure called Block-Max Index that enables aggressive skipping in the index.

In this paper, we build on this work by studying new algorithms and optimizations for Block-Max indexes that achieve significant performance gains over the work in [9,7]. We start by implementing and comparing Block-Max oriented algorithms based on the well-known Maxscore and WAND approaches. Then we study how to build better Block-Max index structures and design better index-traversal strategies, resulting in new algorithms that achieve a factor of 2 speed-up over the best results in [9] with acceptable space overheads. We also describe and evaluate a hierarchical algorithm for a new recursive Block-Max index structure.

References

  1. V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with errective early termination. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. V. N. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. of the 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-K: Index-access optimized top-k query processing. In Proceedings of the 32th International Conference on Very Large Data Bases, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Blanco and A. Barreiro. Probabilistic static pruning of inverted files. ACM Transactions on Information Systems, 28(1), Jan. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th ACM Conf. on Information and Knowledge Management, 2003 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In Proc. of the 27th Int. Conf. on Data Engineering, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Dean. Challenges in building large-scale information retrieval systems. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. of the 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31:2002, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Jonassen and S. E. Bratsberg. Efficient compressed inverted index skipping for disjunctive text-queries. In Proc. of the 33th European Conf. on Information Retrieval, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Persin, J. Zobel, and R. Sacks-davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47:749--764, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In 1st Latin American Web Congress, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Shan, S. Ding, J. He, H. Yan, and X. Li. Optimized top-k processing with global page scores on block-max indexes. In Proc. of the Fifth Int. Conf. on Web Search and Data Mining, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of the 29th European Conf. on Information Retrieval, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. F. Silvestri and R. Venturini. Vsencoding: efficient coding and fast decoding of integer lists via dynamic programming. In Proc. of the 19th ACM Conf. on Information and Knowledge Management, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. Strohman. Efficient Processing of Complex Features for Information Retrieval. PhD thesis, University of Massachusetts Amherst, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proc. of the 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. Strohman, H. R. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. of the 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Processing and Management, 31(6):831--850, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. L. Wang, J. J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proc. of the 34th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. . H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. of the 18th Int. Conf. on World Wide Web, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. of the 17th Int. Conf. on World Wide Web, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proceedings of the 22th Int. Conf. on Data Engineering, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Optimizing top-k document retrieval strategies for block-max indexes

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
      February 2013
      816 pages
      ISBN:9781450318693
      DOI:10.1145/2433396

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 February 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate498of2,863submissions,17%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader