ABSTRACT
Large web search engines use significant hardware and energy resources to process hundreds of millions of queries each day, and a lot of research has focused on how to improve query processing efficiency. One general class of optimizations called early termination techniques is used in all major engines, and essentially involves computing top results without an exhaustive traversal and scoring of all potentially relevant index entries. Recent work in [9,7] proposed several early termination algorithms for disjunctive top-k query processing, based on a new augmented index structure called Block-Max Index that enables aggressive skipping in the index.
In this paper, we build on this work by studying new algorithms and optimizations for Block-Max indexes that achieve significant performance gains over the work in [9,7]. We start by implementing and comparing Block-Max oriented algorithms based on the well-known Maxscore and WAND approaches. Then we study how to build better Block-Max index structures and design better index-traversal strategies, resulting in new algorithms that achieve a factor of 2 speed-up over the best results in [9] with acceptable space overheads. We also describe and evaluate a hierarchical algorithm for a new recursive Block-Max index structure.
- V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with errective early termination. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001. Google ScholarDigital Library
- V. N. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. of the 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006. Google ScholarDigital Library
- R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. Google ScholarDigital Library
- H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-K: Index-access optimized top-k query processing. In Proceedings of the 32th International Conference on Very Large Data Bases, 2006. Google ScholarDigital Library
- R. Blanco and A. Barreiro. Probabilistic static pruning of inverted files. ACM Transactions on Information Systems, 28(1), Jan. 2010. Google ScholarDigital Library
- A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th ACM Conf. on Information and Knowledge Management, 2003 Google ScholarDigital Library
- K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In Proc. of the 27th Int. Conf. on Data Engineering, 2011. Google ScholarDigital Library
- J. Dean. Challenges in building large-scale information retrieval systems. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009. Google ScholarDigital Library
- S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. of the 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2011. Google ScholarDigital Library
- R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31:2002, 2002. Google ScholarDigital Library
- R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001. Google ScholarDigital Library
- S. Jonassen and S. E. Bratsberg. Efficient compressed inverted index skipping for disjunctive text-queries. In Proc. of the 33th European Conf. on Information Retrieval, 2011. Google ScholarDigital Library
- X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003. Google ScholarDigital Library
- M. Persin, J. Zobel, and R. Sacks-davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47:749--764, 1996. Google ScholarDigital Library
- K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In 1st Latin American Web Congress, 2003. Google ScholarDigital Library
- D. Shan, S. Ding, J. He, H. Yan, and X. Li. Optimized top-k processing with global page scores on block-max indexes. In Proc. of the Fifth Int. Conf. on Web Search and Data Mining, 2012. Google ScholarDigital Library
- F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of the 29th European Conf. on Information Retrieval, 2007. Google ScholarDigital Library
- F. Silvestri and R. Venturini. Vsencoding: efficient coding and fast decoding of integer lists via dynamic programming. In Proc. of the 19th ACM Conf. on Information and Knowledge Management, 2010. Google ScholarDigital Library
- T. Strohman. Efficient Processing of Complex Features for Information Retrieval. PhD thesis, University of Massachusetts Amherst, 2007. Google ScholarDigital Library
- T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proc. of the 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007. Google ScholarDigital Library
- T. Strohman, H. R. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. of the 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005. Google ScholarDigital Library
- H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Processing and Management, 31(6):831--850, 1995. Google ScholarDigital Library
- L. Wang, J. J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proc. of the 34th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2011. Google ScholarDigital Library
- . H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999. Google ScholarDigital Library
- H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. of the 18th Int. Conf. on World Wide Web, 2009. Google ScholarDigital Library
- J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. of the 17th Int. Conf. on World Wide Web, 2008. Google ScholarDigital Library
- J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006. Google ScholarDigital Library
- M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proceedings of the 22th Int. Conf. on Data Engineering, 2006. Google ScholarDigital Library
Index Terms
- Optimizing top-k document retrieval strategies for block-max indexes
Recommendations
Optimized top-k processing with global page scores on block-max indexes
WSDM '12: Proceedings of the fifth ACM international conference on Web search and data miningLarge web search engines are facing formidable performance challenges because they have to process thousands of queries per second on tens of billions of documents, within interactive response time. Among many others, Top-k query processing (also called ...
Faster top-k document retrieval using block-max indexes
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information RetrievalLarge search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by ...
A candidate filtering mechanism for fast top-k query processing on modern cpus
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrievalA large amount of research has focused on faster methods for finding top-k results in large document collections, one of the main scalability challenges for web search engines. In this paper, we propose a method for accelerating such top-k queries that ...
Comments