research-article

Optimizing top-k document retrieval strategies for block-max indexes

Authors:
Constantinos Dimopoulos

Polytechnic Institute of NYU, Brooklyn, NY, USA

Polytechnic Institute of NYU, Brooklyn, NY, USA
View Profile

,
Sergey Nepomnyachiy

Polytechnic Institute of NYU, Brooklyn, NY, USA

Polytechnic Institute of NYU, Brooklyn, NY, USA
View Profile

,
Torsten Suel

Polytechnic Institute of NYU, Brooklyn, NY, USA

Polytechnic Institute of NYU, Brooklyn, NY, USA
View Profile

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data miningFebruary 2013Pages 113–122https://doi.org/10.1145/2433396.2433412

Published:04 February 2013Publication History

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

Pages 113–122

ABSTRACT

Large web search engines use significant hardware and energy resources to process hundreds of millions of queries each day, and a lot of research has focused on how to improve query processing efficiency. One general class of optimizations called early termination techniques is used in all major engines, and essentially involves computing top results without an exhaustive traversal and scoring of all potentially relevant index entries. Recent work in [9,7] proposed several early termination algorithms for disjunctive top-k query processing, based on a new augmented index structure called Block-Max Index that enables aggressive skipping in the index.

In this paper, we build on this work by studying new algorithms and optimizations for Block-Max indexes that achieve significant performance gains over the work in [9,7]. We start by implementing and comparing Block-Max oriented algorithms based on the well-known Maxscore and WAND approaches. Then we study how to build better Block-Max index structures and design better index-traversal strategies, resulting in new algorithms that achieve a factor of 2 speed-up over the best results in [9] with acceptable space overheads. We also describe and evaluate a hierarchical algorithm for a new recursive Block-Max index structure.

References

V. N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with errective early termination. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001. Google ScholarDigital Library
V. N. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. of the 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006. Google ScholarDigital Library
R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. Google ScholarDigital Library
H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-K: Index-access optimized top-k query processing. In Proceedings of the 32th International Conference on Very Large Data Bases, 2006. Google ScholarDigital Library
R. Blanco and A. Barreiro. Probabilistic static pruning of inverted files. ACM Transactions on Information Systems, 28(1), Jan. 2010. Google ScholarDigital Library
A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th ACM Conf. on Information and Knowledge Management, 2003 Google ScholarDigital Library
K. Chakrabarti, S. Chaudhuri, and V. Ganti. Interval-based pruning for top-k processing over compressed lists. In Proc. of the 27th Int. Conf. on Data Engineering, 2011. Google ScholarDigital Library
J. Dean. Challenges in building large-scale information retrieval systems. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 2009. Google ScholarDigital Library
S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. In Proc. of the 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2011. Google ScholarDigital Library
R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31:2002, 2002. Google ScholarDigital Library
R. Fagin, D. Carmel, D. Cohen, E. Farchi, M. Herscovici, Y. Maarek, and A. Soffer. Static index pruning for information retrieval systems. In Proceedings of the 24th Annual Int. ACM SIGIR Conference on Research and Development in Inf. Retrieval, 2001. Google ScholarDigital Library
S. Jonassen and S. E. Bratsberg. Efficient compressed inverted index skipping for disjunctive text-queries. In Proc. of the 33th European Conf. on Information Retrieval, 2011. Google ScholarDigital Library
X. Long and T. Suel. Optimized query execution in large search engines with global page ordering. In Proceedings of the 29th International Conference on Very Large Data Bases, 2003. Google ScholarDigital Library
M. Persin, J. Zobel, and R. Sacks-davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47:749--764, 1996. Google ScholarDigital Library
K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In 1st Latin American Web Congress, 2003. Google ScholarDigital Library
D. Shan, S. Ding, J. He, H. Yan, and X. Li. Optimized top-k processing with global page scores on block-max indexes. In Proc. of the Fifth Int. Conf. on Web Search and Data Mining, 2012. Google ScholarDigital Library
F. Silvestri. Sorting out the document identifier assignment problem. In Proc. of the 29th European Conf. on Information Retrieval, 2007. Google ScholarDigital Library
F. Silvestri and R. Venturini. Vsencoding: efficient coding and fast decoding of integer lists via dynamic programming. In Proc. of the 19th ACM Conf. on Information and Knowledge Management, 2010. Google ScholarDigital Library
T. Strohman. Efficient Processing of Complex Features for Information Retrieval. PhD thesis, University of Massachusetts Amherst, 2007. Google ScholarDigital Library
T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proc. of the 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007. Google ScholarDigital Library
T. Strohman, H. R. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. of the 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005. Google ScholarDigital Library
H. R. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Inf. Processing and Management, 31(6):831--850, 1995. Google ScholarDigital Library
L. Wang, J. J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proc. of the 34th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2011. Google ScholarDigital Library
. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999. Google ScholarDigital Library
H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. of the 18th Int. Conf. on World Wide Web, 2009. Google ScholarDigital Library
J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. of the 17th Int. Conf. on World Wide Web, 2008. Google ScholarDigital Library
J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006. Google ScholarDigital Library
M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proceedings of the 22th Int. Conf. on Data Engineering, 2006. Google ScholarDigital Library

Index Terms

Optimizing top-k document retrieval strategies for block-max indexes
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Recommendations

Optimized top-k processing with global page scores on block-max indexes
WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining

Large web search engines are facing formidable performance challenges because they have to process thousands of queries per second on tens of billions of documents, within interactive response time. Among many others, Top-k query processing (also called ...
Read More
Faster top-k document retrieval using block-max indexes
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by ...
Read More
A candidate filtering mechanism for fast top-k query processing on modern cpus
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

A large amount of research has focused on faster methods for finding top-k results in large document collections, one of the main scalability challenges for web search engines. In this paper, we propose a method for accelerating such top-k queries that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining
February 2013
816 pages
ISBN:9781450318693
DOI:10.1145/2433396
General Chairs:
Stefano Leonardi
Sapienza University of Rome, Italy
,
Alessandro Panconesi
Sapienza University of Rome, Italy
,
Program Chairs:
Paolo Ferragina
University of Pisa, Italy
,
Aristides Gionis
Yahoo! Research, Barcelona, Spain
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 February 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
block-max inverted index
docid-oriented block-max index
early termination
top-k query processing
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate498of2,863submissions,17%
Upcoming Conference
WSDM '25

Sponsor:

sigir

sigir

sigir

sigir

The Eighteenth ACM International Conference on Web Search and Data Mining

April 7 - 11, 2025

Hannover , Germany
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 52
  Total Citations
  View Citations
- 473
  Total Downloads
- Downloads (Last 12 months)31
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Optimizing top-k document retrieval strategies for block-max indexes

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

ABSTRACT

References

Cited By

Index Terms

Recommendations

Optimized top-k processing with global page scores on block-max indexes

Faster top-k document retrieval using block-max indexes

A candidate filtering mechanism for fast top-k query processing on modern cpus