skip to main content
research-article

Incremental Text Indexing for Fast Disk-Based Search

Published:08 July 2014Publication History
Skip Abstract Section

Abstract

Real-time search requires to incrementally ingest content updates and almost immediately make them searchable while serving search queries at low latency. This is currently feasible for datasets of moderate size by fully maintaining the index in the main memory of multiple machines. Instead, disk-based methods for incremental index maintenance substantially increase search latency with the index fragmented across multiple disk locations. For the support of fast search over disk-based storage, we take a fresh look at incremental text indexing in the context of current architectural features. We introduce a greedy method called Selective Range Flush (SRF) to contiguously organize the index over disk blocks and dynamically update it at low cost. We show that SRF requires substantial experimental effort to tune specific parameters for performance efficiency. Subsequently, we propose the Unified Range Flush (URF) method, which is conceptually simpler than SRF, achieves similar or better performance with fewer parameters and less tuning, and is amenable to I/O complexity analysis. We implement interesting variations of the two methods in the Proteus prototype search engine that we developed and do extensive experiments with three different Web datasets of size up to 1TB. Across different systems, we show that our methods offer search latency that matches or reduces up to half the lowest achieved by existing disk-based methods. In comparison to an existing method of comparable search latency on the same system, our methods reduce by a factor of 2.0--2.4 the I/O part of build time and by 21--24% the total build time.

References

  1. Vo Ngoc Anh and Alistair Moffat. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 372--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. 2001. Searching the web. ACM Trans. Internet Technol. 1, 1, 2--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and Fabrizio Silvestri. 2007a. Challenges on distributed web retrieval. In Proceedings of the IEEE International Conference on Data Engineering. 6--20.Google ScholarGoogle ScholarCross RefCross Ref
  4. Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. 2007b. The impact of caching on search engines. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 183--190. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Luiz Andre Barroso, Jeffrey Dean, and Urs Holzle. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2, 22--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Alexandros Batsakis, Randal Burns, Arkady Kanevsky, James Lentini, and Thomas Talpey. 2008. AWOL: An adaptive write optimizations layer. In Proceedings of the USENIX Conference on File and Storage Technologies. 67--80. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Truls A. Bjorklund, Michaela Gotz, and Johannes Gerhke. 2010. Search in social networks with access control. In Proceedings of the International Workshop on Keyword Search on Structured Data. ACM Press, New York, 4:1--4:6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Allan Borodin and Ran El-Yaniv. 1998. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, UK. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Eric A. Brewer. 2005. Combining systems and databases: A search engine retrospective. In Readings in Database Systems, 4th Ed., Joseph M. Hellerstein and Michael Stonebraker, Eds., MIT Press, Cambridge, MA, 711--724.Google ScholarGoogle Scholar
  10. Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the ACM Conference on Information and Knowledge Management. 426--434. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Eric W. Brown, James P. Callan, and W. Bruce Croft. 1994. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20th International Conference on Very Large Data Bases. 192--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. 2012. Earlybird: Real-time search at twitter. In Proceedings of the IEEE International Conference on Data Engineering. 1360--1369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Stefan Buttcher and Charles L. A. Clarke. 2008. Hybrid index maintenance for contiguous inverted lists. Inf. Retr. 11, 3, 197--207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Stefan Buttcher, Charles L. A. Clarke, and Brad Lushman. 2006a. A hybrid approach to index maintenance in dynamic text retrieval systems. In Proceedings of the European Conference on IR Research. 229--240. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Stefan Buttcher, Charles L. A. Clarke, and Brad Lushman. 2006b. Hybrid index maintenance for growing text collections. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 356--363. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Marjan Celikik and Hannah Bast. 2009. Fast single-pass construction of a half-inverted index. In Proceedings of the International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 5721, Springer, 194--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: A distributed storage system for structured data. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 205--218. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Chun Chen, Feng Li, Beng Chin Ooi, and Sai Wu. 2011. TI: An efficient indexing mechanism for real-time search on tweets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 649--660. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Tzicker Chiueh and Lan Huang. 1999. Efficient real-time index updates in text retrieval systems. Tech. rep. 66, ECSL, Stony Brook University, Stony Brook, NY.Google ScholarGoogle Scholar
  20. ClueWeb. 2009. The ClueWeb09 dataset. http://boston.lti.cs.cmu.edu/Data/clueweb09/.Google ScholarGoogle Scholar
  21. Doug Cutting and Jan Pedersen. 1990. Optimizations for dynamic inverted index maintenance. In Proceedings of the 13th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 405--411. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Jeffrey Dean and Luiz Andre Barroso. 2013. The tail at scale. Comm. ACM 56, 2, 74--80. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51, 1, 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. David Geer. 2010. Is it really time for real-time search? Comput. 43, 3, 16--19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ruijie Guo, Xueqi Cheng, Hongbo Xu, and Bin Wang. 2007. Efficient on-line index maintenance for dynamic text collections by using dynamic balancing tree. In Proceedings of the ACM Conference on Information and Knowledge Management. 751--759. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Sairam Gurajada and Sreenivasa Kumar. 2009. On-line index maintenance using horizontal partitioning. In Proceedings of the ACM Conference on Information and Knowledge Management. 435--444. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Steffen Heinz and Justin Zobel. 2003. Efficient single-pass index construction for text databases. J. Amer. Soc. Inf. Sci. Technol. 54, 8, 713--729. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. Comput. Netw. 33, 1--6, 277--293. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Raj Jain. 1991. The Art of Computer Systems Performance Analysis. Wiley.Google ScholarGoogle Scholar
  30. Myeongjae Jeon, Yuxiong He, Sameh Elnikety, Alan L. Cox, and Scott Rixner. 2013. Adaptive parallelism for web search. In Proceedings of the European Conference on Computer Systems. 155--168. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Florian Leibert, Jake Mannix, Jimmy Lin, and Babak Hamadani. 2011. Automatic management of partitioned, replicated search services. In Proceedings of the ACM Symposium on Cloud Computing. 27:1--27:8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Ronnu Lempel, Yosi Mass, Shila Ofek-Koifman, Yael Petruschka, Dafna Sheinwald, and Ron Sivan. 2007. Just in time indexing for up to the second search. In Proceedings of the ACM Conference on Information and Knowledge Management. 97--106. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Nicholas Lester, Alistair Moffat, and Justin Zobel. 2005. Fast on-line index construction by geometric partitioning. In Proceedings of the ACM Conference on Information and Knowledge Management. 776--783. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Nicholas Lester, Alistair Moffat, and Justin Zobel. 2008. Efficient online index construction for text databases. ACM Trans. Database Syst. 33, 3, 1--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Nicholas Lester, Justin Zobel, and Hugh Williams. 2006. Efficient online index maintenance for contiguous inverted lists. Inf. Process. Manag. 42, 4, 916--933. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Nicholas Lester, Justin Zobel, and Hugh Williams. 2004. In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In Proceedings of the Australasian Computer Science Conference. 15--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Andrew W. Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, and Ethan L. Miller. 2009. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proceedings of the USENIX Conference on File and Storage Technologies. 153--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ramesh Agarwal. 2003. Dynamic maintenance of web indexes using landmarks. In Proceedings of the Conference on World Wide Web. 102--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin C. Webb, and Keb Yocum. 2010. Stateful bulk processing for incremental analytics. In Proceedings of the ACM Symposium on Cloud Computing. 51--62. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Mike Mammarella, Shant Hovsepian, and Eddie Kohler. 2009. Modular data storage with Anvil. In Proceedings of the ACM Symposium on Operating Systems Principles. 147--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Giorgos Margaritis and Stergios V. Anastasiadis. 2009. Low-cost management of inverted files for online full-text search. In Proceedings of the ACM Conference on Information and Knowledge Management. 455--464. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Michael Mccandless, Erik Hatcher, and Otis Gospodnetic. 2010. Lucene in Action. Manning Publications, Stamford, CT.Google ScholarGoogle Scholar
  43. Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19, 3, 217--241. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Alexandros Ntoulas and Junghoo Cho. 2007. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 191--198. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. 1995. Informed prefetching and caching. In Proceedings of the ACM Symposium on Operating Systems Principles. 79--95. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Daniel Peng and Frank Dabek. 2010. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 251--264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Martin F. Porter. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.Google ScholarGoogle ScholarCross RefCross Ref
  48. Mohit Saxena, Michael M. Swift, and Yiying Zhang. 2012. FlashTier: A lightweight, consistent and durable storage cache. In Proceedings of the ACM European Conference on Computer Systems. 267--280. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Sam Shah, Craig A. N. Soules, Gregory R. Ganger, and Brian D. Noble. 2007. Using provenance to aid in personal file search. In Proceedings of the USENIX Annual Technical Conference. 171--184. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Trevor Strohman and W. Bruce Croft. 2007. Efficient document retrieval in main memory. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'07). 175--182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens. 1994. Incremental updates of inverted lists for text document retrieval. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 289--300. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Trec. 2006. TREC terabyte track. National Institute of Standards and Technology. http://trec.nist.gov/data/terabyte.html.Google ScholarGoogle Scholar
  53. Zheng Wei and Joseph Jaja. 2012. An optimized high-throughput strategy for constructing inverted files. IEEE Trans. Parallel Distrib. Syst. 23, 11, 2033--2044. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Wikipedia. 2008. The wikipedia dataset. http://static.wikipedia.org/downloads/2008-06/en/.Google ScholarGoogle Scholar
  55. Hugh E. Williams, Justin Zobel, and Dirk Bahle. 2004. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. 22, 4, 573--594. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Wumpus. 2011. Wumpus search engine. http://www.wumpus-search.org.Google ScholarGoogle Scholar
  57. Zettair. 2009. The Zettair search engine. RMIT university. http://www.seg.rmit.edu.au/zettair/.Google ScholarGoogle Scholar
  58. Mingjie Zhu, Shuming Shi, Nenghai Yu, and Ji-Rong Wen. 2008. Can phrase indexing help to process non-phrase queries. In Proceedings of the ACM Conference on Information and Knowledge Management. 679--688. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. Comput. Surv. 38, 2, 6:1--6:56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Justin Zobel, Alistair Moffat, and Ron Sacks-Davis. 1993. Storage management for files of dynamic records. In Proceedings of the Australian Database Conference. 26--38.Google ScholarGoogle Scholar
  61. Zoie. 2008. Zoie real-time search and indexing system built on Apache Lucene. http://code.google.com/p/zoie/wiki/ZoieMergePolicy.Google ScholarGoogle Scholar

Index Terms

  1. Incremental Text Indexing for Fast Disk-Based Search

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on the Web
            ACM Transactions on the Web  Volume 8, Issue 3
            June 2014
            256 pages
            ISSN:1559-1131
            EISSN:1559-114X
            DOI:10.1145/2639948
            Issue’s Table of Contents

            Copyright © 2014 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 8 July 2014
            • Accepted: 1 December 2013
            • Revised: 1 November 2013
            • Received: 1 September 2012
            Published in tweb Volume 8, Issue 3

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article
            • Research
            • Refereed

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader