Abstract
Real-time search requires to incrementally ingest content updates and almost immediately make them searchable while serving search queries at low latency. This is currently feasible for datasets of moderate size by fully maintaining the index in the main memory of multiple machines. Instead, disk-based methods for incremental index maintenance substantially increase search latency with the index fragmented across multiple disk locations. For the support of fast search over disk-based storage, we take a fresh look at incremental text indexing in the context of current architectural features. We introduce a greedy method called Selective Range Flush (SRF) to contiguously organize the index over disk blocks and dynamically update it at low cost. We show that SRF requires substantial experimental effort to tune specific parameters for performance efficiency. Subsequently, we propose the Unified Range Flush (URF) method, which is conceptually simpler than SRF, achieves similar or better performance with fewer parameters and less tuning, and is amenable to I/O complexity analysis. We implement interesting variations of the two methods in the Proteus prototype search engine that we developed and do extensive experiments with three different Web datasets of size up to 1TB. Across different systems, we show that our methods offer search latency that matches or reduces up to half the lowest achieved by existing disk-based methods. In comparison to an existing method of comparable search latency on the same system, our methods reduce by a factor of 2.0--2.4 the I/O part of build time and by 21--24% the total build time.
- Vo Ngoc Anh and Alistair Moffat. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 372--379. Google ScholarDigital Library
- Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. 2001. Searching the web. ACM Trans. Internet Technol. 1, 1, 2--43. Google ScholarDigital Library
- Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and Fabrizio Silvestri. 2007a. Challenges on distributed web retrieval. In Proceedings of the IEEE International Conference on Data Engineering. 6--20.Google ScholarCross Ref
- Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. 2007b. The impact of caching on search engines. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 183--190. Google ScholarDigital Library
- Luiz Andre Barroso, Jeffrey Dean, and Urs Holzle. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2, 22--28. Google ScholarDigital Library
- Alexandros Batsakis, Randal Burns, Arkady Kanevsky, James Lentini, and Thomas Talpey. 2008. AWOL: An adaptive write optimizations layer. In Proceedings of the USENIX Conference on File and Storage Technologies. 67--80. Google ScholarDigital Library
- Truls A. Bjorklund, Michaela Gotz, and Johannes Gerhke. 2010. Search in social networks with access control. In Proceedings of the International Workshop on Keyword Search on Structured Data. ACM Press, New York, 4:1--4:6. Google ScholarDigital Library
- Allan Borodin and Ran El-Yaniv. 1998. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, UK. Google ScholarDigital Library
- Eric A. Brewer. 2005. Combining systems and databases: A search engine retrospective. In Readings in Database Systems, 4th Ed., Joseph M. Hellerstein and Michael Stonebraker, Eds., MIT Press, Cambridge, MA, 711--724.Google Scholar
- Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the ACM Conference on Information and Knowledge Management. 426--434. Google ScholarDigital Library
- Eric W. Brown, James P. Callan, and W. Bruce Croft. 1994. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20th International Conference on Very Large Data Bases. 192--202. Google ScholarDigital Library
- Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. 2012. Earlybird: Real-time search at twitter. In Proceedings of the IEEE International Conference on Data Engineering. 1360--1369. Google ScholarDigital Library
- Stefan Buttcher and Charles L. A. Clarke. 2008. Hybrid index maintenance for contiguous inverted lists. Inf. Retr. 11, 3, 197--207. Google ScholarDigital Library
- Stefan Buttcher, Charles L. A. Clarke, and Brad Lushman. 2006a. A hybrid approach to index maintenance in dynamic text retrieval systems. In Proceedings of the European Conference on IR Research. 229--240. Google ScholarDigital Library
- Stefan Buttcher, Charles L. A. Clarke, and Brad Lushman. 2006b. Hybrid index maintenance for growing text collections. In Proceedings of the 29th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 356--363. Google ScholarDigital Library
- Marjan Celikik and Hannah Bast. 2009. Fast single-pass construction of a half-inverted index. In Proceedings of the International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 5721, Springer, 194--205. Google ScholarDigital Library
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: A distributed storage system for structured data. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 205--218. Google ScholarDigital Library
- Chun Chen, Feng Li, Beng Chin Ooi, and Sai Wu. 2011. TI: An efficient indexing mechanism for real-time search on tweets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 649--660. Google ScholarDigital Library
- Tzicker Chiueh and Lan Huang. 1999. Efficient real-time index updates in text retrieval systems. Tech. rep. 66, ECSL, Stony Brook University, Stony Brook, NY.Google Scholar
- ClueWeb. 2009. The ClueWeb09 dataset. http://boston.lti.cs.cmu.edu/Data/clueweb09/.Google Scholar
- Doug Cutting and Jan Pedersen. 1990. Optimizations for dynamic inverted index maintenance. In Proceedings of the 13th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 405--411. Google ScholarDigital Library
- Jeffrey Dean and Luiz Andre Barroso. 2013. The tail at scale. Comm. ACM 56, 2, 74--80. Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51, 1, 107--113. Google ScholarDigital Library
- David Geer. 2010. Is it really time for real-time search? Comput. 43, 3, 16--19. Google ScholarDigital Library
- Ruijie Guo, Xueqi Cheng, Hongbo Xu, and Bin Wang. 2007. Efficient on-line index maintenance for dynamic text collections by using dynamic balancing tree. In Proceedings of the ACM Conference on Information and Knowledge Management. 751--759. Google ScholarDigital Library
- Sairam Gurajada and Sreenivasa Kumar. 2009. On-line index maintenance using horizontal partitioning. In Proceedings of the ACM Conference on Information and Knowledge Management. 435--444. Google ScholarDigital Library
- Steffen Heinz and Justin Zobel. 2003. Efficient single-pass index construction for text databases. J. Amer. Soc. Inf. Sci. Technol. 54, 8, 713--729. Google ScholarDigital Library
- Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. Comput. Netw. 33, 1--6, 277--293. Google ScholarDigital Library
- Raj Jain. 1991. The Art of Computer Systems Performance Analysis. Wiley.Google Scholar
- Myeongjae Jeon, Yuxiong He, Sameh Elnikety, Alan L. Cox, and Scott Rixner. 2013. Adaptive parallelism for web search. In Proceedings of the European Conference on Computer Systems. 155--168. Google ScholarDigital Library
- Florian Leibert, Jake Mannix, Jimmy Lin, and Babak Hamadani. 2011. Automatic management of partitioned, replicated search services. In Proceedings of the ACM Symposium on Cloud Computing. 27:1--27:8. Google ScholarDigital Library
- Ronnu Lempel, Yosi Mass, Shila Ofek-Koifman, Yael Petruschka, Dafna Sheinwald, and Ron Sivan. 2007. Just in time indexing for up to the second search. In Proceedings of the ACM Conference on Information and Knowledge Management. 97--106. Google ScholarDigital Library
- Nicholas Lester, Alistair Moffat, and Justin Zobel. 2005. Fast on-line index construction by geometric partitioning. In Proceedings of the ACM Conference on Information and Knowledge Management. 776--783. Google ScholarDigital Library
- Nicholas Lester, Alistair Moffat, and Justin Zobel. 2008. Efficient online index construction for text databases. ACM Trans. Database Syst. 33, 3, 1--33. Google ScholarDigital Library
- Nicholas Lester, Justin Zobel, and Hugh Williams. 2006. Efficient online index maintenance for contiguous inverted lists. Inf. Process. Manag. 42, 4, 916--933. Google ScholarDigital Library
- Nicholas Lester, Justin Zobel, and Hugh Williams. 2004. In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In Proceedings of the Australasian Computer Science Conference. 15--23. Google ScholarDigital Library
- Andrew W. Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, and Ethan L. Miller. 2009. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proceedings of the USENIX Conference on File and Storage Technologies. 153--166. Google ScholarDigital Library
- Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ramesh Agarwal. 2003. Dynamic maintenance of web indexes using landmarks. In Proceedings of the Conference on World Wide Web. 102--111. Google ScholarDigital Library
- Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin C. Webb, and Keb Yocum. 2010. Stateful bulk processing for incremental analytics. In Proceedings of the ACM Symposium on Cloud Computing. 51--62. Google ScholarDigital Library
- Mike Mammarella, Shant Hovsepian, and Eddie Kohler. 2009. Modular data storage with Anvil. In Proceedings of the ACM Symposium on Operating Systems Principles. 147--160. Google ScholarDigital Library
- Giorgos Margaritis and Stergios V. Anastasiadis. 2009. Low-cost management of inverted files for online full-text search. In Proceedings of the ACM Conference on Information and Knowledge Management. 455--464. Google ScholarDigital Library
- Michael Mccandless, Erik Hatcher, and Otis Gospodnetic. 2010. Lucene in Action. Manning Publications, Stamford, CT.Google Scholar
- Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19, 3, 217--241. Google ScholarDigital Library
- Alexandros Ntoulas and Junghoo Cho. 2007. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 191--198. Google ScholarDigital Library
- R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. 1995. Informed prefetching and caching. In Proceedings of the ACM Symposium on Operating Systems Principles. 79--95. Google ScholarDigital Library
- Daniel Peng and Frank Dabek. 2010. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 251--264. Google ScholarDigital Library
- Martin F. Porter. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.Google ScholarCross Ref
- Mohit Saxena, Michael M. Swift, and Yiying Zhang. 2012. FlashTier: A lightweight, consistent and durable storage cache. In Proceedings of the ACM European Conference on Computer Systems. 267--280. Google ScholarDigital Library
- Sam Shah, Craig A. N. Soules, Gregory R. Ganger, and Brian D. Noble. 2007. Using provenance to aid in personal file search. In Proceedings of the USENIX Annual Technical Conference. 171--184. Google ScholarDigital Library
- Trevor Strohman and W. Bruce Croft. 2007. Efficient document retrieval in main memory. In Proceedings of the 30th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'07). 175--182. Google ScholarDigital Library
- Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens. 1994. Incremental updates of inverted lists for text document retrieval. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 289--300. Google ScholarDigital Library
- Trec. 2006. TREC terabyte track. National Institute of Standards and Technology. http://trec.nist.gov/data/terabyte.html.Google Scholar
- Zheng Wei and Joseph Jaja. 2012. An optimized high-throughput strategy for constructing inverted files. IEEE Trans. Parallel Distrib. Syst. 23, 11, 2033--2044. Google ScholarDigital Library
- Wikipedia. 2008. The wikipedia dataset. http://static.wikipedia.org/downloads/2008-06/en/.Google Scholar
- Hugh E. Williams, Justin Zobel, and Dirk Bahle. 2004. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. 22, 4, 573--594. Google ScholarDigital Library
- Wumpus. 2011. Wumpus search engine. http://www.wumpus-search.org.Google Scholar
- Zettair. 2009. The Zettair search engine. RMIT university. http://www.seg.rmit.edu.au/zettair/.Google Scholar
- Mingjie Zhu, Shuming Shi, Nenghai Yu, and Ji-Rong Wen. 2008. Can phrase indexing help to process non-phrase queries. In Proceedings of the ACM Conference on Information and Knowledge Management. 679--688. Google ScholarDigital Library
- Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. Comput. Surv. 38, 2, 6:1--6:56. Google ScholarDigital Library
- Justin Zobel, Alistair Moffat, and Ron Sacks-Davis. 1993. Storage management for files of dynamic records. In Proceedings of the Australian Database Conference. 26--38.Google Scholar
- Zoie. 2008. Zoie real-time search and indexing system built on Apache Lucene. http://code.google.com/p/zoie/wiki/ZoieMergePolicy.Google Scholar
Index Terms
- Incremental Text Indexing for Fast Disk-Based Search
Recommendations
Inverted files versus signature files for text indexing
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We ...
Low-cost management of inverted files for online full-text search
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge managementIn dynamic environments with frequent content updates, we require online full-text search that scales to large data collections and achieves low search latency. Several recent methods that support fast incremental indexing of documents typically keep on ...
Higher reliability redundant disk arrays: Organization, operation, and coding
Parity is a popular form of data protection in redundant arrays of inexpensive/independent disks (RAID). RAID5 dedicates one out of N disks to parity to mask single disk failures, that is, the contents of a block on a failed disk can be reconstructed by ...
Comments