research-article

Incremental Text Indexing for Fast Disk-Based Search

Authors:
Giorgos Margaritis

University of Ioannina, Ioannina, Greece

University of Ioannina, Ioannina, Greece
View Profile

,
Stergios V. Anastasiadis

University of Ioannina, Ioannina, Greece

University of Ioannina, Ioannina, Greece
View Profile

Authors Info & Claims

ACM Transactions on the Web Volume 8 Issue 3Article No.: 16pp 1–31https://doi.org/10.1145/2560800

Published:08 July 2014Publication History

ACM Transactions on the Web

Abstract

Real-time search requires to incrementally ingest content updates and almost immediately make them searchable while serving search queries at low latency. This is currently feasible for datasets of moderate size by fully maintaining the index in the main memory of multiple machines. Instead, disk-based methods for incremental index maintenance substantially increase search latency with the index fragmented across multiple disk locations. For the support of fast search over disk-based storage, we take a fresh look at incremental text indexing in the context of current architectural features. We introduce a greedy method called Selective Range Flush (SRF) to contiguously organize the index over disk blocks and dynamically update it at low cost. We show that SRF requires substantial experimental effort to tune specific parameters for performance efficiency. Subsequently, we propose the Unified Range Flush (URF) method, which is conceptually simpler than SRF, achieves similar or better performance with fewer parameters and less tuning, and is amenable to I/O complexity analysis. We implement interesting variations of the two methods in the Proteus prototype search engine that we developed and do extensive experiments with three different Web datasets of size up to 1TB. Across different systems, we show that our methods offer search latency that matches or reduces up to half the lowest achieved by existing disk-based methods. In comparison to an existing method of comparable search latency on the same system, our methods reduce by a factor of 2.0--2.4 the I/O part of build time and by 21--24% the total build time.

References

Vo Ngoc Anh and Alistair Moffat. 2006. Pruned query evaluation using pre-computed impacts. In Proceedings of the 29^th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 372--379. Google ScholarDigital Library
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. 2001. Searching the web. ACM Trans. Internet Technol. 1, 1, 2--43. Google ScholarDigital Library
Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and Fabrizio Silvestri. 2007a. Challenges on distributed web retrieval. In Proceedings of the IEEE International Conference on Data Engineering. 6--20.Google ScholarCross Ref
Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. 2007b. The impact of caching on search engines. In Proceedings of the 30^th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 183--190. Google ScholarDigital Library
Luiz Andre Barroso, Jeffrey Dean, and Urs Holzle. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2, 22--28. Google ScholarDigital Library
Alexandros Batsakis, Randal Burns, Arkady Kanevsky, James Lentini, and Thomas Talpey. 2008. AWOL: An adaptive write optimizations layer. In Proceedings of the USENIX Conference on File and Storage Technologies. 67--80. Google ScholarDigital Library
Truls A. Bjorklund, Michaela Gotz, and Johannes Gerhke. 2010. Search in social networks with access control. In Proceedings of the International Workshop on Keyword Search on Structured Data. ACM Press, New York, 4:1--4:6. Google ScholarDigital Library
Allan Borodin and Ran El-Yaniv. 1998. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, UK. Google ScholarDigital Library
Eric A. Brewer. 2005. Combining systems and databases: A search engine retrospective. In Readings in Database Systems, 4^th Ed., Joseph M. Hellerstein and Michael Stonebraker, Eds., MIT Press, Cambridge, MA, 711--724.Google Scholar
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the ACM Conference on Information and Knowledge Management. 426--434. Google ScholarDigital Library
Eric W. Brown, James P. Callan, and W. Bruce Croft. 1994. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20^th International Conference on Very Large Data Bases. 192--202. Google ScholarDigital Library
Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. 2012. Earlybird: Real-time search at twitter. In Proceedings of the IEEE International Conference on Data Engineering. 1360--1369. Google ScholarDigital Library
Stefan Buttcher and Charles L. A. Clarke. 2008. Hybrid index maintenance for contiguous inverted lists. Inf. Retr. 11, 3, 197--207. Google ScholarDigital Library
Stefan Buttcher, Charles L. A. Clarke, and Brad Lushman. 2006a. A hybrid approach to index maintenance in dynamic text retrieval systems. In Proceedings of the European Conference on IR Research. 229--240. Google ScholarDigital Library
Stefan Buttcher, Charles L. A. Clarke, and Brad Lushman. 2006b. Hybrid index maintenance for growing text collections. In Proceedings of the 29^th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 356--363. Google ScholarDigital Library
Marjan Celikik and Hannah Bast. 2009. Fast single-pass construction of a half-inverted index. In Proceedings of the International Symposium on String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 5721, Springer, 194--205. Google ScholarDigital Library
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. Bigtable: A distributed storage system for structured data. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 205--218. Google ScholarDigital Library
Chun Chen, Feng Li, Beng Chin Ooi, and Sai Wu. 2011. TI: An efficient indexing mechanism for real-time search on tweets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 649--660. Google ScholarDigital Library
Tzicker Chiueh and Lan Huang. 1999. Efficient real-time index updates in text retrieval systems. Tech. rep. 66, ECSL, Stony Brook University, Stony Brook, NY.Google Scholar
ClueWeb. 2009. The ClueWeb09 dataset. http://boston.lti.cs.cmu.edu/Data/clueweb09/.Google Scholar
Doug Cutting and Jan Pedersen. 1990. Optimizations for dynamic inverted index maintenance. In Proceedings of the 13^th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 405--411. Google ScholarDigital Library
Jeffrey Dean and Luiz Andre Barroso. 2013. The tail at scale. Comm. ACM 56, 2, 74--80. Google ScholarDigital Library
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51, 1, 107--113. Google ScholarDigital Library
David Geer. 2010. Is it really time for real-time search&quest; Comput. 43, 3, 16--19. Google ScholarDigital Library
Ruijie Guo, Xueqi Cheng, Hongbo Xu, and Bin Wang. 2007. Efficient on-line index maintenance for dynamic text collections by using dynamic balancing tree. In Proceedings of the ACM Conference on Information and Knowledge Management. 751--759. Google ScholarDigital Library
Sairam Gurajada and Sreenivasa Kumar. 2009. On-line index maintenance using horizontal partitioning. In Proceedings of the ACM Conference on Information and Knowledge Management. 435--444. Google ScholarDigital Library
Steffen Heinz and Justin Zobel. 2003. Efficient single-pass index construction for text databases. J. Amer. Soc. Inf. Sci. Technol. 54, 8, 713--729. Google ScholarDigital Library
Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. Comput. Netw. 33, 1--6, 277--293. Google ScholarDigital Library
Raj Jain. 1991. The Art of Computer Systems Performance Analysis. Wiley.Google Scholar
Myeongjae Jeon, Yuxiong He, Sameh Elnikety, Alan L. Cox, and Scott Rixner. 2013. Adaptive parallelism for web search. In Proceedings of the European Conference on Computer Systems. 155--168. Google ScholarDigital Library
Florian Leibert, Jake Mannix, Jimmy Lin, and Babak Hamadani. 2011. Automatic management of partitioned, replicated search services. In Proceedings of the ACM Symposium on Cloud Computing. 27:1--27:8. Google ScholarDigital Library
Ronnu Lempel, Yosi Mass, Shila Ofek-Koifman, Yael Petruschka, Dafna Sheinwald, and Ron Sivan. 2007. Just in time indexing for up to the second search. In Proceedings of the ACM Conference on Information and Knowledge Management. 97--106. Google ScholarDigital Library
Nicholas Lester, Alistair Moffat, and Justin Zobel. 2005. Fast on-line index construction by geometric partitioning. In Proceedings of the ACM Conference on Information and Knowledge Management. 776--783. Google ScholarDigital Library
Nicholas Lester, Alistair Moffat, and Justin Zobel. 2008. Efficient online index construction for text databases. ACM Trans. Database Syst. 33, 3, 1--33. Google ScholarDigital Library
Nicholas Lester, Justin Zobel, and Hugh Williams. 2006. Efficient online index maintenance for contiguous inverted lists. Inf. Process. Manag. 42, 4, 916--933. Google ScholarDigital Library
Nicholas Lester, Justin Zobel, and Hugh Williams. 2004. In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In Proceedings of the Australasian Computer Science Conference. 15--23. Google ScholarDigital Library
Andrew W. Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, and Ethan L. Miller. 2009. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proceedings of the USENIX Conference on File and Storage Technologies. 153--166. Google ScholarDigital Library
Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ramesh Agarwal. 2003. Dynamic maintenance of web indexes using landmarks. In Proceedings of the Conference on World Wide Web. 102--111. Google ScholarDigital Library
Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin C. Webb, and Keb Yocum. 2010. Stateful bulk processing for incremental analytics. In Proceedings of the ACM Symposium on Cloud Computing. 51--62. Google ScholarDigital Library
Mike Mammarella, Shant Hovsepian, and Eddie Kohler. 2009. Modular data storage with Anvil. In Proceedings of the ACM Symposium on Operating Systems Principles. 147--160. Google ScholarDigital Library
Giorgos Margaritis and Stergios V. Anastasiadis. 2009. Low-cost management of inverted files for online full-text search. In Proceedings of the ACM Conference on Information and Knowledge Management. 455--464. Google ScholarDigital Library
Michael Mccandless, Erik Hatcher, and Otis Gospodnetic. 2010. Lucene in Action. Manning Publications, Stamford, CT.Google Scholar
Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. 2001. Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19, 3, 217--241. Google ScholarDigital Library
Alexandros Ntoulas and Junghoo Cho. 2007. Pruning policies for two-tiered inverted index with correctness guarantee. In Proceedings of the 30^th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval. 191--198. Google ScholarDigital Library
R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Daniel Stodolsky, and Jim Zelenka. 1995. Informed prefetching and caching. In Proceedings of the ACM Symposium on Operating Systems Principles. 79--95. Google ScholarDigital Library
Daniel Peng and Frank Dabek. 2010. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation. 251--264. Google ScholarDigital Library
Martin F. Porter. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.Google ScholarCross Ref
Mohit Saxena, Michael M. Swift, and Yiying Zhang. 2012. FlashTier: A lightweight, consistent and durable storage cache. In Proceedings of the ACM European Conference on Computer Systems. 267--280. Google ScholarDigital Library
Sam Shah, Craig A. N. Soules, Gregory R. Ganger, and Brian D. Noble. 2007. Using provenance to aid in personal file search. In Proceedings of the USENIX Annual Technical Conference. 171--184. Google ScholarDigital Library
Trevor Strohman and W. Bruce Croft. 2007. Efficient document retrieval in main memory. In Proceedings of the 30^th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'07). 175--182. Google ScholarDigital Library
Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens. 1994. Incremental updates of inverted lists for text document retrieval. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 289--300. Google ScholarDigital Library
Trec. 2006. TREC terabyte track. National Institute of Standards and Technology. http://trec.nist.gov/data/terabyte.html.Google Scholar
Zheng Wei and Joseph Jaja. 2012. An optimized high-throughput strategy for constructing inverted files. IEEE Trans. Parallel Distrib. Syst. 23, 11, 2033--2044. Google ScholarDigital Library
Wikipedia. 2008. The wikipedia dataset. http://static.wikipedia.org/downloads/2008-06/en/.Google Scholar
Hugh E. Williams, Justin Zobel, and Dirk Bahle. 2004. Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. 22, 4, 573--594. Google ScholarDigital Library
Wumpus. 2011. Wumpus search engine. http://www.wumpus-search.org.Google Scholar
Zettair. 2009. The Zettair search engine. RMIT university. http://www.seg.rmit.edu.au/zettair/.Google Scholar
Mingjie Zhu, Shuming Shi, Nenghai Yu, and Ji-Rong Wen. 2008. Can phrase indexing help to process non-phrase queries. In Proceedings of the ACM Conference on Information and Knowledge Management. 679--688. Google ScholarDigital Library
Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. Comput. Surv. 38, 2, 6:1--6:56. Google ScholarDigital Library
Justin Zobel, Alistair Moffat, and Ron Sacks-Davis. 1993. Storage management for files of dynamic records. In Proceedings of the Australian Database Conference. 26--38.Google Scholar
Zoie. 2008. Zoie real-time search and indexing system built on Apache Lucene. http://code.google.com/p/zoie/wiki/ZoieMergePolicy.Google Scholar

Index Terms

Incremental Text Indexing for Fast Disk-Based Search
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
  2. Information storage systems
    1. Record storage systems
      1. Record storage alternatives
        Hashed file organization
        Indexed file organization

Recommendations

Inverted files versus signature files for text indexing

Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We ...
Read More
Low-cost management of inverted files for online full-text search
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

In dynamic environments with frequent content updates, we require online full-text search that scales to large data collections and achieves low search latency. Several recent methods that support fast incremental indexing of documents typically keep on ...
Read More
Higher reliability redundant disk arrays: Organization, operation, and coding

Parity is a popular form of data protection in redundant arrays of inexpensive/independent disks (RAID). RAID5 dedicates one out of N disks to parity to mask single disk failures, that is, the contents of a block on a failed disk can be reconstructed by ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on the Web Volume 8, Issue 3
June 2014
256 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2639948
Editor:
Marc Najork
Google
Issue’s Table of Contents
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 July 2014
- Accepted: 1 December 2013
- Revised: 1 November 2013
- Received: 1 September 2012
Published in tweb Volume 8, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Inverted files
online maintenance
performance evaluation
prototype implementation
search engines
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 2
  Total Citations
  View Citations
- 237
  Total Downloads
- Downloads (Last 12 months)5
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Incremental Text Indexing for Fast Disk-Based Search

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

Inverted files versus signature files for text indexing

Low-cost management of inverted files for online full-text search

Higher reliability redundant disk arrays: Organization, operation, and coding