skip to main content
10.1145/1321440.1321457acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Just in time indexing for up to the second search

Published: 06 November 2007 Publication History

Abstract

E-commerce and intranet search systems require newly arriving content to be indexed and made available for search within minutes or hours of arrival. Applications such as file system and email search demand even faster turnaround from search systems, requiring new content to become available for search almost instantaneously. However, incrementally updating inverted indices, which are the predominant datastructure used in search engines, is an expensive operation that most systems avoid performing at high rates.
We present JiTI, a Just-in-Time Indexing component that allows searching over incoming content (nearly) as soon as that content reaches the system. JiTI's main idea is to invest less in the preprocessing of arriving data, at the expense of a tolerable latency in query response time. It is designed for deployment in search systems that maintain a large main index and that rebuild smaller stop-press indices once or twice an hour. JiTI augments such systems with instant retrieval capabilities over content arriving in between the stop-press builds. A main design point is for JiTI to demand few computational resources, in particular RAM and I/O.
Our experiments consisted of injecting several documents and queries per second concurrently into the system over half-hour long periods. We believe that there are search applications for which the combination of the workloads we experimented with and the response times we measured present a viable solution to a pressing problem.

References

[1]
A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Transactions on Internet Technology, 1(1):2--43, 2001.
[2]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York, NY, 1999.
[3]
E. A. Brewer. Combining systems and databases: A search engine retrospective. In J. M. Hellerstein and M. Stonebraker, editors, Readings in Database Systems, Fourth edition. MIT Press, February 2005.
[4]
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. ACM SIGMOD international conference on Management of data, pages 255--264, May 1997.
[5]
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. 7th International WWW Conference, pages 107--117, 1998.
[6]
A. Broder, D. Carmel, M. Herscovichi, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In Twelfth International Conference on Information and Knowledge Management (CIKM 2003), New Orleans, LA, USA, pages 426--434, November 2003.
[7]
E. W. Brown, J. P. Callan, and W. B. Croft. Fast incremental indexing for full-text information retrieval. In Proc. 20th International Conference on Very Large Data Bases (VLDB), pages 192--202, 1994.
[8]
S. Chakrabarti. Mining the Web - Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[9]
T. Chiueh and L. Huang. Efficient real-time index updates in text retrieval systems. Technical Report ECSL Technical Report 66, Stony Brook University, August 1998.
[10]
J. Cho and H. García-Molina. The evolution of the web and implications for an incremental crawler. In Proc. 26th International Conference on Very Large Data Bases (VLDB2000), pages 200--209, 2000.
[11]
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw-Hill, 1990.
[12]
M. Fontoura, J. Zien, E. Shekita, S. Rajagopalan, and A. Neumann. High performance index build algorithms for intranet search engines. In Proc. 30th International Conference on Very Large Data Bases (VLDB 2004), pages 1158--1169. Morgan Kaufmann, August 2004.
[13]
A. S. Foundation. Apache lucene search library. http://lucene.apache.org/.
[14]
R. G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, 1996.
[15]
S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. JASIST, 54(8):713--729, 2003.
[16]
B. J. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management, 36(2):207--227, 2000.
[17]
B. T. Jóonsson, M. J. Franklin, and D. Srivastava. Interaction of query evaluation and buffer management for information retrieval. In Proc. ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, pages 118--129, June 1998.
[18]
R. Lempel and S. Moran. Predictive caching and prefetching of query results in search engines. In Proc. 12th World Wide Web Conference (WWW2003), Budapest, Hungary, pages 19--27, May 2003.
[19]
N. Lester, J. Zobel, and H. Williams. Efficient online index maintenance for contiguous inverted lists. Information Processing and Management, 42(4):916--933, July 2006.
[20]
M. Lifantsev and T. Chiueh. I/O-concious data preparation for large-scale web search engines. In Proc. 28th International Conference on Very Large Data Bases, Hong Kong, China, pages 382--393, 2002.
[21]
L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Agarwal. Dynamic maintenance of web indexes using landmarks. In Proc. 12th International World Wide Web Conference (WWW2003), pages 102--111, May 2003.
[22]
E. P. Markatos. On caching search engine query results. In Proc. 5th International Web Caching and Content Delivery Workshop, May 2000.
[23]
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. In Proc. 10th International WWW Conference, pages 396--406, May 2001.
[24]
S. Mitra, W. W. Hsu, and M. Winslett. Trustworthy keyword search for regulartory-compliant records retention. In Proc. 32nd International Conference on Very Large Data Bases (VLDB'2006), pages 1001--1012, September 2006.
[25]
W.-Y. Shieh and C.-P. Chung. A statistics-based approach to incrementally update inverted files. Information Processing and Management, 41(2):275--288, March 2005.
[26]
C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large altavista query log. Technical Report 1998-014, Compaq Systems Research Center, October 1998.
[27]
F. Silvestri. High Performance Issues in Web Search Engines: Algorithms and Techniques. PhD thesis, Dipartimento di Informatica, Università di Pisa, May 2004.
[28]
A. Tomasic, H. García-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. 1994 ACM SIGMOD international conference on Management of data, pages 289--300, 1994.
[29]
I. Witten, A. Moffat, and T. Bell. Managing Gigabytes. Morgan Kaufmann Publishers, Inc., San Francisco, CA, second edition, 1999.

Cited By

View all
  • (2022)A proposed conceptual framework for a representational approach to information retrievalACM SIGIR Forum10.1145/3527546.352755255:2(1-29)Online publication date: 17-Mar-2022
  • (2020)Local Variational Feature-Based Similarity Models for Recommending Top-N New ItemsACM Transactions on Information Systems10.1145/337215438:2(1-33)Online publication date: 11-Feb-2020
  • (2017)Partitioning and Segment Organization Strategies for Real-Time Selective Search on Document StreamsProceedings of the Tenth ACM International Conference on Web Search and Data Mining10.1145/3018661.3018727(221-230)Online publication date: 2-Feb-2017
  • Show More Cited By

Index Terms

  1. Just in time indexing for up to the second search

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
    November 2007
    1048 pages
    ISBN:9781595938039
    DOI:10.1145/1321440
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 November 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. incremental indexing
    2. inverted indices
    3. search engines

    Qualifiers

    • Research-article

    Conference

    CIKM07

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    CIKM '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)4
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 16 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)A proposed conceptual framework for a representational approach to information retrievalACM SIGIR Forum10.1145/3527546.352755255:2(1-29)Online publication date: 17-Mar-2022
    • (2020)Local Variational Feature-Based Similarity Models for Recommending Top-N New ItemsACM Transactions on Information Systems10.1145/337215438:2(1-33)Online publication date: 11-Feb-2020
    • (2017)Partitioning and Segment Organization Strategies for Real-Time Selective Search on Document StreamsProceedings of the Tenth ACM International Conference on Web Search and Data Mining10.1145/3018661.3018727(221-230)Online publication date: 2-Feb-2017
    • (2016)A Survey on Wireless Indoor Localization from the Device PerspectiveACM Computing Surveys10.1145/293323249:2(1-31)Online publication date: 30-Jun-2016
    • (2016)Efficient index updates for mixed update and query loads2016 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2016.7840697(984-991)Online publication date: Dec-2016
    • (2014)Bringing arbitrary compute to authoritative dataCommunications of the ACM10.1145/263078757:8(40-48)Online publication date: 1-Aug-2014
    • (2014)Incremental Text Indexing for Fast Disk-Based SearchACM Transactions on the Web10.1145/25608008:3(1-31)Online publication date: 8-Jul-2014
    • (2009)Low-cost management of inverted files for online full-text searchProceedings of the 18th ACM conference on Information and knowledge management10.1145/1645953.1646012(455-464)Online publication date: 2-Nov-2009
    • (2009)Highly scalable algorithm for distributed real-time text indexing2009 International Conference on High Performance Computing (HiPC)10.1109/HIPC.2009.5433193(332-341)Online publication date: Dec-2009
    • (2009)Performance optimizations for distributed real-time text indexing2009 International Conference on High Performance Computing (HiPC)10.1109/HIPC.2009.5433188(398-407)Online publication date: Dec-2009
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media