skip to main content
10.1145/1135777.1135882acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

Optimizing scoring functions and indexes for proximity search in type-annotated corpora

Published:23 May 2006Publication History

ABSTRACT

We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to optimize the space required by the new index structures required by our system. Extensive performance studies with a 10GB, 2-million document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.

References

  1. S. Amer-Yahia, C. Botev, and J. Shanmugasundaram. TeXQuery: A full-text search extension to XQuery. In WWW Conference, pages 583--594, New York, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Apache Software Group. Jakarta Lucene text search engine. GPL Library, 2002.Google ScholarGoogle Scholar
  3. A. Balmin, V. Hristidis, and Y. Papakonstantinou. Authority-based keyword queries in databases using ObjectRank. In VLDB, Toronto, 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. J. Cafarella and O. Etzioni. A search engine for natural language applications. In WWW Conference, pages 442--452, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. T. Chinenyanga and N. Kushmerick. An expressive and efficient language for XML information retrieval. JASIST, 53(6):438--453, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Cimiano, G. Ladwig, and S. Staab. Gimme' the context: Context-driven automatic semantic annotation with C-PANKOW. In WWW Conference, pages 332--341. ACM Press, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. SIAM Journal of Computing, 32(5):1338--1355, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. GATE: A TIPSTER-based general architecture for text engineering. In Proceedings of the TIPSTER Text Program (Phase III) 6 Month Workshop. Morgan-Kaufmann, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. S. de Moura et al. Improving web search efficiency via a locality-based static pruning method. In WWW Conference, pages 235--244, Chiba, Japan, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Dill et al. SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation. In WWW Conference, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. O. Etzioni, M. Cafarella, et al. Web-scale information extraction in KnowItAll. In WWW Conference, New York, 2004. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Florescu, D. Kossman, and I. Manolescu. Integrating keyword searches into XML query processing. In WWW Conference, pages 119--135, Amsterdam, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. N. Fuhr and K. Grosjohann. XIRQL: A query language for information retrieval in XML documents. In Research and Development in Information Retrieval, pages 172--180, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Graupmann, M. Biwer, C. Zimmer, P. Zimmer, M. Bender, M. Theobald, and G. Weikum. COMPASS: A concept-based Web search engine for HTML, XML, and deep Web data. In VLDB, pages 1313--1316, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. V. Guha and R. McCool. Tap: A semantic web test-bed. J. Web Sem., 1(1):81--87, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  16. L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD Conference, pages 16--27, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks, pages 97--102, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  18. Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD Conference, pages 268--277, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD Conference. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ICDE, page 829, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. V. Krishnan, S. Das, and S. Chakrabarti. Enhanced answer type inference from questions using sequential models. In EMNLP/HLT, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3, (Ser. B)):503--528, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. U. Manber and R. A. Baeza-Yates. An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 37(3):133--136, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735--750, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An online lexical database. International Journal of Lexicography, 1993.Google ScholarGoogle Scholar
  27. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In SODA, pages 657--666, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. S. D. Richardson, W. B. Dolan, and L. Vanderwende. MindNet: Acquiring and structuring semantic information from text. In COLING, pages 1098--1102. ACL, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. T. Suel and X. Long. Three-level caching for efficient query processing in large Web search engines. In WWW Conference, Chiba, Japan, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. A. Theobald and G. Weikum. The XXL search engine: ranked retrieval of XML data using indexes and ontologies. In SIGMOD, page 615, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufmann, May 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Optimizing scoring functions and indexes for proximity search in type-annotated corpora

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            WWW '06: Proceedings of the 15th international conference on World Wide Web
            May 2006
            1102 pages
            ISBN:1595933239
            DOI:10.1145/1135777

            Copyright © 2006 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 23 May 2006

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate1,899of8,196submissions,23%

            Upcoming Conference

            WWW '24
            The ACM Web Conference 2024
            May 13 - 17, 2024
            Singapore , Singapore

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader