ABSTRACT
We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to optimize the space required by the new index structures required by our system. Extensive performance studies with a 10GB, 2-million document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.
- S. Amer-Yahia, C. Botev, and J. Shanmugasundaram. TeXQuery: A full-text search extension to XQuery. In WWW Conference, pages 583--594, New York, 2004. Google ScholarDigital Library
- Apache Software Group. Jakarta Lucene text search engine. GPL Library, 2002.Google Scholar
- A. Balmin, V. Hristidis, and Y. Papakonstantinou. Authority-based keyword queries in databases using ObjectRank. In VLDB, Toronto, 2004.Google ScholarDigital Library
- M. J. Cafarella and O. Etzioni. A search engine for natural language applications. In WWW Conference, pages 442--452, 2005. Google ScholarDigital Library
- T. T. Chinenyanga and N. Kushmerick. An expressive and efficient language for XML information retrieval. JASIST, 53(6):438--453, 2002. Google ScholarDigital Library
- P. Cimiano, G. Ladwig, and S. Staab. Gimme' the context: Context-driven automatic semantic annotation with C-PANKOW. In WWW Conference, pages 332--341. ACM Press, 2005. Google ScholarDigital Library
- E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. SIAM Journal of Computing, 32(5):1338--1355, 2003. Google ScholarDigital Library
- H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. GATE: A TIPSTER-based general architecture for text engineering. In Proceedings of the TIPSTER Text Program (Phase III) 6 Month Workshop. Morgan-Kaufmann, 1997. Google ScholarDigital Library
- E. S. de Moura et al. Improving web search efficiency via a locality-based static pruning method. In WWW Conference, pages 235--244, Chiba, Japan, 2005. Google ScholarDigital Library
- S. Dill et al. SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation. In WWW Conference, 2003. Google ScholarDigital Library
- O. Etzioni, M. Cafarella, et al. Web-scale information extraction in KnowItAll. In WWW Conference, New York, 2004. ACM. Google ScholarDigital Library
- D. Florescu, D. Kossman, and I. Manolescu. Integrating keyword searches into XML query processing. In WWW Conference, pages 119--135, Amsterdam, 2000. Google ScholarDigital Library
- N. Fuhr and K. Grosjohann. XIRQL: A query language for information retrieval in XML documents. In Research and Development in Information Retrieval, pages 172--180, 2001. Google ScholarDigital Library
- J. Graupmann, M. Biwer, C. Zimmer, P. Zimmer, M. Bender, M. Theobald, and G. Weikum. COMPASS: A concept-based Web search engine for HTML, XML, and deep Web data. In VLDB, pages 1313--1316, 2004. Google ScholarDigital Library
- R. V. Guha and R. McCool. Tap: A semantic web test-bed. J. Web Sem., 1(1):81--87, 2003.Google ScholarCross Ref
- L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD Conference, pages 16--27, 2003. Google ScholarDigital Library
- R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks, pages 97--102, 1999.Google ScholarCross Ref
- Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD Conference, pages 268--277, 1991. Google ScholarDigital Library
- T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD Conference. ACM, 2002. Google ScholarDigital Library
- R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ICDE, page 829, 2004. Google ScholarDigital Library
- V. Krishnan, S. Das, and S. Chakrabarti. Enhanced answer type inference from questions using sequential models. In EMNLP/HLT, 2005. Google ScholarDigital Library
- D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3, (Ser. B)):503--528, 1989. Google ScholarDigital Library
- U. Manber and R. A. Baeza-Yates. An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 37(3):133--136, 1991. Google ScholarDigital Library
- C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999. Google ScholarDigital Library
- D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735--750, 2004. Google ScholarDigital Library
- G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An online lexical database. International Journal of Lexicography, 1993.Google Scholar
- S. Muthukrishnan. Efficient algorithms for document retrieval problems. In SODA, pages 657--666, 2002. Google ScholarDigital Library
- S. D. Richardson, W. B. Dolan, and L. Vanderwende. MindNet: Acquiring and structuring semantic information from text. In COLING, pages 1098--1102. ACL, 1998. Google ScholarDigital Library
- T. Suel and X. Long. Three-level caching for efficient query processing in large Web search engines. In WWW Conference, Chiba, Japan, 2005. Google ScholarDigital Library
- A. Theobald and G. Weikum. The XXL search engine: ranked retrieval of XML data using indexes and ontologies. In SIGMOD, page 615, 2002. Google ScholarDigital Library
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufmann, May 1999. Google ScholarDigital Library
Index Terms
- Optimizing scoring functions and indexes for proximity search in type-annotated corpora
Recommendations
Morphologically Annotated Amharic Text Corpora
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information RetrievalIn information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents because queries and documents do not need to use exactly the same word variant for the ...
Proximity Scoring Using Sentence-Based Inverted Index for Practical Full-Text Search
ECDL '08: Proceedings of the 12th European conference on Research and Advanced Technology for Digital LibrariesWe propose a search method that uses sentence-based inverted indexes to achieve high accuracy at practical speeds. The proposed method well supports the vast majority of queries entered on the web; these queries contain single words, multiple words for ...
Efficient term proximity search with term-pair indexes
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementThere has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can ...
Comments