Article

Optimizing scoring functions and indexes for proximity search in type-annotated corpora

Authors:
Soumen Chakrabarti

IIT Bombay

IIT Bombay
View Profile

,
Kriti Puniyani

IIT Bombay

IIT Bombay
View Profile

,
Sujatha Das

IIT Bombay

IIT Bombay
View Profile

WWW '06: Proceedings of the 15th international conference on World Wide WebMay 2006Pages 717–726https://doi.org/10.1145/1135777.1135882

Published:23 May 2006Publication History

WWW '06: Proceedings of the 15th international conference on World Wide Web

Pages 717–726

ABSTRACT

We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to optimize the space required by the new index structures required by our system. Extensive performance studies with a 10GB, 2-million document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.

References

S. Amer-Yahia, C. Botev, and J. Shanmugasundaram. TeXQuery: A full-text search extension to XQuery. In WWW Conference, pages 583--594, New York, 2004. Google ScholarDigital Library
Apache Software Group. Jakarta Lucene text search engine. GPL Library, 2002.Google Scholar
A. Balmin, V. Hristidis, and Y. Papakonstantinou. Authority-based keyword queries in databases using ObjectRank. In VLDB, Toronto, 2004.Google ScholarDigital Library
M. J. Cafarella and O. Etzioni. A search engine for natural language applications. In WWW Conference, pages 442--452, 2005. Google ScholarDigital Library
T. T. Chinenyanga and N. Kushmerick. An expressive and efficient language for XML information retrieval. JASIST, 53(6):438--453, 2002. Google ScholarDigital Library
P. Cimiano, G. Ladwig, and S. Staab. Gimme' the context: Context-driven automatic semantic annotation with C-PANKOW. In WWW Conference, pages 332--341. ACM Press, 2005. Google ScholarDigital Library
E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. SIAM Journal of Computing, 32(5):1338--1355, 2003. Google ScholarDigital Library
H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. GATE: A TIPSTER-based general architecture for text engineering. In Proceedings of the TIPSTER Text Program (Phase III) 6 Month Workshop. Morgan-Kaufmann, 1997. Google ScholarDigital Library
E. S. de Moura et al. Improving web search efficiency via a locality-based static pruning method. In WWW Conference, pages 235--244, Chiba, Japan, 2005. Google ScholarDigital Library
S. Dill et al. SemTag and Seeker: Bootstrapping the semantic Web via automated semantic annotation. In WWW Conference, 2003. Google ScholarDigital Library
O. Etzioni, M. Cafarella, et al. Web-scale information extraction in KnowItAll. In WWW Conference, New York, 2004. ACM. Google ScholarDigital Library
D. Florescu, D. Kossman, and I. Manolescu. Integrating keyword searches into XML query processing. In WWW Conference, pages 119--135, Amsterdam, 2000. Google ScholarDigital Library
N. Fuhr and K. Grosjohann. XIRQL: A query language for information retrieval in XML documents. In Research and Development in Information Retrieval, pages 172--180, 2001. Google ScholarDigital Library
J. Graupmann, M. Biwer, C. Zimmer, P. Zimmer, M. Bender, M. Theobald, and G. Weikum. COMPASS: A concept-based Web search engine for HTML, XML, and deep Web data. In VLDB, pages 1313--1316, 2004. Google ScholarDigital Library
R. V. Guha and R. McCool. Tap: A semantic web test-bed. J. Web Sem., 1(1):81--87, 2003.Google ScholarCross Ref
L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD Conference, pages 16--27, 2003. Google ScholarDigital Library
R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks, pages 97--102, 1999.Google ScholarCross Ref
Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD Conference, pages 268--277, 1991. Google ScholarDigital Library
T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD Conference. ACM, 2002. Google ScholarDigital Library
R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ICDE, page 829, 2004. Google ScholarDigital Library
V. Krishnan, S. Das, and S. Chakrabarti. Enhanced answer type inference from questions using sequential models. In EMNLP/HLT, 2005. Google ScholarDigital Library
D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3, (Ser. B)):503--528, 1989. Google ScholarDigital Library
U. Manber and R. A. Baeza-Yates. An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 37(3):133--136, 1991. Google ScholarDigital Library
C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999. Google ScholarDigital Library
D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735--750, 2004. Google ScholarDigital Library
G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An online lexical database. International Journal of Lexicography, 1993.Google Scholar
S. Muthukrishnan. Efficient algorithms for document retrieval problems. In SODA, pages 657--666, 2002. Google ScholarDigital Library
S. D. Richardson, W. B. Dolan, and L. Vanderwende. MindNet: Acquiring and structuring semantic information from text. In COLING, pages 1098--1102. ACL, 1998. Google ScholarDigital Library
T. Suel and X. Long. Three-level caching for efficient query processing in large Web search engines. In WWW Conference, Chiba, Japan, 2005. Google ScholarDigital Library
A. Theobald and G. Weikum. The XXL search engine: ranked retrieval of XML data using indexes and ontologies. In SIGMOD, page 615, 2002. Google ScholarDigital Library
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufmann, May 1999. Google ScholarDigital Library

Index Terms

Optimizing scoring functions and indexes for proximity search in type-annotated corpora
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval

Recommendations

Morphologically Annotated Amharic Text Corpora
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents because queries and documents do not need to use exactly the same word variant for the ...
Read More
Proximity Scoring Using Sentence-Based Inverted Index for Practical Full-Text Search
ECDL '08: Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries

We propose a search method that uses sentence-based inverted indexes to achieve high accuracy at practical speeds. The proposed method well supports the vast majority of queries entered on the web; these queries contain single words, multiple words for ...
Read More
Efficient term proximity search with term-pair indexes
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

There has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN:1595933239
DOI:10.1145/1135777
General Chairs:
Leslie Carr
University of Southampton
,
David De Roure
University of Southampton
,
Arun Iyengar
IBM Research
,
Program Chairs:
Carole Goble
University of Manchester, UK
,
Mike Dahlin
University of Texas at Austin
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 May 2006
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
indexing annotated text
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Upcoming Conference
WWW '24

Sponsor:

sigweb

The ACM Web Conference 2024

May 13 - 17, 2024

Singapore , Singapore
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 43
  Total Citations
  View Citations
- 712
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Optimizing scoring functions and indexes for proximity search in type-annotated corpora

WWW '06: Proceedings of the 15th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Morphologically Annotated Amharic Text Corpora

Proximity Scoring Using Sentence-Based Inverted Index for Practical Full-Text Search

Efficient term proximity search with term-pair indexes