ABSTRACT
The proliferation of searchable text databases on local area networks and the Internet creates the problem of finding information that may be distributed among many disjoint text databases (distributed information retrieval). How to merge the results returned by selected databases is an important subproblem of the distributed information retrieval task. Previous research assumed either that resource providers cooperate to provide normalizing statistics or that search clients download all retrieved documents and compute normalized scores without cooperation from resource providers. This article presents a semisupervised learning solution to the result merging problem. The key contribution is the observation that information used to create resource descriptions for resource selection can also be used to create a centralized sample database to guide the normalization of document scores returned by different databases. At retrieval time, the query is sent to the selected databases, which return database-specific document scores, and to a centralized sample database, which returns database-independent document scores. Documents that have both a database-specific score and a database-independent score serve as training data for learning to normalize the scores of other documents. An extensive set of experiments demonstrates that this method is more effective than the well-known CORI result-merging algorithm under a variety of conditions.
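The normalization step described above can be illustrated with a minimal sketch. The abstract does not state which model is learned, so the code below assumes a per-database linear mapping fit by ordinary least squares; the function names `fit_linear_map` and `merge_results`, and the dictionary-based inputs, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def fit_linear_map(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x + b over (x, y) pairs.

    x is a database-specific score, y is the database-independent score
    of the same document from the centralized sample database.
    A linear model is an assumption of this sketch.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two overlap documents")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        raise ValueError("database-specific scores are all identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var_x
    b = mean_y - a * mean_x
    return a, b


def merge_results(
    db_results: Dict[str, Dict[str, float]],
    sample_scores: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Merge per-database rankings into one list of (doc_id, score).

    db_results maps a database name to {doc_id: database-specific score}.
    sample_scores maps doc_id to its database-independent score for the
    documents that also appear in the centralized sample database.
    """
    merged: List[Tuple[str, float]] = []
    for docs in db_results.values():
        # Documents with both kinds of score act as training data.
        pairs = [(s, sample_scores[d]) for d, s in docs.items()
                 if d in sample_scores]
        a, b = fit_linear_map(pairs)
        # Apply the learned mapping to every document from this database.
        merged.extend((d, a * s + b) for d, s in docs.items())
    return sorted(merged, key=lambda item: item[1], reverse=True)
```

Calling `merge_results` with the scores returned by each selected database and the scores returned by the sample database yields a single ranked list on a common scale. A database with fewer than two overlap documents raises an error here; how such cases are handled in practice is left open by this sketch.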