DOI: 10.1145/1277741.1277828
Article

Evaluating sampling methods for uncooperative collections

Published: 23 July 2007

Abstract

Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters. We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.
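
The length bias mentioned in the abstract can be illustrated with a toy simulation. The sketch below is not one of the paper's methods or experiments: the synthetic collection, the vocabulary size, and the helper names (`build_collection`, `search`, `query_sample`) are all assumptions made for illustration. It samples documents by issuing random single-term queries and picking one matching document per query, then compares the mean document length of the sample with that of the whole collection.

```python
import random

def build_collection(num_docs, vocab_size, rng):
    """Toy collection (hypothetical): each document is a set of term ids
    built from 5 to 200 random term draws, so document lengths vary widely."""
    docs = []
    for _ in range(num_docs):
        length = rng.randint(5, 200)
        docs.append({rng.randrange(vocab_size) for _ in range(length)})
    return docs

def search(docs, term):
    """Stand-in for a search interface: ids of all documents containing the term.
    Assumption: no ranking and no result cutoff, unlike a real engine."""
    return [i for i, doc in enumerate(docs) if term in doc]

def query_sample(docs, vocab_size, sample_size, rng):
    """Sample documents using only the query-response mechanism:
    issue a random one-term query, then keep one random matching document."""
    sample = []
    while len(sample) < sample_size:
        hits = search(docs, rng.randrange(vocab_size))
        if hits:  # skip queries that return nothing
            sample.append(rng.choice(hits))
    return sample

def mean_length(docs, ids):
    """Mean number of distinct terms over the given document ids."""
    ids = list(ids)
    return sum(len(docs[i]) for i in ids) / len(ids)

rng = random.Random(0)  # fixed seed so the run is repeatable
VOCAB_SIZE = 5000
docs = build_collection(2000, VOCAB_SIZE, rng)
sampled = query_sample(docs, VOCAB_SIZE, 1000, rng)

collection_mean = mean_length(docs, range(len(docs)))
sample_mean = mean_length(docs, sampled)
print(f"collection mean length: {collection_mean:.1f}")
print(f"sample mean length:     {sample_mean:.1f}")
```

In this toy setting a document containing more distinct terms matches more of the possible queries, so it is more likely to be drawn, and the sample's mean length comes out above the collection's. Real search facilities rank and truncate their results, which this sketch deliberately ignores.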

    Published In

    SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
    July 2007
    946 pages
    ISBN:9781595935977
    DOI:10.1145/1277741
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. distributed information retrieval
    2. random sampling

    Conference

    SIGIR07: The 30th Annual International SIGIR Conference
    July 23 - 27, 2007
    Amsterdam, The Netherlands

    Acceptance Rates

    Overall Acceptance Rate 660 of 3,291 submissions, 20%
