DOI: 10.1145/2428736.2428774
research-article

Size estimation of non-cooperative data collections

Published: 03 December 2012

Abstract

As the amount of data in deep web sources (hidden from general search engines behind web forms) grows, accessing this data has gained more attention. Algorithms designed for this purpose rely on knowledge of a data source's size to decide accurately when to stop crawling or sampling, processes that can be very costly in some cases [14]. Interest in the sizes of data sources is further increased by competition among businesses on the Web, where data coverage is critical. This information is also helpful for quality assessment of search engines [7], for search engine selection in federated search, and for resource/collection selection in distributed search [19]. In addition, it can give insight into useful statistics for public-sector bodies such as governments. In any of these scenarios, when a collection is non-cooperative and does not publish its information, its size has to be estimated [17]. In this paper, the approaches suggested in the literature for this purpose are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. With one of the modifications, the estimates from other approaches could be improved by 35 to 65 percent.
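
As a rough illustration of what estimating the size of a collection from samples can look like, the sketch below uses the classical two-sample capture-recapture (Lincoln-Petersen) estimator, a technique from the literature cited here (e.g. [1], [15], [17]). It is not the method proposed in this paper. The function name, the toy collection of 10,000 document ids, and the assumption that both samples are drawn uniformly at random are all hypothetical; samples obtained through a real search interface are typically not uniform, which is one source of estimation bias.

```python
import random


def lincoln_petersen(sample_a, sample_b):
    """Estimate collection size from two independent samples of document ids.

    Uses N ~ |A| * |B| / |A intersect B|, which assumes every document is
    equally likely to appear in each sample (an idealized assumption).
    """
    first, second = set(sample_a), set(sample_b)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("samples share no documents; estimate is undefined")
    return len(first) * len(second) / overlap


# Hypothetical collection of 10,000 document ids, sampled uniformly at random.
rng = random.Random(0)
collection = range(10_000)
sample_a = rng.sample(collection, 500)
sample_b = rng.sample(collection, 500)
estimate = lincoln_petersen(sample_a, sample_b)
print(f"estimated size: {estimate:.0f} (true size: {len(collection)})")
```

With small samples the overlap is small, so the estimate varies considerably from run to run; larger samples or repeated sampling reduce that variance at the cost of more queries.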

References

[1]
Amstrup, S., McDonald, T., and Manly, B. F. Handbook of Capture-Recapture Analysis. Princeton University Press, Princeton, NJ, Oct. 2005.
[2]
Anagnostopoulos, A., Broder, A. Z., and Carmel, D. Sampling search-engine results. In WWW '05: Proceedings of the 14th international conference on World Wide Web (New York, NY, USA, 2005), ACM Press, pp. 245--256.
[3]
Bar-Yossef, Z., and Gurevich, M. Random sampling from a search engine's index. In Proceedings of the 15th international conference on World Wide Web (New York, NY, USA, 2006), WWW '06, ACM, pp. 367--376.
[4]
Bar-Yossef, Z., and Gurevich, M. Efficient search engine measurements. Proceedings of the 16th international conference on World Wide Web (2007), 401--410.
[5]
Bar-Yossef, Z., and Gurevich, M. Efficient search engine measurements. ACM Trans. Web 5, 4 (Oct. 2011), 18:1--18:48.
[6]
Bharat, K., and Broder, A. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst. 30 (April 1998), 379--388.
[7]
Broder, A. Z., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S. U., Panigrahy, R., Tomkins, A., and Xu, Y. Estimating corpus size via queries. In CIKM (2006), pp. 594--603.
[8]
Callan, J. P., and Connell, M. E. Query-based sampling of text databases. ACM Trans. Inf. Syst. 19, 2 (2001), 97--130.
[9]
Dasgupta, A., Jin, X., Jewell, B., Zhang, N., and Das, G. Unbiased estimation of size and other aggregates over hidden web databases. In Proceedings of the 2010 international conference on Management of data (New York, NY, USA, 2010), SIGMOD '10, ACM, pp. 855--866.
[10]
dmoz. The Open Directory Project. http://dmoz.org.
[11]
Geiger, D., Heckerman, D., and Meek, C. Introduction to monte carlo methods. In Learning in graphical models, M. Jordan, Ed. Kluwer, 1998, pp. 175--289.
[12]
Gulli, A., and Signorini, A. The indexable web is more than 11.5 billion pages. In Special interest tracks and posters of the 14th international conference on World Wide Web (New York, NY, USA, 2005), WWW '05, ACM, pp. 902--903.
[13]
Kern, J. C. An introduction to regression analysis. American Statistician 61, 1 (2007), 101--101.
[14]
Lu, J. Ranking bias in deep web size estimation using capture recapture method. Data Knowl. Eng. 69, 8 (Aug. 2010), 866--879.
[15]
Lu, J., and Li, D. Estimating deep web data source size by capture-recapture method. Inf. Retr. 13, 1 (Feb. 2010), 70--95.
[16]
Liu, K.-L., Yu, C., and Meng, W. Discovering the representative of a search engine. In Proc. CIKM (2001).
[17]
Shokouhi, M., Zobel, J., Scholer, F., and Tahaghoghi, S. M. M. Capturing collection size for distributed non-cooperative retrieval. In SIGIR (2006), pp. 316--323.
[18]
Thomas, P. Generalising multiple capture-recapture to non-uniform sample sizes. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2008), SIGIR '08, ACM, pp. 839--840.
[19]
Xu, J., Wu, S., and Li, X. Estimating collection size with logistic regression. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007), SIGIR '07, ACM, pp. 789--790.

      Published In

      IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
      December 2012
      432 pages
      ISBN:9781450313063
      DOI:10.1145/2428736
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      • @WAS: International Organization of Information Integration and Web-based Applications and Services

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. deep web
      2. estimation bias
      3. pool-based size estimation
      4. query-based sampling
      5. regression equations
      6. size estimation
      7. stochastic simulation

      Qualifiers

      • Research-article

      Conference

      IIWAS '12
      Sponsor:
      • @WAS

      Cited By

      • (2017) Federated Patent Search. Current Challenges in Patent Information Retrieval, 213-240. DOI: 10.1007/978-3-662-53817-3_8. Online publication date: 26-Mar-2017
      • (2016) Efficient web harvesting strategies for monitoring deep web content. Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, 389-393. DOI: 10.1145/3011141.3011198. Online publication date: 28-Nov-2016
      • (2016) Guest Editorial: Special Section on the International Conference on Data Engineering. IEEE Transactions on Knowledge and Data Engineering 28:2, 295-296. DOI: 10.1109/TKDE.2015.2495958. Online publication date: 1-Feb-2016
      • (2016) Efficiently Estimating Statistics of Points of Interests on Maps. IEEE Transactions on Knowledge and Data Engineering 28:2, 425-438. DOI: 10.1109/TKDE.2015.2480397. Online publication date: 1-Feb-2016
      • (2016) Estimating search engine index size variability. Scientometrics 107:2, 839-856. DOI: 10.1007/s11192-016-1863-z. Online publication date: 1-May-2016
      • (2015) Towards complete coverage in focused web harvesting. Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services, 1-9. DOI: 10.1145/2837185.2837208. Online publication date: 11-Dec-2015
      • (2014) Theoretical, Qualitative, and Quantitative Analyses of Small-Document Approaches to Resource Selection. ACM Transactions on Information Systems 32:2, 1-37. DOI: 10.1145/2590975. Online publication date: 1-Apr-2014
