DOI: 10.1145/1065385.1065455

What's there and what's not?: focused crawling for missing documents in digital libraries

Published: 07 June 2005

Abstract

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from specific publishing venues. We propose to use alternative online resources, and techniques that maximally exploit them, to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over the period 1998 to 2004. We then identify the papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that locates authors' homepages and then uses focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation based on harvest rate shows that the crawler has definite advantages over other crawling approaches and consistently outperforms a defined baseline crawler on a number of measures.
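
The two measures named in the abstract, recall and harvest rate, can be illustrated with a minimal Python sketch. The abstract does not define them, so the definitions below are the commonly used ones and are an assumption here: recall as the fraction of wanted papers that were retrieved, and harvest rate as the fraction of crawled pages judged relevant. The function names and the sample identifiers are hypothetical and do not come from the paper.

```python
def recall(found: set[str], wanted: set[str]) -> float:
    """Fraction of the wanted documents that the crawler retrieved.

    Assumed definition: |found ∩ wanted| / |wanted|.
    """
    if not wanted:
        return 0.0
    return len(found & wanted) / len(wanted)


def harvest_rate(relevant_flags: list[bool]) -> float:
    """Fraction of crawled pages that were judged relevant.

    Assumed definition: relevant pages / total pages crawled.
    """
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


# Hypothetical example: four papers are wanted, the crawler retrieves
# two of them plus one unrelated document.
wanted_papers = {"p1", "p2", "p3", "p4"}
retrieved = {"p1", "p2", "p9"}
example_recall = recall(retrieved, wanted_papers)

# Hypothetical example: four pages were crawled, three judged relevant.
example_harvest = harvest_rate([True, False, True, True])
```

Under these assumed definitions, a recall of 0.75 for missing documents would mean that three out of every four papers absent from the library were recovered by the crawl.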


      Published In

      JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
      June 2005
      450 pages
      ISBN: 1581138768
      DOI: 10.1145/1065385
      General Chair: Mary Marlino
      Program Chairs: Tamara Sumner, Frank Shipman
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. ACM
      2. CiteSeer
      3. DBLP
      4. digital libraries
      5. focused crawler
      6. harvesting

      Qualifiers

      • Article

      Conference

      JCDL05

      Acceptance Rates

      Overall Acceptance Rate 415 of 1,482 submissions, 28%

