Article

What's there and what's not?: focused crawling for missing documents in digital libraries

Authors:

C. Lee Giles

JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries

Pages 301 - 310

https://doi.org/10.1145/1065385.1065455

Published: 07 June 2005

Abstract

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives and university and author homepages, and by accepting authors' self-submissions. While these approaches have so far built reasonably sized libraries, they often capture only a portion of the documents from a specific publishing venue. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue.

We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over the period 1998 to 2004. We then identify the papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that locates authors' homepages and then uses focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation based on harvest rate shows definite advantages over other crawling approaches, and the crawler consistently outperforms a defined baseline crawler on a number of measures.
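The abstract does not describe the heuristics or the exact metric definitions, so the following is only a minimal sketch of the general idea. It assumes a toy in-memory link graph, a made-up anchor-text heuristic (`HINTS`), and the common definitions of recall (wanted papers found divided by wanted papers) and harvest rate (wanted papers found divided by pages fetched). All page names, paper names, and function names are hypothetical and are not taken from the paper.

```python
import heapq

# Toy in-memory "web": page -> list of (anchor text, target page).
# Every name below is invented for illustration.
TOY_WEB = {
    "home": [("Publications", "pubs"), ("Teaching", "teaching"), ("Photos", "photos")],
    "pubs": [("paper-a.pdf", "paper-a.pdf"), ("paper-b.pdf", "paper-b.pdf")],
    "teaching": [("Lecture notes", "notes"), ("Syllabus", "syllabus")],
    "photos": [("Album", "album")],
    "notes": [("paper-c.pdf", "paper-c.pdf")],
}

# Papers assumed to be missing from the library (hypothetical).
WANTED = {"paper-a.pdf", "paper-b.pdf", "paper-c.pdf"}

# Assumed anchor-text hints that a link leads towards papers.
HINTS = ("publication", "paper", "research", ".pdf")


def link_score(anchor_text):
    """Return 1 if the anchor text contains any hint, else 0."""
    text = anchor_text.lower()
    return 1 if any(hint in text for hint in HINTS) else 0


def crawl(seed, wanted, budget, focused=True):
    """Fetch at most `budget` pages starting at `seed`.

    With focused=True, links with a higher score are fetched first.
    With focused=False, every link gets the same priority, so the
    insertion counter makes this a breadth-first baseline.
    Returns (set of wanted pages found, number of pages fetched).
    """
    frontier = [(0, 0, seed)]  # (priority, insertion order, page)
    seen = {seed}
    counter = 1
    found, fetched = set(), 0
    while frontier and fetched < budget:
        _, _, page = heapq.heappop(frontier)
        fetched += 1
        if page in wanted:
            found.add(page)
        for anchor, target in TOY_WEB.get(page, []):
            if target not in seen:
                seen.add(target)
                priority = -link_score(anchor) if focused else 0
                heapq.heappush(frontier, (priority, counter, target))
                counter += 1
    return found, fetched


def recall(found, wanted):
    """Fraction of the wanted papers that were found."""
    return len(found) / len(wanted)


def harvest_rate(found, fetched):
    """Fraction of fetched pages that were wanted papers."""
    return len(found) / fetched


if __name__ == "__main__":
    for focused in (True, False):
        found, fetched = crawl("home", WANTED, budget=5, focused=focused)
        label = "focused" if focused else "baseline"
        print(label, round(recall(found, WANTED), 2), round(harvest_rate(found, fetched), 2))
```

On this toy graph with a budget of five fetches, the focused ordering reaches two of the three wanted papers (recall 0.67, harvest rate 0.40), while the breadth-first baseline reaches one (recall 0.33, harvest rate 0.20). The numbers only illustrate how the two measures respond to link ordering and say nothing about the reported results.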

Index Terms

What's there and what's not?: focused crawling for missing documents in digital libraries
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published In

JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries

June 2005

450 pages

ISBN: 1581138768

DOI: 10.1145/1065385

General Chair: Mary Marlino, DLESE Program Center, University Corporation for Atmospheric Research (UCAR)

Program Chairs: Tamara Sumner, University of Colorado at Boulder; Frank Shipman, Texas A&M University

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

JCDL05

JCDL05: Joint Conference on Digital Libraries 2005

June 7 - 11, 2005

Denver, CO, USA

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%

Bibliometrics

Total Citations: 25
Total Downloads: 621
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 1

Reflects downloads up to 05 Mar 2025

Cited By

Shamsi A, Silva R, Wang T, Raju N, Santos-d’Amorim K (2022). A grey zone for bibliometrics: publications indexed in Web of Science as anonymous. Scientometrics 127:10 (5989-6009). DOI: 10.1007/s11192-022-04494-4. Online publication date: 12-Aug-2022
https://doi.org/10.1007/s11192-022-04494-4
Gupta S, Duhan N, Bansal P (2019). An Approach for Focused Crawler to Harvest Digital Academic Documents in Online Digital Libraries. International Journal of Information Retrieval Research 9:3 (23-47). DOI: 10.4018/IJIRR.2019070103. Online publication date: Jul-2019
https://doi.org/10.4018/IJIRR.2019070103
Nwala A, Weigle M, Nelson M, Chen J, Gonçalves M, Allen J, Fox E, Kan M, Petras V (2018). Scraping SERPs for Archival Seeds. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (263-272). DOI: 10.1145/3197026.3197056. Online publication date: 23-May-2018
https://dl.acm.org/doi/10.1145/3197026.3197056
Khan S, Liu X, Shakil K, Alam M (2017). A survey on scholarly data. Information Processing and Management: an International Journal 53:4 (923-944). DOI: 10.1016/j.ipm.2017.03.006. Online publication date: 1-Jul-2017
https://dl.acm.org/doi/10.1016/j.ipm.2017.03.006
Kumar M, Bhatia R, Rattan D (2017). A survey of Web crawlers for information retrieval. WIREs Data Mining and Knowledge Discovery 7:6. DOI: 10.1002/widm.1218. Online publication date: 7-Aug-2017
https://doi.org/10.1002/widm.1218
Vieira K, Barbosa L, Silva A, Freire J, Moura E (2016). Finding seeds to bootstrap focused crawlers. World Wide Web 19:3 (449-474). DOI: 10.1007/s11280-015-0331-7. Online publication date: 1-May-2016
https://dl.acm.org/doi/10.1007/s11280-015-0331-7
Ganguly B, Raich D (2014). Performance Optimization of Focused Web Crawling Using Content Block Segmentation. Proceedings of the 2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies (365-370). DOI: 10.1109/ICESC.2014.69. Online publication date: 9-Jan-2014
https://dl.acm.org/doi/10.1109/ICESC.2014.69
A. Pedersen L, Arendt J (2014). Decrease in free computer science papers found through Google Scholar. Online Information Review 38:3 (348-361). DOI: 10.1108/OIR-07-2013-0159. Online publication date: 29-Apr-2014
https://doi.org/10.1108/OIR-07-2013-0159
Alnoamany Y, Alsum A, Weigle M, Nelson M (2014). Who and what links to the Internet Archive. International Journal on Digital Libraries 14:3-4 (101-115). DOI: 10.1007/s00799-014-0111-5. Online publication date: 1-Aug-2014
https://dl.acm.org/doi/10.1007/s00799-014-0111-5
Xu S, Yoon H, Tourassi G (2013). A user-oriented web crawler for selectively acquiring online content in e-health research. Bioinformatics 30:1 (104-114). DOI: 10.1093/bioinformatics/btt571. Online publication date: 29-Sep-2013
https://doi.org/10.1093/bioinformatics/btt571