Efficient, automatic web resource harvesting
Source: Workshop on Web Information and Data Management
Proceedings of the 8th annual ACM international workshop on Web information and data management
Arlington, Virginia, USA
SESSION: Web ranking and classification
Pages: 43 - 50  
Year of Publication: 2006
ISBN:1-59593-525-8
Authors
Michael L. Nelson  Old Dominion University, Norfolk VA
Joan A. Smith  Old Dominion University, Norfolk VA
Ignacio Garcia del Campo  Old Dominion University, Norfolk VA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183550.1183560

ABSTRACT

There are two problems associated with conventional web crawling techniques: a crawler cannot know if all resources at a non-trivial web site have been discovered and crawled ("the counting problem"), and the human-readable format of the resources is not always suitable for machine processing ("the representation problem"). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 Digital Item Declaration Language (DIDL) in the web server itself. We present the Apache module "mod_oai", which can be used to address the counting problem by listing all valid URIs at a web server and efficiently discovering updates and additions on subsequent crawls. Our experiments indicated comparable performance for initial crawls, and dramatic increases in update speed. mod_oai can also be used to address the representation problem by providing "preservation ready" versions of web resources aggregated with their respective forensic metadata in MPEG-21 DIDL format.
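
To make the counting problem concrete, the following is a minimal sketch of how a harvester might ask an OAI-PMH endpoint for every identifier at a site, and later only for those changed since a given date. It is not taken from the paper: the base URL path `/modoai`, the example host, and the sample response are assumptions made for illustration. The `ListIdentifiers` verb and the `metadataPrefix`, `from`, and `resumptionToken` arguments come from the OAI-PMH 2.0 protocol, as does the `oai_dc` prefix.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by OAI-PMH 2.0 for protocol responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_harvest_url(base_url, metadata_prefix="oai_dc",
                      from_date=None, resumption_token=None):
    """Build a ListIdentifiers request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, only the verb may accompany it.
    """
    params = {"verb": "ListIdentifiers"}
    if resumption_token is not None:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
        if from_date is not None:
            # Restrict the list to resources changed on or after this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty resumptionToken element marks the final page of the list.
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response, invented for this example.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2006-05-01</datestamp>
    </header>
    <header>
      <identifier>http://www.example.org/papers/a.pdf</identifier>
      <datestamp>2006-05-03</datestamp>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    # Hypothetical endpoint; the real path depends on server configuration.
    base = "http://www.example.org/modoai"
    print(build_harvest_url(base))                          # initial crawl
    print(build_harvest_url(base, from_date="2006-05-01"))  # update crawl
    print(parse_identifiers(SAMPLE))
```

A harvester would keep following resumption tokens until none is returned, at which point it has seen the complete list the server reports. On later visits, the `from` argument limits the response to resources modified since the previous crawl.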


