ABSTRACT
There are two problems associated with conventional web crawling techniques: a crawler cannot know if all resources at a non-trivial web site have been discovered and crawled ("the counting problem"), and the human-readable format of the resources is not always suitable for machine processing ("the representation problem"). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and the MPEG-21 Digital Item Declaration Language (DIDL) in the web server itself. We present the Apache module "mod_oai", which can be used to address the counting problem by listing all valid URIs at a web server and efficiently discovering updates and additions on subsequent crawls. Our experiments indicated comparable performance for initial crawls and dramatic increases in update speed. mod_oai can also be used to address the representation problem by providing "preservation ready" versions of web resources aggregated with their respective forensic metadata in MPEG-21 DIDL format.
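To make the counting problem concrete, the following minimal sketch shows how a harvester might ask an OAI-PMH endpoint for every identifier it holds, or only for those changed since a previous crawl. The verb `ListIdentifiers` and the arguments `metadataPrefix`, `from`, and `resumptionToken` come from the OAI-PMH 2.0 specification. The base URL, the sample response, and the helper names are hypothetical and are not taken from mod_oai itself; in particular, the sketch assumes that identifiers are the URIs of the resources.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_list_identifiers_url(base_url, metadata_prefix="oai_dc",
                               from_date=None, resumption_token=None):
    """Build a ListIdentifiers request URL (hypothetical helper).

    from_date restricts the list to resources changed on or after that
    date, which is what makes an incremental update crawl cheap.
    """
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers",
                  "metadataPrefix": metadata_prefix}
        if from_date is not None:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token_or_None) from a response."""
    root = ET.fromstring(xml_text.strip())
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Illustrative response only; a real server returns more elements.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2006-01-15T12:00:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.example.org/papers/a.pdf</identifier>
      <datestamp>2006-01-16T08:30:00Z</datestamp>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""

if __name__ == "__main__":
    base = "http://www.example.org/modoai"  # hypothetical endpoint
    print(build_list_identifiers_url(base))
    print(build_list_identifiers_url(base, from_date="2006-01-01"))
    print(parse_identifiers(SAMPLE_RESPONSE))
```

A harvester would repeat the request with each returned `resumptionToken` until none is returned, at which point the list of identifiers is complete. On later crawls it would pass the date of the previous harvest as `from_date` to receive only updates and additions.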