DOI: 10.1145/1557914.1557921
research-article

Bringing your dead links back to life: a comprehensive approach and lessons learned

Published: 29 June 2009

Abstract

This paper presents an experimental study of the automatic correction of broken (dead) Web links, focusing in particular on links broken by the relocation of Web pages. Our first contribution is an algorithm that incorporates a comprehensive set of heuristics, some of which are novel, in a single unified framework. Our second contribution is a relatively large-scale experiment; analysis of its results revealed the characteristics of the problem of finding moved Web pages. We demonstrated empirically that the problem of searching for moved pages is different from typical information retrieval problems. First, it is impossible to identify the final destination until the page is moved, so the index-server approach is not necessarily effective. Second, there is a large bias in where the new address is likely to be, so crawler-based solutions can be implemented effectively, avoiding the need to search the entire Web. We analyzed the experimental results in detail to show how important each heuristic is in real Web settings, and conducted statistical analyses to show that our algorithm succeeds in correctly finding new links for more than 70% of broken links at the 95% confidence level.
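
The closing claim (more than 70% of broken links at the 95% confidence level) has the form of a lower confidence bound on a success proportion. The abstract does not say which statistical procedure the authors used, so the following is only a minimal sketch of one common choice, a one-sided Wilson score bound. The function name `wilson_lower_bound` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(successes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound for a binomial success proportion.

    Assumption: each broken link is treated as an independent trial that is
    either repaired correctly (success) or not. This is an illustrative
    model, not necessarily the analysis used in the paper.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    # One-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(confidence)
    p_hat = successes / trials

    denominator = 1 + z * z / trials
    centre = p_hat + z * z / (2 * trials)
    margin = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not the paper's data).
    repaired, sampled = 80, 100
    bound = wilson_lower_bound(repaired, sampled)
    print(f"lower bound on success rate: {bound:.3f}")
    print("supports a 'more than 70%' claim:", bound > 0.70)
```

With a bound of this kind, a statement such as "more than 70%" is supported only when the computed lower bound, not just the observed success rate, exceeds 0.70.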


    Published In

    HT '09: Proceedings of the 20th ACM conference on Hypertext and hypermedia
    June 2009
    410 pages
    ISBN:9781605584867
    DOI:10.1145/1557914
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. broken links
    2. integrity management

    Acceptance Rates

    Overall Acceptance Rate 378 of 1,158 submissions, 33%
