Article (DOI: 10.1145/1135777.1135928)

Robust web content extraction

Published: 23 May 2006

Abstract

We present an empirical evaluation and comparison of two content extraction methods in HTML: absolute XPath expressions and relative XPath expressions. We argue that the relative XPath expressions, although not widely used, should be used in preference to absolute XPath expressions in extracting content from human-created Web documents. Evaluation of robustness covers four thousand queries executed on several hundred webpages. We show that in referencing parts of real world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute XPath ones.
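
To make the distinction in the abstract concrete, the following is a minimal sketch, not the paper's evaluation setup. It assumes a toy, well-formed page and uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. The two documents, the `extract` helper, and both expressions are hypothetical illustrations: the absolute expression addresses the target by its position from the root, while the relative one anchors on nearby content (a "Price:" label).

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a layout change.
ORIGINAL = """<html><body>
<div><h1>Shop</h1></div>
<div><p><span>Price:</span><b>42 USD</b></p></div>
</body></html>"""

# The same page after an unrelated block is inserted above the target.
CHANGED = """<html><body>
<div><h1>Shop</h1></div>
<div><p>Banner ad</p></div>
<div><p><span>Price:</span><b>42 USD</b></p></div>
</body></html>"""

# Absolute style: a fixed chain of steps and positions from the root element.
ABSOLUTE = "./body/div[2]/p/b"

# Relative style: locate an anchor (a <p> whose <span> reads "Price:")
# anywhere in the tree, then step to the value next to it.
RELATIVE = ".//p[span='Price:']/b"


def extract(doc: str, path: str):
    """Return the text of the first node matching path, or None if no match."""
    root = ET.fromstring(doc)
    node = root.find(path)
    return node.text if node is not None else None
```

In this toy case both expressions return the price on the original page, but after the extra block is inserted only the anchored expression still finds it; the positional one now points at the banner block and matches nothing. Real pages change in many other ways, so this shows only one failure mode.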

Published In

WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN: 1595933239
DOI: 10.1145/1135777
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content extraction
  2. evaluation
  3. robustness
  4. wrappers

Qualifiers

  • Article

Conference

WWW06

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
