Article

Robust web content extraction

Authors: Marek Kowalkiewicz, Maria E. Orlowska, Tomasz Kaczmarek, Witold Abramowicz

WWW '06: Proceedings of the 15th international conference on World Wide Web

Pages 887 - 888

https://doi.org/10.1145/1135777.1135928

Published: 23 May 2006

Abstract

We present an empirical evaluation and comparison of two methods for extracting content from HTML documents: absolute XPath expressions and relative XPath expressions. We argue that relative XPath expressions, although not widely used, should be preferred to absolute XPath expressions when extracting content from human-created Web documents. The robustness evaluation covers four thousand queries executed on several hundred webpages. We show that, in referencing parts of real-world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute ones.
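The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the two page versions, the `extract` helper, and both expressions are hypothetical. "Relative" is read here as an expression anchored on a nearby label rather than on the full path from the document root, which is an assumption, since the paper's own expressions are not shown on this page. The snippets are well-formed so that Python's standard `xml.etree.ElementTree` can parse them; real-world HTML would typically need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, before and after a small layout change.
PAGE_V1 = """<html><body>
<div id="header">Site</div>
<div id="content"><table>
<tr><td>Price</td><td>42 USD</td></tr>
</table></div>
</body></html>"""

# Same content, but a banner <div> was inserted before the header.
PAGE_V2 = """<html><body>
<div id="banner">Ad</div>
<div id="header">Site</div>
<div id="content"><table>
<tr><td>Price</td><td>42 USD</td></tr>
</table></div>
</body></html>"""

# Absolute: spells out every step and position from the root element.
ABSOLUTE = "./body/div[2]/table/tr[1]/td[2]"

# Relative (assumed reading): find the row labelled "Price" anywhere in
# the document, then take its second cell.
RELATIVE = ".//tr[td='Price']/td[2]"


def extract(page, xpath):
    """Return the text of the first node matching xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "absolute:", extract(page, ABSOLUTE))
        print(name, "relative:", extract(page, RELATIVE))
```

In this toy case the absolute expression stops matching once the banner shifts the position of the content `div`, while the label-anchored expression still finds the value. A single example of this kind says nothing about average robustness, which is what the paper evaluates empirically.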

Index Terms

Robust web content extraction
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Multi / mixed media creation
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Hypertext / hypermedia

Information & Contributors

Information

Published In

WWW '06: Proceedings of the 15th international conference on World Wide Web

May 2006

1102 pages

ISBN: 1595933239

DOI: 10.1145/1135777

General Chairs: Leslie Carr (University of Southampton), David De Roure (University of Southampton), Arun Iyengar (IBM Research)

Program Chairs: Carole Goble (University of Manchester, UK), Mike Dahlin (University of Texas at Austin)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 May 2006

Qualifiers

Article

Conference

WWW06

Sponsor:

WWW06: The 15th International World Wide Web Conference 2006

May 23 - 26, 2006

Edinburgh, Scotland

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 10
Total Downloads: 478
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 0

Reflects downloads up to 15 Feb 2025

Citations

Cited By

De Luca M, Fasolino A, Tramontana P (2024). Investigating the robustness of locators in template-based Web application testing using a GUI change classification model. Journal of Systems and Software, 210:C. Online publication date: 1-Apr-2024.
https://dl.acm.org/doi/10.1016/j.jss.2023.111932
Singh* T, H. P (2021). Test Case Recording using JavaScript for Automation Testing. International Journal of Recent Technology and Engineering, 10:1 (153-157). Online publication date: 30-May-2021.
https://doi.org/10.35940/ijrte.A5810.0510121
Parameswaran A, Dalvi N, Garcia-Molina H, Rastogi R (2020). Optimal schemes for robust web extraction. Proceedings of the VLDB Endowment, 4:11 (980-991). Online publication date: 3-Jun-2020.
https://doi.org/10.14778/3402707.3402735
Leotta M, Stocco A, Ricca F, Tonella P (2016). Robula+. Journal of Software: Evolution and Process, 28:3 (177-204). Online publication date: 1-Mar-2016.
https://dl.acm.org/doi/10.1002/smr.1771
Leotta M, Stocco A, Ricca F, Tonella P (2014). Reducing Web Test Cases Aging by Means of Robust XPath Locators. Proceedings of the 2014 IEEE International Symposium on Software Reliability Engineering Workshops (449-454). Online publication date: 3-Nov-2014.
https://dl.acm.org/doi/10.1109/ISSREW.2014.17
Sellers A, Sadagopan S, Ramamritham K, Kumar A, Ravindra M, Bertino E, Kumar R (2011). The OXPath to success in the deep web. Proceedings of the 20th international conference companion on World wide web (409-414). Online publication date: 28-Mar-2011.
https://dl.acm.org/doi/10.1145/1963192.1963352
Dalvi N, Bohannon P, Sha F, Çetintemel U, Zdonik S, Kossmann D (2009). Robust web extraction. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data (335-348). Online publication date: 29-Jun-2009.
https://dl.acm.org/doi/10.1145/1559845.1559882
Juffinger A, Granitzer M, Lex E, Tanaka K, Zhou X, Jatowt A (2009). Blog credibility ranking by exploiting verified content. Proceedings of the 3rd workshop on Information credibility on the web (51-58). Online publication date: 20-Apr-2009.
https://dl.acm.org/doi/10.1145/1526993.1527005
Juffinger A, Lex E, Quemada J, León G, Maarek Y, Nejdl W (2009). Crosslanguage blog mining and trend visualisation. Proceedings of the 18th international conference on World wide web (1149-1150). Online publication date: 20-Apr-2009.
https://dl.acm.org/doi/10.1145/1526709.1526901
Janiesch C, Fleischmann K, Dreiling A (2008). Extending Services Delivery with Lightweight Composition. Proceedings of the 2008 international workshops on Web Information Systems Engineering (162-171). Online publication date: 1-Sep-2008.
https://dl.acm.org/doi/10.1007/978-3-540-85200-1_19
