Robust web content extraction
Source International World Wide Web Conference archive
Proceedings of the 15th international conference on World Wide Web
Edinburgh, Scotland
POSTER SESSION: Browsers and UI, web engineering, hypermedia & multimedia, security, and accessibility
Pages: 887-888
Year of Publication: 2006
ISBN: 1-59593-323-9
Authors
Marek Kowalkiewicz, The Poznan University of Economics, Poznan, Poland
Maria E. Orlowska, The University of Queensland, St. Lucia, Australia
Tomasz Kaczmarek, The Poznan University of Economics, Poznan, Poland
Witold Abramowicz, The Poznan University of Economics, Poznan, Poland
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1135777.1135928

ABSTRACT

We present an empirical evaluation and comparison of two content extraction methods in HTML: absolute XPath expressions and relative XPath expressions. We argue that the relative XPath expressions, although not widely used, should be used in preference to absolute XPath expressions in extracting content from human-created Web documents. Evaluation of robustness covers four thousand queries executed on several hundred webpages. We show that in referencing parts of real world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute XPath ones.
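
The contrast the abstract draws can be illustrated with a minimal sketch. It is not the authors' evaluation setup: the two page versions, the `extract` helper, and both path strings are hypothetical, and "relative" is assumed here to mean a path anchored at an identifiable element rather than counted positionally from the document root. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so real-world HTML would need a tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page; the second gains a banner <div>.
PAGE_V1 = """<html><body>
<div id="nav">Menu</div>
<div id="content"><p>Headline: markets rise</p></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div id="banner">Advertisement</div>
<div id="nav">Menu</div>
<div id="content"><p>Headline: markets rise</p></div>
</body></html>"""

# Absolute-style path: every step is positional, counted from the root.
# ElementTree evaluates paths relative to the root element, so this stands in
# for the full XPath expression /html/body/div[2]/p[1].
ABSOLUTE = "./body/div[2]/p[1]"

# Relative-style path (assumed meaning): anchored at an element found by its id.
RELATIVE = ".//div[@id='content']/p[1]"


def extract(page, path):
    """Return the text of the first node matching path, or None if no match."""
    root = ET.fromstring(page)
    node = root.find(path)
    return None if node is None else node.text


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "absolute:", extract(page, ABSOLUTE))
    print(name, "relative:", extract(page, RELATIVE))
```

In this toy case the positional path stops matching once a sibling element is inserted ahead of the target, while the anchored path still finds the paragraph. Whether that pattern holds on real pages is the empirical question the paper evaluates.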


