CETR: content extraction via tag ratios

Authors:

William H. Hsu,

Jiawei Han

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 971 - 980

https://doi.org/10.1145/1772690.1772789

Published: 26 April 2010

Abstract

We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
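
The line-by-line computation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the definition used here is an assumption: the ratio for a line is its count of non-tag, non-whitespace characters divided by its count of tags, with a tag-free line treated as having one tag. The names `tag_ratios` and `split_by_mean` are hypothetical, and the mean threshold is only a simple stand-in for the paper's tailored two-dimensional clustering, which is not reproduced here.

```python
import re

# Assumption: anything between '<' and '>' counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line)
        # Count only non-whitespace characters of the remaining text.
        text_length = len("".join(text.split()))
        # Assumed convention: a line without tags divides by 1.
        ratios.append(text_length / max(tag_count, 1))
    return ratios


def split_by_mean(ratios: list[float]) -> list[bool]:
    """Flag lines whose ratio exceeds the mean (stand-in for clustering)."""
    if not ratios:
        return []
    mean = sum(ratios) / len(ratios)
    return [ratio > mean for ratio in ratios]


if __name__ == "__main__":
    sample = (
        "<div><a href='#'>Home</a></div>\n"
        "<p>This is a long paragraph of article text.</p>\n"
        "<div><a>x</a></div>"
    )
    ratios = tag_ratios(sample)
    for ratio, is_content in zip(ratios, split_by_mean(ratios)):
        print(f"{ratio:6.2f}  {'content' if is_content else 'non-content'}")
```

In this toy input the navigation-style lines have many tags and little text, so their ratios stay low, while the paragraph line has a high ratio and is the only one flagged as content.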

Published In

WWW '10: Proceedings of the 19th international conference on World wide web

April 2010

1407 pages

ISBN: 9781605587998

DOI: 10.1145/1772690

General Chairs:
Michael Rappa, North Carolina State University, USA
Paul Jones, University of North Carolina at Chapel Hill, USA

Program Chairs:
Juliana Freire, University of Utah, USA
Soumen Chakrabarti, Indian Institute of Technology, India

Copyright © 2010 International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '10: The 19th International World Wide Web Conference

April 26 - 30, 2010

Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
