DOI:10.1145/1772690.1772789
research-article

CETR: content extraction via tag ratios

Published: 26 April 2010

Abstract

We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
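
The line-by-line tag ratio computation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the tag ratio of a line is the number of non-tag characters divided by the number of HTML tags on that line (with a tag count of zero treated as one), and it assumes the second dimension is the absolute change in ratio between a line and the next. The abstract does not spell out either definition, the regular expression is a simplification of real HTML parsing, and the function names are hypothetical.

```python
import re

# Simplified tag matcher (an assumption; real HTML needs a proper parser).
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(line: str) -> float:
    """Non-tag characters per tag on one line; zero tags counts as one."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    return text_length / max(tag_count, 1)


def tag_ratios(html: str) -> list[float]:
    """Tag ratio for every line of the document, in order."""
    return [tag_ratio(line) for line in html.splitlines()]


def two_d_points(ratios: list[float]) -> list[tuple[float, float]]:
    """Pair each ratio with the absolute change to the next line's ratio.

    Assumed stand-in for the two-dimensional model; the last line has no
    successor, so its change is taken to be zero.
    """
    points = []
    for i, ratio in enumerate(ratios):
        nxt = ratios[i + 1] if i + 1 < len(ratios) else ratio
        points.append((ratio, abs(nxt - ratio)))
    return points


sample = (
    '<div><a href="/">Home</a><a href="/news">News</a></div>\n'
    "<p>This is a long paragraph of article text with no links at all.</p>\n"
    "<div><span>Ad</span></div>"
)

ratios = tag_ratios(sample)
points = two_d_points(ratios)
```

On the sample, the paragraph line gets a much higher ratio than the navigation and advertisement lines, which is the signal a clustering step would use to separate content from non-content.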

Published In

WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN:9781605587998
DOI:10.1145/1772690

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content extraction
  2. tag ratio
  3. world wide web

Qualifiers

  • Research-article

Conference

WWW '10
WWW '10: The 19th International World Wide Web Conference
April 26 - 30, 2010
Raleigh, North Carolina, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
