DOI:10.1145/1526709.1526909
poster

A densitometric analysis of web template content

Published: 20 April 2009

Abstract

What makes template content in the Web so special that we need to remove it? In this paper I present a large-scale aggregate analysis of textual Web content, corroborating statistical laws from the field of Quantitative Linguistics. I analyze the idiosyncrasy of template content compared to regular "full text" content and derive a simple yet suitable quantitative model.

Published In

WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN:9781605584874
DOI:10.1145/1526709

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content analysis
  2. noise removal
  3. template detection
  4. template removal
  5. web page segmentation

Qualifiers

  • Poster

Conference

WWW '09

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

  • (2023) An unsupervised perplexity-based method for boilerplate removal. Natural Language Engineering, pp. 1-18. DOI: 10.1017/S1351324923000049. Online publication date: 21-Feb-2023
  • (2021) Page-Level Main Content Extraction From Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data 15:6, pp. 1-105. DOI: 10.1145/3451168. Online publication date: 28-Jun-2021
  • (2019) What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Transactions on the Web 13:2, pp. 1-19. DOI: 10.1145/3316810. Online publication date: 27-Mar-2019
  • (2019) Web-AM: An Efficient Boilerplate Removal Algorithm for Web Articles. 2019 International Conference on Frontiers of Information Technology (FIT), pp. 287-2875. DOI: 10.1109/FIT47737.2019.00061. Online publication date: Dec-2019
  • (2019) An effective and efficient Web content extractor for optimizing the crawling process. Software—Practice & Experience 44:10, pp. 1181-1199. DOI: 10.1002/spe.2195. Online publication date: 4-Jan-2019
  • (2018) Web2Text: Deep Structured Boilerplate Removal. Advances in Information Retrieval, pp. 167-179. DOI: 10.1007/978-3-319-76941-7_13. Online publication date: 1-Mar-2018
  • (2018) Main Content Extraction from Heterogeneous Webpages. Web Information Systems Engineering – WISE 2018, pp. 393-407. DOI: 10.1007/978-3-030-02922-7_27. Online publication date: 20-Oct-2018
  • (2017) Webpage Menu Detection Based on DOM. SOFSEM 2017: Theory and Practice of Computer Science, pp. 411-422. DOI: 10.1007/978-3-319-51963-0_32. Online publication date: 11-Jan-2017
  • (2016) Site-Level Web Template Extraction Based on DOM Analysis. Perspectives of System Informatics, pp. 36-49. DOI: 10.1007/978-3-319-41579-6_4. Online publication date: 28-Jun-2016
  • (2015) Web Template Extraction Based on Hyperlink Analysis. Electronic Proceedings in Theoretical Computer Science 173, pp. 16-26. DOI: 10.4204/EPTCS.173.2. Online publication date: 8-Jan-2015
