DOI: 10.1145/1076034.1076079
Article

Title extraction from bodies of HTML documents and its application to web page retrieval

Published: 15 August 2005

Abstract

This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in reality, however, HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents, an issue that does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to the problem. We propose a specification for HTML titles and use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
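
The baseline named in the abstract (taking the lines in the largest font size as the title) and the F1 measure can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Line` record, the function names, and the exact-match, set-based F1 are all assumptions made for illustration, and the abstract does not state how the paper matches extracted titles against annotated ones.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One rendered text line from an HTML body (hypothetical representation)."""
    text: str
    font_size: float   # unit assumed consistent within a page
    bold: bool = False  # font weight, one of the format features the abstract lists
    position: int = 0   # 0-based order of the line in the body


def largest_font_baseline(lines):
    """Return the text of every line set in the largest font size, in document order."""
    if not lines:
        return []
    top = max(line.font_size for line in lines)
    return [line.text for line in lines if line.font_size == top]


def f1_score(predicted, gold):
    """F1 between two collections of title strings (assumed: exact match, set-based)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy page: a navigation bar, a large heading, and a byline.
page = [
    Line("Home | Products | Contact", 10.0, position=0),
    Line("Annual Report 2004", 24.0, bold=True, position=1),
    Line("Prepared by the finance team", 12.0, position=2),
]
titles = largest_font_baseline(page)
print(titles)
print(f1_score(titles, ["Annual Report 2004"]))
```

A learned extractor of the kind the abstract describes would presumably replace `largest_font_baseline` with a classifier over per-line features such as `font_size`, `bold`, and `position`; the sketch keeps those fields only to show what such features might look like.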

Published In

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN: 1595930345
DOI: 10.1145/1076034

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. HTML document
  2. information retrieval
  3. metadata extraction

Qualifiers

  • Article

Conference

SIGIR05

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2023) Field features: The impact in learning to rank approaches. Applied Soft Computing 138 (110183). DOI: 10.1016/j.asoc.2023.110183. Online publication date: May-2023
  • (2020) Automatically Discovering Relevant Images From Web Pages. IEEE Access (1-1). DOI: 10.1109/ACCESS.2020.3039044. Online publication date: 2020
  • (2019) Unsupervised Keyphrase Extraction for Web Pages. Multimodal Technologies and Interaction 3:3 (58). DOI: 10.3390/mti3030058. Online publication date: 31-Jul-2019
  • (2019) Inferring Structure and Meaning of Semi-Structured Documents by using a Gibbs Sampling Based Approach. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (169-174). DOI: 10.1109/ICDARW.2019.40100. Online publication date: Sep-2019
  • (2017) Using linguistic features to automatically extract web page title. Expert Systems with Applications: An International Journal 79:C (296-312). DOI: 10.1016/j.eswa.2017.02.045. Online publication date: 15-Aug-2017
  • (2017) A learning framework for information block search based on probabilistic graphical models and Fisher Kernel. International Journal of Machine Learning and Cybernetics 9:9 (1473-1487). DOI: 10.1007/s13042-017-0657-9. Online publication date: 28-Mar-2017
  • (2015) Automatic Web Content Extraction by Combination of Learning and Grouping. Proceedings of the 24th International Conference on World Wide Web (1264-1274). DOI: 10.1145/2736277.2741659. Online publication date: 18-May-2015
  • (2015) Determining the Relative Importance of Webpages Based on Social Signals Using the Social Score and the Potential Role of the Social Score in an Asynchronous Social Search Engine. Knowledge Discovery, Knowledge Engineering and Knowledge Management (118-131). DOI: 10.1007/978-3-319-25840-9_8. Online publication date: 28-Oct-2015
  • (2014) How can catchy titles be generated without loss of informativeness? Expert Systems with Applications: An International Journal 41:4 (1051-1062). DOI: 10.1016/j.eswa.2013.07.102. Online publication date: 1-Mar-2014
  • (2014) Unsupervised Analysis of Web Page Semantic Structures by Hierarchical Bayesian Modeling. Advances in Knowledge Discovery and Data Mining (572-583). DOI: 10.1007/978-3-319-06605-9_47. Online publication date: 2014