Article

Title extraction from bodies of HTML documents and its application to web page retrieval

Authors:
Yunhua Hu, Xi'an Jiaotong University, Xi'an, China
Guomao Xin, Peking University, Beijing, China
Ruihua Song, Microsoft Research Asia, Beijing, China
Guoping Hu, University of Science and Technology of China, Hefei, China
Shuming Shi, University of Science and Technology of China, Hefei, China
Yunbo Cao, University of Science and Technology of China, Hefei, China
Hang Li, University of Science and Technology of China, Hefei, China

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 2005, Pages 250–257. https://doi.org/10.1145/1076034.1076079

Published: 15 August 2005

ABSTRACT

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in reality, however, HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents, a problem that does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to the problem. We propose a specification for HTML titles. We use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
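
To make the baseline and the format cues in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the page body has already been split into text units with pre-computed format attributes; the TextUnit class, its fields, the format_features function, and the sample page are all hypothetical names and data introduced here for illustration. The abstract does not give the actual feature set or the learning model, so the feature vector below only shows one plausible encoding of font size, position, and font weight.

```python
from dataclasses import dataclass


@dataclass
class TextUnit:
    """A line of body text with hypothetical, pre-computed format attributes."""
    text: str
    font_size: int   # rendered font size, arbitrary units (assumption)
    position: int    # 0-based order of the unit in the page body (assumption)
    bold: bool       # True if the unit is rendered in a heavy font weight


def baseline_title(units):
    """Baseline named in the abstract: the lines in the largest font size."""
    if not units:
        return ""
    largest = max(u.font_size for u in units)
    return " ".join(u.text for u in units if u.font_size == largest)


def format_features(unit, units):
    """Illustrative feature vector from the three cues named in the abstract."""
    largest = max(u.font_size for u in units)
    return [
        unit.font_size / largest,          # font size relative to the largest
        1.0 - unit.position / len(units),  # earlier units score higher
        1.0 if unit.bold else 0.0,         # font weight
    ]


# Made-up sample page, for illustration only.
page = [
    TextUnit("Home | Products | Contact", 10, 0, False),
    TextUnit("Annual Report 2004", 24, 1, True),
    TextUnit("This report summarizes the year.", 12, 2, False),
]

print(baseline_title(page))
print(format_features(page[1], page))
```

In a supervised setting, vectors like these would be paired with human title labels and passed to a classifier. The baseline needs no training, which is presumably why it serves as the point of comparison.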

Published in
SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN: 1595930345
DOI: 10.1145/1076034
General Chairs: Ricardo Baeza-Yates (University of Chile, Chile), Nivio Ziviani (Federal University of Minas Gerais, Brazil)
Program Chairs: Gary Marchionini (University of North Carolina, USA), Alistair Moffat (University of Melbourne, Australia), John Tait (University of Sunderland, UK)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
HTML document
information retrieval
metadata extraction
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%